The L2F Spoken Web Search system for Mediaeval 2013

Alberto Abad
INESC-ID Lisboa / Instituto Superior Técnico
alberto@l2f.inesc-id.pt

Ramón F. Astudillo
INESC-ID Lisboa
ramon@l2f.inesc-id.pt

Isabel Trancoso
INESC-ID Lisboa / Instituto Superior Técnico
imt@l2f.inesc-id.pt

Copyright is held by the author/owner(s).
Mediaeval 2013 Workshop, October 18-19 2013, Barcelona, Spain

ABSTRACT
The INESC-ID’s Spoken Language Systems Laboratory (L2F) primary system developed for the Spoken Web Search task of the Mediaeval 2013 evaluation campaign consists of the fusion of six individual sub-systems exploiting three different language-dependent phonetic classifiers. For each phonetic classifier, an acoustic keyword spotting (AKWS) sub-system based on connectionist speech recognition and a dynamic time warping (DTW) based sub-system have been developed. The diversity in terms of phonetic classifiers and methods, together with the efficient fusion and calibration approach applied for heterogeneous sub-systems, are the key elements of the L2F submission. Besides the primary submission, two additional systems based on the fusion of only the DTW sub-systems and of only the AKWS sub-systems have been developed for comparison purposes. A final multi-site system formed by the fusion of the L2F and the GTTS primary submissions has also been submitted to explore the potential of the fusion approach for very heterogeneous systems.

1.    INTRODUCTION
This document introduces the Spoken Web Search systems developed by the INESC-ID’s Spoken Language Systems Laboratory (L2F) for the Mediaeval 2013 campaign. The targeted task in this challenge is query-by-example spoken term detection. Detailed information about the task and the data used can be found in the evaluation plan [5]. One primary and three contrastive systems (one of them in collaboration with another participating group) have been submitted. The primary system consists of the fusion of six individual sub-systems. The proposed systems present three main novelties with respect to the systems developed for the previous year's evaluation campaign [1]: 1) the number of language-dependent phonetic networks has been limited to three; 2) DTW-based sub-systems exploiting log-posterior features have been incorporated; and 3) a recently proposed method for discriminative calibration and fusion of heterogeneous spoken term detection systems [4] has been applied.

2.    THE L2F SWS SYSTEM DESCRIPTION
Six sub-systems form the core of the L2F SWS system, exploiting three different language-dependent phonetic networks trained for European Portuguese (pt), Brazilian Portuguese (br) and European Spanish (es). The phonetic networks are used either as acoustic models in acoustic KWS based on hybrid connectionist methods or as a feature extraction component for DTW based term detection.

2.1   Phonetic network classifiers
L2F systems exploit multi-layer perceptron (MLP) networks that are part of our in-house hybrid connectionist ASR system. The phonetic class posterior probabilities are in fact the result of the combination of four MLP outputs trained with Perceptual Linear Prediction features (PLP, 13 static + first derivative), PLP with log-RelAtive SpecTrAl speech processing features (PLP-RASTA, 13 static + first derivative), Modulation SpectroGram features (MSG, 28 static) and Advanced Front-End from ETSI features (ETSI, 13 static + first and second derivatives). The language-dependent MLP networks were trained using different amounts of annotated data [2]. Each MLP network is characterized by the size of its input layer, which depends on the particular parametrization and the frame context size (13 for PLP, PLP-RASTA and ETSI; 15 for MSG), the number of units of the two hidden layers (500), and the size of the output layer. In this case, only monophone units are modelled, resulting in MLP networks of 39 (38 phonemes + 1 silence) soft-max outputs in the case of pt, 40 for br (39 phonemes + 1 silence) and 30 for es (29 phonemes + 1 silence).

2.2   Acoustic KWS systems
AKWS sub-systems exploit the phonetic networks as acoustic models for both phonetic tokenization and query search based on hybrid ANN/HMM approaches for ASR [6]. The decoder used is based on a weighted finite-state transducer (WFST) approach to large vocabulary speech recognition. First, the phonetic transcription of each spoken query is obtained for every sub-system using a phone-loop grammar. Simple 1-best phoneme chain output has been used. Then, search is carried out with a sliding window of 5 seconds (2.5 seconds time shift) using an equally-likely 1-gram language model formed by the target query and a competing speech background model. On the one hand, keyword/query models are described by the sequence of phonetic units obtained in the tokenization. On the other hand, the likelihood of a background speech unit representing “general speech” is estimated based on the other phonetic classes [3]. The output score for each candidate detection is computed as the average of the phonetic log-likelihood ratios that form the detected query term. More details can be found in [1].

2.3   Dynamic Time Warping systems
DTW sub-systems use the language-dependent phonetic networks to extract log-posterior features. The silence class
of the phonetic network is also used for voice activity detection. To this end, the segments identified as silence at the beginning and end of each query and document are removed. For each query-document pair, N Euclidean distance based DTWs are run on N starting candidate positions of the document. To select the candidate positions, the query-document Euclidean distance matrix of the DTW is used. The minimum of each column of the matrix represents the minimum distance among all query feature vectors to a given document feature vector. The average of these minima on a sliding window of query size is used as an approximation of DTW without the warping constraints, from which the best N candidates are selected. The number of candidates N was made equal to the length of the document in feature vectors divided by 100, with a minimum of 100 candidates. In a second stage, DTWs of the size of the query are evaluated at each one of the N candidate positions, and the three candidates with the lowest normalized cumulative distance that are separated by at least 0.5 seconds are kept. The reduction of the search space to N candidates as explained above reduced the search time by a factor of around 5, while having a minimal impact on the performance. It should be noted that the DTW, including the distance matrix, was computed using the R programming language, while the candidate selection and remaining tasks were implemented in Python (https://www.l2f.inesc-id.pt/wiki/index.php/DTW). This framework benefited particularly from the candidate selection scheme proposed.

2.4   Discriminative calibration and fusion
The combination of systems is based on a recently proposed method for discriminative calibration/fusion of heterogeneous spoken term detection (STD) systems [4] (https://www.l2f.inesc-id.pt/wiki/index.php/STDfusion). Under this approach, missing scores for systems that do not detect a given candidate are hypothesized based on heuristics. In this way, the original problem of several unaligned detection candidates is converted into a verification task. As for other verification tasks, system weights and offsets are then estimated through linear logistic regression. As a result, the combined scores are well calibrated, and the detection threshold is automatically given by application parameters (priors and costs). The method permits easy integration with majority voting schemes, and it is convenient if scores from heterogeneous systems are in the same range (we apply a per-query zero-mean and unit-variance normalization, q-norm [1]). Moreover, the maximum number of detection candidates for a certain query provided by any sub-system was limited to 200 before score normalization and fusion.

3.    SUBMITTED SYSTEMS AND RESULTS
One primary and two contrastive “on-time” systems were submitted. The primary system consists of the fusion of the six sub-systems previously described, while the contrastive1 and contrastive2 submissions correspond to the fusion of only the DTW and only the AKWS sub-systems, respectively. Additionally, a “late” contrastive3 system based on the fusion of the primary systems of the L2F and GTTS [8] teams was also submitted. All the submitted systems are expected to generate well-calibrated log-likelihood ratios, such that the theoretical minimum expected cost Bayes threshold can be used (θBayes = log β, see [4] for more details).

Table 1: L2F SWS2013 performance scores

System          dev mtwv   dev atwv   eval mtwv   eval atwv
primary         0.3905     0.3883     0.3420      0.3376
contrastive1    0.3205     0.3071     0.2515      0.2364
contrastive2    0.2753     0.2743     0.2463      0.2459
contrastive3    0.4865     0.4850     0.4658      0.4639

Table 1 shows the actual and maximum TWV official scores obtained by the L2F SWS systems for the two query sets: dev and eval. Notice that the theoretical Bayes threshold has been used in both dev and eval experiments. It is worth noticing the remarkable performance improvements when very heterogeneous (from different sites) systems are combined, as in the case of the contrastive3 system. Regarding the amount of processing resources, we have used a cluster of machines with 90 nodes. The estimated cost figures [7] are pessimistic since the cluster was not exclusively used for the challenge. For each AKWS sub-system, the indexing speed factor (ISF), searching speed factor (SSF), maximum memory indexing (MMI) and maximum memory searching (MMS) values are 0.75, 77.33, 0.17 GBytes and 0.073 GBytes, respectively. For the DTW based sub-systems, the ISF, SSF, MMI and MMS are 0.17, 193.34, 0.18 GBytes and 0.43 GBytes, respectively. Considering these values, the total processing load (PL) is 239.76, that is, three times the sum of the PL of the AKWS (5.09) and DTW (74.83) sub-systems: 3 × (5.09 + 74.83) = 239.76.

4.    ACKNOWLEDGEMENTS
This work was partially funded by the DIRHA European project (FP7-ICT-2011-7-288121) and the Portuguese Foundation for Science and Technology (FCT), through the projects PEst-OE/EEI/LA0021/2013 and PTDC/EIA-CCO/122542/2010, and the grant number SFRH/BPD/68428/2010.

5.    REFERENCES
[1] A. Abad and R. F. Astudillo. The L2F Spoken Web Search system for Mediaeval 2012. In MediaEval 2012 Workshop, Pisa, Italy, October 4-5 2012.
[2] A. Abad, J. Luque, and I. Trancoso. Parallel Transformation Network features for Speaker Recognition. In ICASSP, May 2011.
[3] A. Abad, A. Pompili, A. Costa, and I. Trancoso. Automatic word naming recognition for treatment and assessment of aphasia. In Interspeech 2012, Sep 2012.
[4] A. Abad, L. J. Rodriguez Fuentes, M. Penagarikano, A. Varona, M. Diez, and G. Bordel. On the Calibration and Fusion of Heterogeneous Spoken Term Detection Systems. In Interspeech 2013, August 25-29 2013.
[5] X. Anguera, F. Metze, A. Buso, I. Szoke, and L. J. Rodriguez-Fuentes. The Spoken Web Search Task. In MediaEval 2013 Workshop, October 18-19 2013.
[6] N. Morgan and H. Bourlard. An introduction to hybrid HMM/connectionist continuous speech recognition. IEEE Signal Processing Magazine, 12(3):25–42, 1995.
[7] L. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 Spoken Web Search Task: System Performance Measures. Technical report, 2013.
[8] L. J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez. GTTS Systems for the SWS Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
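
As an illustration of the candidate selection scheme described in Section 2.3, the following minimal Python sketch computes the column minima of the query-document Euclidean distance matrix, averages them over a sliding window of query length, and keeps the best N start positions. It is not the code used for the submission. The function names, the integer division used to obtain N, and the 50-frame gap (0.5 seconds under an assumed 10 ms frame shift) are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidates(query, document, divisor=100, min_candidates=100):
    """Return candidate start frames, best first.

    The score of a start frame is the average, over a window of query
    length, of the column minima of the distance matrix (an approximation
    of DTW without warping constraints).
    """
    q_len, d_len = len(query), len(document)
    if q_len == 0 or d_len < q_len:
        return []
    # Minimum distance from any query frame to each document frame.
    col_min = [min(euclidean(q, d) for q in query) for d in document]
    # Sliding-window average of the column minima.
    scores = [sum(col_min[s:s + q_len]) / q_len
              for s in range(d_len - q_len + 1)]
    # N = document length / 100, with a minimum of 100 candidates
    # (integer division is an assumption of this sketch).
    n = max(min_candidates, d_len // divisor)
    ranked = sorted(range(len(scores)), key=scores.__getitem__)
    return ranked[:n]


def keep_best_separated(scored, top_k=3, min_gap_frames=50):
    """Keep the top_k (start, distance) pairs with the lowest distance
    whose start frames are at least min_gap_frames apart."""
    kept = []
    for start, dist in sorted(scored, key=lambda pair: pair[1]):
        if all(abs(start - s) >= min_gap_frames for s, _ in kept):
            kept.append((start, dist))
        if len(kept) == top_k:
            break
    return kept
```

In this sketch, `select_candidates` plays the role of the first stage, and `keep_best_separated` would be applied to the normalized cumulative DTW distances obtained in the second stage.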
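
The score handling described in Sections 2.4 and 3 can be sketched in the same hedged way: per-query normalization (q-norm), a linear combination of sub-system scores with weights and an offset, and a decision against the Bayes threshold log β. The weights, offset and β below are hypothetical placeholders; in the actual method [4] the weights and offset are estimated by linear logistic regression and β follows from the application priors and costs. The use of the population standard deviation in `q_norm` is also an assumption.

```python
import math
from statistics import mean, pstdev


def q_norm(scores):
    """Zero-mean, unit-variance normalization of the scores of one query."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0.0:
        # All scores are identical: nothing to scale.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse(sub_scores, weights, offset):
    """Linear fusion: offset plus the weighted sum of sub-system scores."""
    return offset + sum(w * s for w, s in zip(weights, sub_scores))


def bayes_threshold(beta):
    """Theoretical minimum expected cost threshold, log(beta)."""
    return math.log(beta)


def decide(llr, beta):
    """Accept a detection when its fused score reaches the threshold."""
    return llr >= bayes_threshold(beta)
```

For example, with the hypothetical values `weights=[0.5, 0.25]` and `offset=0.1`, `fuse([1.0, -1.0], weights, offset)` yields a fused score that `decide` then compares with log β.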