           Irisa MediaEval 2011 Spoken Web Search System

Armando Muscariello, Irisa/Inria, Rennes, France (amuscari@irisa.fr)
Guillaume Gravier, Irisa/CNRS, Rennes, France (ggravier@irisa.fr)



ABSTRACT
These working notes describe the main aspects of the IRISA submission to the Spoken Web Search task at the MediaEval 2011 campaign. We test a language-independent, audio-only system based on a combination of template matching techniques. A brief overview of the main components of the architecture is followed by a report on the evaluation on the development and test data provided by the organizers.

Categories and Subject Descriptors
H.3.3 [Information Systems Applications]: Spoken Term Detection—zero-resource speech processing, template matching, posteriorgrams

1.    MOTIVATION
In [1] we have recently proposed a zero-resource, audio-only system for spoken term detection (STD), i.e. a system for performing keyword spotting at the acoustic level, in the absence of any language- or domain-specific knowledge, training speech data or models. The main motivation behind our participation in the campaign is the opportunity to benchmark the system on a different, more challenging data set [2], and to learn about alternative solutions and their respective performance.

2.    ARCHITECTURE OF THE SYSTEM
The STD system relies on two main components: the acoustic features that represent queries and utterances, and the pattern matching techniques that identify occurrences of the queries within the utterances and provide the respective measure of (dis)similarity.

2.1    Acoustic features
We have experimented with different types of speech parametrization, namely MFCC features and several types of posteriorgrams, that is: 1) posteriors estimated from a Gaussian mixture model (GMM) trained in an unsupervised fashion on the development data provided for the task [2], and 2) posteriors output by the language-specific BUT phoneme recognizer [3], trained independently for each language (Czech, Hungarian, Russian) on 8 kHz telephone data.

We have used the Euclidean distance to compute the pairwise distance between feature frames, and −log(p · q) as a distance-like measure of closeness between two posterior vectors p and q.
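
The two ingredients above can be made concrete with a short sketch. The sketch below is only an illustration under our own assumptions: numpy and scikit-learn are available, MFCC frames have already been extracted as row vectors, the number of Gaussian components and the small constant eps are arbitrary illustrative choices, and none of the function names come from the actual submission.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(train_frames, n_components=64, seed=0):
    # Unsupervised GMM fitted on the frames of the development data.
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=seed).fit(train_frames)

def posteriorgram(gmm, frames):
    # One posterior vector per frame: P(component | frame),
    # giving an array of shape (n_frames, n_components).
    return gmm.predict_proba(frames)

def euclidean_distance(x, y):
    # Distance between two MFCC frames.
    return np.linalg.norm(x - y)

def posterior_distance(p, q, eps=1e-10):
    # Distance-like measure -log(p . q) between two posterior vectors.
    return -np.log(np.dot(p, q) + eps)

def distance_matrix(query, utterance, dist):
    # Frame-by-frame distance matrix between a query and an utterance.
    return np.array([[dist(x, y) for y in utterance] for x in query])

A distance matrix of this kind, computed between a query and an utterance, is the input of the pattern matching stage described next.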

2.2    Pattern matching combination
The search for an occurrence of the query within the utterance is performed directly on the feature sequences by a cascade of two different pattern matching techniques. A segmental variant of DTW, named segmental locally-normalized dynamic time warping (SLNDTW), is responsible for selecting the subsegment of the utterance most similar to the searched query, according to a DTW score D_DTW. This score can be used directly to decide upon the similarity of the two segments, or refined by the use of additional scores. In our system, the two candidate keyword occurrences are further subjected to a comparison of their respective self-similarity matrices (SSMs), and the two SSM scores, D'_SSM and D''_SSM, resulting from this comparison are then combined with D_DTW to obtain a unique dissimilarity score S (see Figure 1).

[Figure 1 shows four panels: the query-vs-utterance distance matrix with the DTW best path, the SSM of the query, the SSM of the matching subsegment in the utterance, and the comparison of the two SSMs.]
Figure 1: Example of combined use of DTW and SSM-based comparisons for similarity scoring of templates.
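
To illustrate the first stage, the following is a minimal subsequence-DTW sketch that scans an utterance for the subsegment best matching the query and returns a length-normalized score together with the segment boundaries. It is a generic illustration under our own assumptions (numpy, a frame distance function dist such as those of the previous sketch, normalization by the query length); it is not the exact SLNDTW algorithm of [1].

import numpy as np

def subsequence_dtw(query, utterance, dist):
    # Find the subsegment of `utterance` best matching `query`.
    # Returns (score, start, end): a length-normalized DTW score and the
    # frame boundaries of the best-matching subsegment.
    n, m = len(query), len(utterance)
    D = np.full((n, m), np.inf)        # accumulated alignment cost
    S = np.zeros((n, m), dtype=int)    # utterance frame where each path started

    for j in range(m):                 # free start: the match may begin anywhere
        D[0, j] = dist(query[0], utterance[j])
        S[0, j] = j

    for i in range(1, n):
        for j in range(m):
            steps = [(D[i - 1, j], S[i - 1, j])]                   # vertical step
            if j > 0:
                steps.append((D[i, j - 1], S[i, j - 1]))           # horizontal step
                steps.append((D[i - 1, j - 1], S[i - 1, j - 1]))   # diagonal step
            cost, start = min(steps, key=lambda s: s[0])
            D[i, j] = cost + dist(query[i], utterance[j])
            S[i, j] = start

    end = int(np.argmin(D[n - 1]))     # free end: best alignment of the last query frame
    return D[n - 1, end] / n, S[n - 1, end], end

With MFCC features, dist would be euclidean_distance; with posteriorgrams, posterior_distance. The returned boundaries delimit the candidate occurrence whose SSM is then compared with that of the query.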

The global score S is computed as:

    S = \alpha_{DTW} \cdot \frac{D_{DTW}}{th_{DTW}} + \alpha'_{SSM} \cdot \frac{D'_{SSM}}{th'_{SSM}} + \alpha''_{SSM} \cdot \frac{D''_{SSM}}{th''_{SSM}}    (1)

so that S < 1 implies the detection of a match.
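
As a small illustration of Eq. (1) and of the decision rule, the sketch below builds the SSM of a segment and fuses the three scores. The weights default to the values reported in Section 3; the threshold arguments are placeholders, since the tuned values are not given in these notes, and the actual comparison of the two SSMs that produces D'_SSM and D''_SSM follows [1] and is not reproduced here.

def self_similarity_matrix(segment, dist):
    # SSM of a segment: pairwise distances between its own frames
    # (reusing distance_matrix from the feature sketch above).
    return distance_matrix(segment, segment, dist)

def combined_score(d_dtw, d_ssm1, d_ssm2, th_dtw, th_ssm1, th_ssm2,
                   a_dtw=0.50, a_ssm1=0.20, a_ssm2=0.30):
    # Eq. (1): each score is normalized by its tuned threshold and weighted.
    return (a_dtw * d_dtw / th_dtw
            + a_ssm1 * d_ssm1 / th_ssm1
            + a_ssm2 * d_ssm2 / th_ssm2)

def is_match(d_dtw, d_ssm1, d_ssm2, th_dtw, th_ssm1, th_ssm2):
    # A candidate occurrence is accepted as a detection when S < 1.
    return combined_score(d_dtw, d_ssm1, d_ssm2, th_dtw, th_ssm1, th_ssm2) < 1.0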

Table 1: Development queries on development utterances: results

DTW+SSM    MFCC    GMM       HU      CZ      RU
P(FA)      0       0.0003    0.02    0.02    0.016
P(Mis)     0.82    0.77      0.66    0.66    0.70
AWTV       0.18    -0.10     -19.9   -20.7   -15.9
MAP (%)    0.26    6.61      1.10    0.82    0.72

3.   SYSTEM TUNING
The data set described in [2] is particularly challenging for such a system: it is of 8 kHz telephone quality, presents portions of silence in the queries, and exhibits a large pronunciation variability due to non-native English speakers. We have preliminarily removed silences from both the development and the evaluation queries by means of a speech detector. The thresholds th_DTW, th'_SSM and th''_SSM have been tuned on word samples from a different data set (see [1]), and the pattern matching weights have been set to α_DTW = 0.50, α'_SSM = 0.20, α''_SSM = 0.30 following [1]. Despite the availability of the ground truth for the development data set, reliable tuning of the thresholds on this data has not been successful, as many true hits exhibit a dissimilarity score higher than false alarms. This highlights the poor discriminative properties of the employed features in this task.

The results for the different features are shown in Table 1 for the system jointly employing the DTW and SSM-based comparisons, using the following metrics: P(FA), the average false alarm rate; P(Mis), the average false rejection rate; the average weighted term value AWTV (the primary performance indicator); and the mean average precision MAP. The posterior features estimated by the BUT recognizer are the least performing according to the AWTV, as their P(FA), weighted by a factor β = 1000, is greater by orders of magnitude than the P(FA) for the MFCC and GMM features. Gaussian posteriorgrams yield the highest MAP value among the features tested, although it is very disappointing compared to the values reported for this same system and features in the evaluation conducted in [1]. While yielding the highest miss detection rate P(Mis), the raw MFCC features report the best AWTV, as no false alarm has been collected. According to this metric, the MFCC-based system has been selected as the primary one.
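
For reference, the sketch below computes a term-weighted value of the kind reported here, assuming the usual NIST-style definition TWV = 1 - (P(Mis) + β · P(FA)) averaged over the query terms, with β = 1000 as mentioned above; the official scoring tool of the evaluation is not reproduced. With the figures of Table 1 this is consistent with, for example, the MFCC value 1 - 0.82 = 0.18.

def term_weighted_value(p_miss, p_fa, beta=1000.0):
    # TWV for a single term: 1 - (miss rate + beta * false-alarm rate).
    return 1.0 - (p_miss + beta * p_fa)

def average_twv(per_term_rates, beta=1000.0):
    # Average of the per-term TWV over all query terms;
    # `per_term_rates` is a list of (p_miss, p_fa) pairs.
    values = [term_weighted_value(pm, pfa, beta) for pm, pfa in per_term_rates]
    return sum(values) / len(values)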

4.   RESULTS ON EVALUATION DATA
The results of the evaluation of the system on the test data are summarized in Table 2 for the primary runs and in Table 3 for the secondary runs, where Gaussian posteriorgrams have been used. Not surprisingly, the figures substantially reflect the poor results of the experiments on the development data set. The system operates in a completely unsupervised fashion and the knowledge of the performance on the development data is not exploited in any way, and therefore bears no impact on the results. Indeed, the only parameters that needed to be tuned were estimated on a different data set.

Table 2: Evaluation runs: primary system

DTW+SSM    DEV-EVAL    EVAL-EVAL    EVAL-DEV
P(FA)      0.0003      0.00007      0.00006
P(Mis)     0.999       0.831        0.962
AWTV       -0.29       0.10         -0.022

Table 3: Evaluation runs: secondary system

DTW+SSM    DEV-EVAL    EVAL-EVAL    EVAL-DEV
P(FA)      0.00019     0.00013      0.00017
P(Mis)     0.97        0.788        0.97
AWTV       -0.17       -0.10        -0.14

It is worth noting that searching for the evaluation queries on the evaluation utterances performs better than conducting a cross-dataset spoken term detection, which is likely due to the limited variability among patterns extracted from the same set.

5.   CONCLUSION
The IRISA architecture for spoken term detection, presented in [1], was evaluated on the data set provided by the MediaEval 2011 Spoken Web Search task. This data set has proven extremely challenging for the system in its current form, yielding poor results for all types of acoustic features employed. For this particular data set, given the presence of many English keywords, training a phone recognizer based on English phone models would likely have improved performance, although our team did not have such training data at its disposal (indeed, this is one of the reasons why pursuing research on zero-resource systems would benefit the community). One possible idea is to combine posteriors from different recognizers to increase robustness to multiple languages, although in this specific case the results for the Hungarian, Czech and Russian-based posteriorgrams were bad enough to prevent any satisfying application of this solution. Also, the Gaussian posteriors were only estimated from models trained on the development utterances; performance could have been, at least slightly, improved by training the GMM on the combined development-evaluation data set, in particular for the cross-data detection that yielded the poorest results.

6.   REFERENCES
[1] A. Muscariello, G. Gravier, and F. Bimbot. Zero-resource audio-only spoken term detection based on a combination of template matching techniques. In Interspeech, 2011.
[2] N. Rajput and F. Metze. Spoken web search. In MediaEval Workshop, 2011.
[3] P. Schwarz, P. Matějka, and J. Černocký. Towards lower error rates in phoneme recognition. In International Conference on Text, Speech and Dialogue, 2004.

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy