Telefonica System for the Spoken Web Search Task at MediaEval 2011

Xavier Anguera
Telefonica Research
Torre Telefonica-Diagonal 00
08019 Barcelona, Spain
xanguera@tid.es

ABSTRACT
This working paper describes the system proposed by Telefonica Research for the Spoken Web Search task within the MediaEval 2011 benchmarking evaluation campaign. The proposed system is based exclusively on a pattern-matching approach, which is able to perform a query-by-example search with no prior knowledge of the acoustics or the language being spoken. The system's main contribution is the use of a novel method to obtain speaker-independent acoustic features, which are then matched with a DTW-like algorithm. The results obtained are promising and show, in our opinion, the potential of this class of techniques for the task.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Miscellaneous

General Terms
Algorithms, Performance, Experimentation

Keywords
Pattern matching, query-by-example, spoken query, search

1. INTRODUCTION
The objective of the Spoken Web Search task is to search for a given audio query within a set of audio content; for a detailed explanation refer to [4]. The audio content in this particular evaluation contains phone call excerpts recorded in 4 different languages within the World Wide Telecom Web project [3] conducted by IBM. The system we propose to tackle this task is based on audio pattern matching between the query and the audio content to retrieve putative matches.

2. SYSTEM DESCRIPTION
The proposed system can be split into two main blocks: the acoustic feature extraction and the query search. The goal of the acoustic feature extraction is to obtain features that contain information about what has been said while being speaker independent, so that the system is able to recognize two instances of the same spoken word even if they were spoken by different speakers. The query search module searches for every query over all the acoustic material to identify whether (and where) the query appears. Transversal to both modules, we apply a simple silence detection algorithm to eliminate long silences in the queries and in the audio content. Next we describe these three modules in more detail.

2.1 Silence Detection and Removal
Early on in our development we noticed that most queries were spoken in isolation. This means that the spoken query is always accompanied by some silence at the beginning and at the end. In addition, some phone call excerpts also contained unwanted long stretches of silence. In order to eliminate most silence regions without jeopardizing the non-silence ones, we applied a simple energy-based thresholding algorithm, individually for every file, as follows: first, we compute the average energy of the signal over windows of 200 ms, every 5 ms. Then we find the smallest energy value and the average of the top 1% highest energy values (we average several values rather than taking a single maximum in order to reduce the effect of outliers). Next we compute a threshold at 5% of the resulting dynamic range above the minimum energy value. Finally, we apply this threshold every 5 ms of the input signal to differentiate between speech and silence. To avoid fast changes between speech and silence, we apply a top-hat algorithm with a window of 100 ms to the binary output of the previous step, ensuring that no silence or speech segment shorter than 100 ms is output.
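To make this concrete, the following is a minimal sketch of such an energy-based detector. The 200 ms / 5 ms windowing, the averaging of the top 1% energies, the 5% threshold and the 100 ms minimum segment length come from the description above; the function and parameter names are ours, and the 100 ms top-hat cleanup is approximated here with a morphological opening followed by a closing (via SciPy), which removes speech and silence segments shorter than the window.

    import numpy as np
    from scipy.ndimage import binary_closing, binary_opening  # assumed library choice

    def detect_speech(signal, sample_rate, win=0.200, hop=0.005,
                      top_frac=0.01, thr_frac=0.05, min_len=0.100):
        """Return one speech/silence boolean per 5 ms hop (sketch of Section 2.1)."""
        signal = np.asarray(signal, dtype=float)
        win_n, hop_n = int(win * sample_rate), int(hop * sample_rate)
        # Average energy over 200 ms windows, computed every 5 ms.
        energy = np.array([np.mean(signal[i:i + win_n] ** 2)
                           for i in range(0, max(1, len(signal) - win_n), hop_n)])
        e_min = energy.min()
        # Average of the top 1% energies, to reduce the effect of outliers.
        n_top = max(1, int(len(energy) * top_frac))
        e_max = np.sort(energy)[-n_top:].mean()
        # Threshold at 5% of the dynamic range above the minimum energy.
        threshold = e_min + thr_frac * (e_max - e_min)
        speech = energy > threshold
        # Remove speech/silence segments shorter than 100 ms.
        structure = np.ones(int(min_len / hop), dtype=bool)
        return binary_closing(binary_opening(speech, structure), structure)

The segments labelled as silence are then removed from both the queries and the audio content.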
No information at all is used regarding the language the queries are spoken in or their content (i.e. the transcription).

2.2 Acoustic Feature Extraction
Most of our effort in this year's evaluation went into designing a good acoustic feature extraction module. Our goal was to extract from the audio signal features that retain all the acoustic information about what was said while being speaker and background independent. As a side objective, we also wanted to be as independent as possible of outside training data.

We based the design of our feature extractor on previous work that started with [1], which used phone posterior probabilities as features, and was later extended in [5] to the automatic word discovery task. Similarly to [5], for our main submission we train a Gaussian Mixture Model and store the Gaussian posterior probabilities (normalized to sum to 1) as our features. In our case we decided to use only the development data available for the SWS task, so no external data was used to train this model. In addition, once the GMM has been trained with the EM-ML algorithm, we perform a hard assignment of each frame to its most likely Gaussian and retrain each Gaussian's mean and variance to optimally model the frames assigned to it. This last step tries to solve a problem most EM-ML systems have: they optimize the Gaussian parameters to maximize the overall likelihood of the model on the input data, but not to discriminate between the different sounds in it. By performing the final assignment and retraining step we push the Gaussians apart from each other so that they better model individual groups of frames depending on their location and density.

Alternatively, we also submitted a contrastive system that binarizes the posterior probability vector of each frame. This is inspired by our recent developments in speaker verification [2], where we show that we can effectively build binary models to discriminate between speakers. Such representations are much smaller for storage purposes and can be processed much faster, as binary distances are usually very fast to compute. In this case, for every posterior probability vector we set to 1 the 20% highest probabilities and to 0 the rest. The chosen distance between two binary vectors x and y was defined as

    S_d(x, y) = \frac{\sum_{i=1}^{N} x[i] \wedge y[i]}{\sum_{i=1}^{N} x[i] \vee y[i]}    (1)

where \wedge indicates the boolean AND operator and \vee indicates the boolean OR operator.
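As an illustration of this feature extraction, here is a minimal sketch assuming scikit-learn for the initial EM-ML training; the library choice, the number of Gaussians and all function names are our own and not part of the submitted system.

    import numpy as np
    from sklearn.mixture import GaussianMixture  # assumed library choice

    def train_posteriorgram_gmm(frames, n_gaussians=128):
        """Train a GMM with EM, then hard-assign each frame to its most likely
        Gaussian and refit that Gaussian's mean and variance on those frames only
        (the retraining step of Section 2.2). `frames` is (n_frames, n_dims)."""
        gmm = GaussianMixture(n_components=n_gaussians, covariance_type="diag")
        gmm.fit(frames)                      # standard EM-ML training
        labels = gmm.predict(frames)         # hard assignment to the most likely Gaussian
        for k in range(n_gaussians):
            assigned = frames[labels == k]
            if len(assigned) > 1:
                gmm.means_[k] = assigned.mean(axis=0)
                gmm.covariances_[k] = assigned.var(axis=0) + 1e-6
        # Refresh the cached precisions so posteriors use the retrained parameters.
        gmm.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_)
        return gmm

    def posteriorgram(gmm, frames):
        """Per-frame Gaussian posterior probabilities, normalized to sum to 1."""
        return gmm.predict_proba(frames)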
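The binarization of the posteriorgram and the score of Eq. (1) can be sketched in the same spirit (again, the NumPy-based implementation and the names are ours):

    import numpy as np

    def binarize(posteriors, keep_frac=0.20):
        """Set to 1 the 20% largest posterior values of each frame, 0 the rest."""
        n_frames, n_dims = posteriors.shape
        n_keep = max(1, int(round(n_dims * keep_frac)))
        top = np.argsort(posteriors, axis=1)[:, -n_keep:]  # indices of the largest posteriors
        binary = np.zeros_like(posteriors, dtype=bool)
        binary[np.arange(n_frames)[:, None], top] = True
        return binary

    def eq1_score(x, y):
        """Eq. (1): number of dimensions active in both frames over the number
        active in either frame (boolean AND count over OR count)."""
        return np.logical_and(x, y).sum() / np.logical_or(x, y).sum()

Stored as bit vectors, these features are compact, and Eq. (1) reduces to two population counts, which is what makes the binary representation fast to compare.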
2.3 Query Search Algorithm
Given two sequences of posterior probabilities, X and Y, obtained respectively from the query and from a given phone recording, we compare them using a DTW-like algorithm. The standard DTW algorithm returns the optimum alignment between two sequences by finding the optimum path between their start (0, 0) and end (x_end, y_end) points. In our case we constrain the query signal to match between its start and end, but we allow the phone recording to start its alignment at any position (0, y) and to finish whenever the dynamic programming algorithm reaches x = x_end. Although we do not set any global constraints, the local constraints are set so that at most 2-times or 1/2-times warping is allowed, by choosing the path that minimizes the cost to reach position (i, j) as

    cost(i, j) = d(i, j) + \min \begin{cases}
        D(i-2, j-1) / (\#(i-2, j-1) + 3) \\
        D(i-2, j-2) / (\#(i-2, j-2) + 4) \\
        D(i-1, j-2) / (\#(i-1, j-2) + 3)
    \end{cases}    (2)

where D(i, j) is the accumulated (non-normalized) distance of the optimum path up to position (i, j), d(i, j) is the local distance between frames x_i and y_j of the two compared sequences, and #(i, j) is the number of jumps of the optimum path up to that point. Note that when normalizing the different possible paths we slightly favor the diagonal match.

3. RESULTS AND DISCUSSION
Table 1 shows the official results obtained with our systems for the primary (posteriorgram features) and contrastive (binarized features) submissions. In all cases we report the Term Weighted Maximum Value (TWMV) instead of the actual value, as during development we did not place much emphasis on finding an optimum decision threshold for our system. Still, we observed that for any given threshold the results remain similar both in dev-dev and in eval-eval.

Table 1: Term Weighted Maximum Value for the submitted systems

    Dataset-termlist   Posteriorgrams   Binary features
    dev-dev            0.156            0.205
    dev-eval           0.019            0.022
    eval-dev           0.000            0.000
    eval-eval          0.173            0.222

In general, we find the results for dev-dev and eval-eval to be very acceptable. On the other hand, we were surprised to see that our system does not work nearly as well for the cross conditions. We have observed that channel mismatch might have played a major role in these results, as in several cases the development files contain recordings with a much poorer signal quality than the evaluation files. We consider that we have achieved reasonable speaker independence with our features, but we still need to apply ways to compensate for differences in the channel.

Comparing the two submissions, we observe that the binary features always outperform the standard posteriorgrams. In our view this is a very interesting finding that can be used in the near future to speed up spoken word search and automatic pattern discovery systems, and, together with the proposed novel way to compute the GMM model, can achieve fast and quite accurate results.

4. REFERENCES
[1] G. Aradilla, J. Vepa, and H. Bourlard. Using posterior-based features in template matching for speech recognition. In Proc. ICSLP, 2006.
[2] J.-F. Bonastre, X. Anguera, G. H. Sierra, and P.-M. Bousquet. Speaker modeling using local binary decisions. In Proc. Interspeech, 2011.
[3] A. Kumar, N. Rajput, D. Chakraborty, S. K. Agarwal, and A. A. Nanavati. WWTW: The World Wide Telecom Web. In Proc. NSDR 2007 (SIGCOMM workshop), Kyoto, Japan, August 2007.
[4] N. Rajput and F. Metze. Spoken web search. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[5] Y. Zhang and J. Glass. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. ASRU, pages 398-403, Merano, Italy, December 2009.