=Paper= {{Paper |id=None |storemode=property |title=The CMTECH Spoken Web Search System for MediaEval 2013 |pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_69.pdf |volume=Vol-1043 |dblpUrl=https://dblp.org/rec/conf/mediaeval/GraciaAB13 }} ==The CMTECH Spoken Web Search System for MediaEval 2013== https://ceur-ws.org/Vol-1043/mediaeval2013_submission_69.pdf
     The CMTECH Spoken Web Search System for MediaEval
                         2013

                 Ciro Gracia                          Xavier Anguera                       Xavier Binefa
           Universitat Pompeu Fabra                 Telefonica Research              Universitat Pompeu Fabra
              Barcelona, Spain                        Barcelona, Spain                   Barcelona, Spain
           ciro.gracia@upf.edu                      xanguera@tid.es                 xavier.binefa@upf.edu


ABSTRACT
We present a system for query-by-example search on zero-resource languages. The system compares speech patterns by fusing the contributions of two acoustic models to cover both their spectral characteristics and their temporal evolution. The spectral model uses standard Gaussian mixtures to model classical MFCC features. We introduce phonetic priors in order to bias the unsupervised training of the model. In addition, we extend the standard similarity metric used to compare posterior vectors by incorporating inter-cluster distances. To model temporal evolution patterns we use long temporal context models. We combine the information obtained by both models when computing the similarity matrix, allowing the subsequence-DTW algorithm to find optimal subsequence alignment paths between query and reference data. The resulting alignment paths are locally filtered and globally normalized. Our experiments on MediaEval data show that this approach provides state-of-the-art results and significantly improves on the single-model and standard-metric baselines.

1.   INTRODUCTION
   The task of searching for speech queries within a speech corpus, without a priori knowledge of the language or acoustic conditions of the data, is gaining interest in the scientific community. Within the Spoken Web Search (SWS) task of the MediaEval 2013 evaluation campaign [3], systems are given a set of acoustic queries that have to be searched for within a corpus of audio composed of several languages and different recording conditions. No information is given about the transcription of the queries or the speech corpus, nor about the language spoken.
   To tackle this task we propose a zero-resource system that extends several ideas from the state of the art. We adopt posteriorgram features [9, 5] in order to improve the comparison between speech features. Posteriorgram features are obtained from an acoustic model and allow acoustic vectors to be compared consistently by removing factors of feature variance. The difficulty at this point lies in how to obtain meaningful acoustic models in an unsupervised manner and how to properly compare posterior features. In order to obtain meaningful acoustic models from unsupervised data, we introduce linguistic prior information into the unsupervised training by using a specific pre-trained model as initialization. In addition, instead of using the standard dot product to compare normalized posteriorgram vectors, we extend this approach by incorporating into the comparison a specially crafted matrix that defines an inter-cluster similarity.
   Previous approaches [8] to MediaEval data have shown that fusing different sources of knowledge through different acoustic models provides a significant improvement in evaluation. Despite that, it is important to determine which types of information can complement each other, in order to guarantee a gain for the extra computational cost. Our approach to fusion is to combine temporal and spectral information. As stated above, one of the models focuses on the spectral configuration of the acoustic vectors, while the complementary model focuses on modeling the temporal evolution of the feature dimensions.
   For sequence matching we use the subsequence dynamic time warping algorithm (s-DTW) [7]. With it we obtain the alignment paths and the scores of all the potential matches of the query inside the utterance. The major difficulty lies in deciding which of the resulting alignments are acceptable as potential query instances, and in dealing with intra- and inter-query result overlap. In our system we use lowpass filtering to reduce the number of spurious detections, and we keep only the highest score among intra-query overlapping paths. Inter-query overlap is complex and remains future work. Finally, we explore two different approaches to global score normalization: the standard Z-norm approach and a score mapping based on the cumulative distribution function.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

2.   THE CMTECH SYSTEM DESCRIPTION
   The system is based on standard MFCC39 features computed by means of HTK (25 ms windows, 10 ms shift).

2.1     Spectral Acoustic Model
   The first acoustic model is based on a Gaussian mixture model (GMM). We originally trained this model using the TIMIT phonetic ground truth: we trained a 4-Gaussian GMM for each of the 39 Lee and Hon [6] phonetic classes and then combined all of them into a single GMM. This GMM is used as initialization for the unsupervised training of the final 156-component GMM on the SWS2013 utterances.
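The initialization scheme above can be sketched as follows. This is a minimal illustration using scikit-learn in place of the authors' actual tooling; the diagonal covariances, EM settings, and placeholder data are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_initial_gmm(features, labels, n_per_class=4):
    """Fit one small GMM per labelled phonetic class and merge the
    components into a single set of (weights, means, covariances)."""
    classes = np.unique(labels)
    weights, means, covs = [], [], []
    for c in classes:
        g = GaussianMixture(n_components=n_per_class,
                            covariance_type='diag',
                            random_state=0).fit(features[labels == c])
        weights.append(g.weights_)
        means.append(g.means_)
        covs.append(g.covariances_)
    w = np.concatenate(weights)
    return w / w.sum(), np.vstack(means), np.vstack(covs)

def refine_unsupervised(untranscribed, weights, means, covs):
    """Run EM on untranscribed data, seeded by the merged supervised GMM,
    so the phonetic prior biases the unsupervised solution."""
    gmm = GaussianMixture(n_components=len(weights),
                          covariance_type='diag',
                          weights_init=weights,
                          means_init=means,
                          precisions_init=1.0 / covs,
                          random_state=0)
    return gmm.fit(untranscribed)
```

With 39 classes and 4 Gaussians per class this yields the 156-component mixture described above.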
            Table 1: System results: MTWV/ATWV
          Normalization       Dev-Dev         Dev-Eval
          CDF equalization    0.2685/0.2683   0.2623/0.2619
          Z-normalization     0.2642/0.2638   0.2575/0.2552

   Using this model we build an inter-cluster distance matrix D (156x156) using the Kullback-Leibler divergence:

      D(i, j) = 1/2 ( log(|Σ_i|/|Σ_j|) + tr(Σ_i^{-1}Σ_j + Σ_j^{-1}Σ_i − 2I)
                      + (µ_i − µ_j)(Σ_i^{-1} + Σ_j^{-1})(µ_i − µ_j)^T )          (1)

   When comparing posterior features x, y we use:

      d_s(x, y) = x e^{−D} y^T                                                   (2)

   We found this extended comparison to provide more than 0.05 absolute MTWV points of gain on MediaEval 2012 data.
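A minimal sketch of this extended comparison, assuming diagonal-covariance components: the matrix D follows Eq. (1) and the similarity follows Eq. (2). The component parameters used here are illustrative placeholders.

```python
import numpy as np

def kl_distance_matrix(means, variances):
    """Eq. (1) for diagonal-covariance Gaussians: D[i, j] combines the
    log-determinant ratio, the trace term, and the quadratic term."""
    n = len(means)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = means[i] - means[j]
            log_det = np.sum(np.log(variances[i]) - np.log(variances[j]))
            trace = np.sum(variances[i] / variances[j]
                           + variances[j] / variances[i] - 2.0)
            quad = np.sum(diff ** 2 * (1.0 / variances[i] + 1.0 / variances[j]))
            D[i, j] = 0.5 * (log_det + trace + quad)
    return D

def extended_similarity(x, y, D):
    """Eq. (2): posterior vectors compared through exp(-D)
    rather than through a plain dot product."""
    return x @ np.exp(-D) @ y
```

For one-hot posteriors concentrated on components i and j, the score reduces to exp(-D[i, j]), so confusable clusters contribute to the match instead of being treated as orthogonal.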

2.2     Temporal Acoustic Model
   The objective of this temporal model is to extend the context information and to effectively complement the frame-based acoustic model. The temporal model is based on the long temporal context approach [1], trained on MediaEval 2012 data. We process each of the 39 MFCC dimensions independently. We first segmented the MediaEval 2012 data using an unsupervised phonetic segmentation approach [4] and extracted a 150 ms context from the center of each segment, forming a collection of R^31 vectors. Each context vector is standardized to zero mean and unit variance, windowed using a Hanning window, and decorrelated using the discrete cosine transform; only the first 15 coefficients are kept to form the final R^15 vector. The modeling is performed by hierarchical k-medoids together with a final estimation of the covariance matrices. The resulting model is composed of a Gaussian mixture model of 128 components for each of the original 39 dimensions.
   The comparison between two input vectors is done in each band b independently by means of its model posteriors x_b, y_b, which are then fused using the median operator:

      d_t(x, y, b) = (x_b y_b^T) / (||x_b|| ||y_b||)
      d_t(x, y) = median_b( d_t(x, y, b) )                                       (3)

   On MediaEval 2012 data, the incorporation of this acoustic model boosted our system's MTWV results from 0.47 to 0.53.
2.3     Query Search
   For each pair of query q and utterance u we build a distance matrix M of size |q| x |u| using:

      M(q, u) = −log( d_t(q, u) d_s(q, u) )                                      (4)

   We use s-DTW to obtain the score of the alignment paths ending at each possible position in u. In order to select relevant local maxima among the scores, we first lowpass filter the results using a 25-frame Gaussian window. Despite this filtering, the selected alignment paths retain their original score values.
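A compact sketch of this matching step: a subsequence-DTW accumulation over the fused cost matrix of Eq. (4), followed by Gaussian smoothing of the end-point scores to pick candidate local maxima. The unit step weights and the smoothing sigma are assumptions; the text specifies only a 25-frame Gaussian window.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def subsequence_dtw_scores(M):
    """M[i, j]: cost of aligning query frame i to utterance frame j.
    Returns the accumulated cost of the best subsequence path ending
    at each utterance frame (a match may start at any column)."""
    Q, U = M.shape
    acc = np.zeros_like(M)
    acc[0] = M[0]
    for i in range(1, Q):
        acc[i, 0] = M[i, 0] + acc[i - 1, 0]
        for j in range(1, U):
            acc[i, j] = M[i, j] + min(acc[i - 1, j],       # vertical
                                      acc[i, j - 1],       # horizontal
                                      acc[i - 1, j - 1])   # diagonal
    return acc[-1]

def candidate_endpoints(end_costs, sigma=25 / 6.0):
    """Negate costs into scores, smooth with a Gaussian window (sigma
    chosen so the effective support is roughly 25 frames), and keep the
    local maxima as candidate detections."""
    scores = gaussian_filter1d(-np.asarray(end_costs, dtype=float), sigma)
    return [j for j in range(1, len(scores) - 1)
            if scores[j] >= scores[j - 1] and scores[j] > scores[j + 1]]
```

The smoothing only selects which endpoints survive; as in the system above, a selected path would keep its original (unsmoothed) score.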
                                                                       2012(1):1–12, 2012.
2.4     Global Normalization
   When all utterances have been processed for a given query, we perform a normalization step. The first system presented (primary) uses a standard Z-normalization, excluding the first 500 results from the parameter estimation. Similarly to the contrast enhancement performed by histogram equalization in image processing [2], our mapping approach replaces the resulting query scores with their corresponding values of the query-score cumulative distribution function (cdf). This effectively maps the score distribution into a uniform distribution and its cdf into a linear function. Our second system (contrastive) replaces the global Z-normalization by this cdf equalization approach.

3.   RESULTS
   Table 1 shows the results obtained by our systems. We can see that the CDF equalization system obtains slightly better results than the Z-normalization system. The runtime factor is 0.0056 and the average memory usage is 11.5 GB. Many of the difficulties in the results come from a set of noisy and reverberant examples. We believe that denoising algorithms such as spectral subtraction would be useful to improve model training and performance on these samples.

4.   CONCLUSIONS
   Our future work will explore the relationship between system performance and voice activity detection, and will address the inter-query overlap problem and its inherent open-set classification problem. We are also interested in distinguishing the key elements that guarantee the suitability of an acoustic model for the task. Especially interesting is the exploration of rigid and elastic distribution matching methods, such as maximum likelihood linear transforms, in order to adapt pre-trained models to new data unsupervisedly.

5.   REFERENCES
[1] P. Schwarz. Phoneme recognition based on long temporal context.
[2] T. Acharya and A. K. Ray. Image Processing: Principles and Applications. Wiley, 2005.
[3] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodriguez-Fuentes. The spoken web search task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[4] C. Gracia and X. Binefa. On hierarchical clustering for speech phonetic segmentation. 2011.
[5] T. J. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 421-426. IEEE, 2009.
[6] C. Lopes and F. Perdigão. Broad phonetic class definition driven by phone confusions. EURASIP Journal on Advances in Signal Processing, 2012(1):1-12, 2012.
[7] M. Müller. Dynamic time warping. In Information Retrieval for Music and Motion, pages 69-84. Springer, 2007.
[8] H. Wang and T. Lee. The CUHK system for the spoken web search task at MediaEval 2012. In MediaEval 2012 Workshop, 2012.
[9] Y. Zhang and J. R. Glass. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 398-403. IEEE, 2009.