BUT-HCTLab APPROACHES FOR SPOKEN WEB SEARCH - MEDIAEVAL 2011

Igor Szöke, Speech@FIT, Brno University of Technology, Czech Republic, szoke@fit.vutbr.cz
Javier Tejedor, HCTLab, Universidad Autónoma de Madrid, Spain, javier.tejedor@uam.es
Michal Fapšo, Speech@FIT, Brno University of Technology, Czech Republic, ifapso@fit.vutbr.cz
José Colás, HCTLab, Universidad Autónoma de Madrid, Spain, jose.colas@uam.es
ABSTRACT
We present the three approaches submitted to the Spoken Web Search task. Two of them rely on Acoustic Keyword Spotting (AKWS), while the third relies on Dynamic Time Warping. Features are 3-state phone posteriors. Results suggest that applying a Karhunen-Loeve transform to the log-phone posteriors representing the query, building a GMM/HMM for each query, and running a subsequent AKWS system performs the best.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval, Information Search and Retrieval, Search process

General Terms
Experimentation

Keywords
query-by-example spoken term detection, acoustic keyword spotting, dynamic time warping, spoken web search

1. MOTIVATION
The Spoken Web Search (SWS) task aims at building a language-independent query-by-example spoken term detection system without any knowledge of the target language or of the query transcriptions. Accordingly, our approaches are based on the combination of as many language-dependent "recognizers" as possible.¹

2. FEATURE EXTRACTION
Our feature extractor outputs 3-state phone posteriors as features [2]. The phone posterior estimator [5] contains a Neural Network (NN) classifier with a hierarchical structure called the bottle-neck universal context network. It consists of a context network, trained as a 5-layer NN, and a merger which employs 5 context net outputs. Relevant parameters are given in Table 1 and more details are in [5].

Table 1: Language parameter specifications: number of phones; |UC|, the size of the hidden layer in the universal context NN; |Mer|, the size of the hidden layers in the merger with 3-state phone posterior output; and |Out|, the size of the 3-state phone posterior output layer.

  Language    Phones   |UC|   |Mer|   |Out|
  Czech         37     1373    495     114
  English       44     1298    488     135
  Hungarian     64     1128    470     193
  Levantine     32     1432    500      99
  Polish        33     1408    498     105
  Russian       49     1228    481     157
  Slovak        41     1305    489     133

3. APPROACHES

3.1 Parallel Acoustic Keyword Spotting (PAKWS)
We combined decisions from 6 language-dependent Acoustic Keyword Spotters (AKWS), one for each language in Table 1 except Polish, which was excluded due to its worse performance on dev data. One AKWS consists of two steps: query recognition, done by a phone recognizer, and query detection, done by the AKWS itself; the two differ only in the decoder. Features (3-state phone posteriors) extracted from the audio are fed into a phone decoder – an unrestricted phone loop without any phone insertion penalty. AKWS filler model-based recognition networks [4] are built according to the phone string detected for each query. The filler/background models are represented by a phone loop. Each phone model is represented by a 3-state HMM tied to the 3-state phone posteriors. The output of the AKWS is a set of putative hits. The score is the logarithm of the likelihood ratio, normalized by the length of the detection. These detections are converted into a matrix of size #queries × #frames for each utterance. Next, the matrices for all 6 languages are "log added". Finally, the combined matrix is converted back to a list of detections; a detection on which all 6 detectors agree thus obtains a higher score.
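The per-language score combination described above can be sketched as follows. This is a minimal illustration, not the original system: the function name is ours, and we assume "log added" means an element-wise addition in the log domain, implemented here with numpy.logaddexp.

```python
import numpy as np

def combine_scores(score_matrices):
    """Combine per-language detection score matrices in the log domain.

    Each matrix has shape (#queries, #frames), holding length-normalized
    log likelihood-ratio scores from one language-dependent AKWS.
    Detections on which all detectors agree end up with a higher
    combined score.
    """
    combined = score_matrices[0]
    for m in score_matrices[1:]:
        # log(exp(a) + exp(b)) computed in a numerically stable way
        combined = np.logaddexp(combined, m)
    return combined
```

The combined matrix would then be converted back to a list of detections, as in the paper.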
¹ Part of this work was done while JT was a visiting researcher at Speech@FIT, BUT.

3.2 GMM/HMM term modeling
Inspired by the previous approach, this one relies on a single AKWS for query detection, with these differences: (1) the background model is a GMM/HMM with 1 state, modeled with 10 GMM components; (2) the query model is represented by a GMM/HMM whose number of states is 3 times the number of phones according to the phone recognition, with 1 GMM component per state; and (3) all the languages in Table 1 have been employed to produce the final feature super-vector. Queries represented by a single phone have been modeled with 6 states, as if the query contained 2 phones. We used the number of phones output by the Slovak recognizer due to its best performance in terms of the Upper-bound Term Weighted Value (UBTWV) metric [3]. Features used for background and query modeling were obtained as follows: (1) a Karhunen-Loeve transform (KLT) is applied to the log-phone posteriors obtained from the feature extractor for each individual language; (2) we keep the features that explain up to 95% of the variance after the KLT for each individual language; and (3) we build a 152-dimensional feature super-vector from them. The KLT statistics were computed from the dev data and then applied to both the dev/eval queries and the dev/eval data.

3.3 Dynamic Time Warping (DTW)
A similarity matrix from phonetic posteriorgrams [5] stores the similarity between each query frame and each utterance frame, with the cosine distance as similarity function, and a DTW search hypothesizes putative hits. DTW is run iteratively, starting at every frame of the utterance and ending at a frame of the utterance [5]. The features that represent the phonetic posteriorgrams are the concatenation of the 3-state phone posteriors corresponding to every language in Table 1.

4. FILTERING AND CALIBRATION
To deal with score calibration and with problematic query lengths under certain approaches, detections were post-processed in the following steps:

1] "Filtering" detections according to their length difference from the "average length". The average length of a query is calculated as the average length of speech (phones) across the 6 phone recognizers used in the PAKWS approach. It was applied to all the approaches except the DTW as follows:

              | Sc(det) − (L^Q_min − L(det)) / L^Q_min,   L^Q_min > L(det)
  ScF(det) =  | Sc(det) − (L(det) − L^Q_max) / L^Q_max,   L(det) > L^Q_max      (1)
              | Sc(det),                                   otherwise

where Q identifies the query to which the detection belongs, Sc(det) is the original score, ScF(det) is the "filtered" score, L(det) is the length of the detection in frames, L^Q_min = 0.8 L^Q_aver is 80% of the average query length and L^Q_max = 1.4 L^Q_aver is 140% of the average query length. The detection score remains the same if the detection is longer than 80% and shorter than 140% of the average query length. Otherwise, the score is lowered in proportion to how much shorter/longer the detection is than the average query length.

2] Calibration, applied only in our PAKWS approach, produces the final score of each detection as follows:

  ScC(det) = ScF(det) + A1 + A2 · Occ(Q),      (2)

where Occ(Q) is the number of detection occurrences of the query in the data, and A1 = −1.0807 and A2 = −0.0001 are calibration parameters. These were estimated from the best thresholds (UBTWV) on dev data using linear regression.

5. RESULTS AND DISCUSSION
Results for the required runs [1] are given in Table 2. The PAKWS approach has two versions (with and without score calibration). We clearly see that the GMM/HMM term modeling approach outperforms the other two to a great extent on unseen queries/data, even with score calibration applied to the PAKWS approach. We consider this is due to the following: (1) the KLT statistics, computed from dev data and applied to queries and data, play the role of "adaptation" towards the target domain, which differs from the domain used to train the phone estimators from which the 3-state phone posteriors are computed; (2) the use of a single example to train the query model is more robust against uncertainties than the set of features itself (used in the DTW approach) or the phone transcription obtained from phone decoding (used in the PAKWS approach); (3) the GMM/HMM approach relies on a prior combination of the most relevant features after the KLT, as opposed to the PAKWS approach, which is based on a posterior combination of the detections obtained from each individual AKWS system; and (4) comparing the MTWV (pooled) and UBTWV (non-pooled) for PAKWS – Qdev-Ddev 0.133 and 0.253, Qeval-Ddev 0.002 and 0.056, Qdev-Deval 0.030 and 0.157, and Qeval-Deval 0.033 and 0.223, respectively – suggests that the PAKWS system is the most sensitive to data mismatch. The GMM/HMM-based term modeling approach is less sensitive to data mismatch, with the following values: Qdev-Ddev 0.103 and 0.238, Qeval-Ddev 0.019 and 0.035, Qdev-Deval 0.010 and 0.179, and Qeval-Deval 0.131 and 0.267, respectively. For the DTW, a similar pattern to that of the GMM/HMM is observed: Qdev-Ddev 0.020 and 0.106, Qeval-Ddev 0 and 0.011, Qdev-Deval 0 and 0.099, and Qeval-Deval 0.014 and 0.055.

Table 2: ATWV results for the approaches. "PAKWS-cal" denotes the PAKWS approach with score calibration and "PAKWS-nocal" the PAKWS approach without score calibration. "Qx-Dy" denotes the set of "x" queries searched on the set of "y" data.

  Approach       Qdev-Ddev   Qeval-Ddev   Qdev-Deval   Qeval-Deval
  PAKWS-cal        0.133       −0.221       −1.141       −0.307
  PAKWS-nocal      0.093       −0.359        0.009       −0.110
  GMM/HMM          0.103        0.008        0.024        0.101
  DTW              0.020       −0.005       −0.032       −0.115

6. CONCLUSIONS
Our GMM/HMM-based term modeling approach achieves the best performance, whereas the other two, PAKWS and DTW, fail when facing the language-independency issue, due to the unreliable phone transcription derived in the former and the "meaningless" phone posteriors used by themselves in the latter. Future work will investigate new features to enhance the performance of the best approach.

7. REFERENCES
[1] N. Rajput and F. Metze. Spoken web search. In MediaEval 2011 Workshop, Pisa, Italy, 2011.
[2] P. Schwarz et al. Towards lower error rates in phoneme recognition. In Proc. of TSD, pages 465–472, 2004.
[3] I. Szöke. Hybrid word-subword spoken term detection. PhD thesis, 2010.
[4] I. Szöke et al. Phoneme based acoustics keyword spotting in informal continuous speech. LNAI, 3658:302–309, 2005.
[5] J. Tejedor et al. Novel methods for query selection and query combination in query-by-example spoken term detection. In Proc. of SSCS, pages 15–20, Florence, Italy, 2010.

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
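The length-filtering rule of equation (1) in Section 4 can be sketched as follows. This is a minimal illustration under our own naming (the function and variable names are not from the original system); lengths are in frames and the score is the length-normalized log likelihood-ratio.

```python
def filter_score(score, det_len, avg_len):
    """Penalize detections much shorter or longer than the average query length.

    Scores are unchanged for detections between 80% and 140% of the
    average query length; outside that band the score is lowered in
    proportion to the length mismatch, as in equation (1).
    """
    l_min = 0.8 * avg_len  # L^Q_min
    l_max = 1.4 * avg_len  # L^Q_max
    if det_len < l_min:
        return score - (l_min - det_len) / l_min
    if det_len > l_max:
        return score - (det_len - l_max) / l_max
    return score
```

For example, with an average query length of 100 frames, a 100-frame detection keeps its score, while a 50-frame detection is penalized by (80 − 50)/80.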
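The posteriorgram similarity matrix of Section 3.3 can be sketched as follows. This is a minimal illustration with our own function name: cosine similarity between each query frame and each utterance frame, with frames as rows of the posteriorgram matrices.

```python
import numpy as np

def similarity_matrix(query, utterance):
    """Cosine similarity between every query frame and every utterance frame.

    `query` has shape (n_query_frames, dim) and `utterance` has shape
    (n_utt_frames, dim); the result has shape (n_query_frames,
    n_utt_frames). A DTW search over this matrix would then hypothesize
    putative hits.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    return q @ u.T
```

In the paper, `dim` would be the concatenated 3-state phone posteriors of all the languages in Table 1.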