GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015

Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo
AtlantTIC Research Center, E.E. Telecomunicación, Campus Universitario S/N, 36310 Vigo
{plopez,ldocio,carmen}@gts.uvigo.es

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we present the systems developed by the GTM-UVigo team for the query-by-example search on speech task (QUESST) at MediaEval 2015. The systems consist of a fusion of 11 dynamic time warping based systems that use phoneme posteriorgrams for speech representation; the primary system introduces a technique to select the most relevant phonetic units of each phoneme decoder, leading to an improvement of the search results.

1. INTRODUCTION
The query-by-example search on speech task (QUESST) aims at searching for audio content within audio content using an audio query [13], with a special focus on low-resource languages. This paper describes the systems developed by the GTM-UVigo team to address this task; the code of the GTM-UVigo systems will be released at https://github.com/gtm-uvigo/MediaEval_QUESST2015.

2. GTM-UVIGO SYSTEM DESCRIPTION
The GTM-UVigo systems consist of the fusion of 11 individual systems that represent the documents and queries by means of phoneme posteriorgrams; subsequence dynamic time warping (S-DTW) is then used to perform the search. The primary system features a phonetic unit selection strategy, which is briefly described in this Section.

2.1 Phoneme posteriorgrams
Three architectures were used to obtain phoneme posteriorgrams:

• lstm: a context-independent phone recognizer based on a long short-term memory (LSTM) neural network was trained using the KALDI toolkit [5]. A 2-layer LSTM was used; the input of the first layer consists of 40 log filter-bank energies augmented with 3 pitch-related features [4], and the output layer dimension was the number of context-independent phone units.

• dnn: a deep neural network (DNN)-based context-dependent phone recognizer was trained using the KALDI toolkit following Karel Veselý's DNN training implementation [15]. The network has 6 hidden layers, each with 2048 units, and it was trained on LDA-STC-fMLLR features obtained from auxiliary Gaussian mixture models (GMM) [15]. The dimension of the input layer was 440, and the output layer dimension was the number of context-dependent states.

• traps: the phone decoder based on long temporal context developed at the Brno University of Technology (BUT) was used [11].

11 models, summarized in Table 1, were trained using data in 6 languages: Galician (GA), Spanish (ES), English (EN), Czech (CZ), Hungarian (HU) and Russian (RU).

Table 1: Databases used to train the acoustic models. BUT models were used in the traps systems.

  System         Database            Duration (h)
  GAdnn, GAlstm  Transcrigal [3]     35
  ESdnn, ESlstm  TC-STAR [2]         78
  ENdnn, ENlstm  LibriSpeech [8]     100
  CZdnn, CZlstm  Vystadial 2013 [6]  15
  CZ/HU/RUtraps  SpeechDat           n/a

2.2 Dynamic Time Warping Strategy
The search of the spoken queries within the audio documents is performed by means of S-DTW [7]. First, a cost matrix M ∈ ℜ^{n×m} is defined, where the rows and the columns correspond to the frames of the query Q and the document D, respectively:

  M_{i,j} = \begin{cases}
    c(q_i, d_j)              & \text{if } i = 0 \\
    c(q_i, d_j) + M_{i-1,0}  & \text{if } i > 0,\ j = 0 \\
    c(q_i, d_j) + M^{*}(i,j) & \text{otherwise}
  \end{cases}    (1)

where c(q_i, d_j) represents the cost between query vector q_i and document vector d_j, both of dimension U, and

  M^{*}(i,j) = \min\left(M_{i-1,j},\ M_{i-1,j-1},\ M_{i,j-1}\right)    (2)

Pearson's correlation coefficient r is used as distance metric [14]:

  r(q_i, d_j) = \frac{U\,(q_i \cdot d_j) - \|q_i\|\,\|d_j\|}{\sqrt{\left(U\|q_i^2\| - \|q_i\|^2\right)\left(U\|d_j^2\| - \|d_j\|^2\right)}}    (3)

where \|x\| denotes the sum of the components of vector x and \|x^2\| the sum of its squared components. In order to use r as a cost function, it is linearly mapped to the range [0,1], where 0 corresponds to correlations equal to 1 and 1 corresponds to correlations equal to -1, i.e. c(q_i, d_j) = (1 - r(q_i, d_j))/2.

In order to detect n_c candidate matches of a query in a spoken document, every time a candidate match ending at frame b* is detected, M(n, b*) is set to ∞ so that this match is ignored when searching for the next candidate.
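To make the search procedure concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); it is a re-implementation for illustration, not the code released by the authors, and the function names and frame-per-row array layout are our own choices:

```python
import numpy as np

def pearson_cost(Q, D):
    """Frame-pairwise cost between query Q (n x U) and document D (m x U).

    Implements Eq. (3), with ||x|| the sum of the components of x, and then
    maps r in [-1, 1] linearly to a cost in [0, 1] (r = 1 -> 0, r = -1 -> 1).
    """
    U = Q.shape[1]
    sq, sd = Q.sum(axis=1), D.sum(axis=1)                  # ||q_i||, ||d_j||
    sq2, sd2 = (Q ** 2).sum(axis=1), (D ** 2).sum(axis=1)  # ||q_i^2||, ||d_j^2||
    r = (U * (Q @ D.T) - np.outer(sq, sd)) / np.sqrt(
        np.outer(U * sq2 - sq ** 2, U * sd2 - sd ** 2))
    return (1.0 - r) / 2.0

def sdtw(Q, D):
    """Accumulated cost matrix M of the S-DTW recursion, Eqs. (1)-(2)."""
    c = pearson_cost(Q, D)
    n, m = c.shape
    M = np.empty((n, m))
    M[0, :] = c[0, :]              # a match may start at any document frame
    for i in range(1, n):
        M[i, 0] = c[i, 0] + M[i - 1, 0]
        for j in range(1, m):
            M[i, j] = c[i, j] + min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])
    return M

def candidate_matches(M, n_c):
    """Extract n_c candidate match end frames: repeatedly take the best end
    frame b* in the last row, then set M[n-1, b*] to infinity to ignore it."""
    M = M.copy()
    ends = []
    for _ in range(n_c):
        b = int(np.argmin(M[-1, :]))
        ends.append((b, float(M[-1, b])))
        M[-1, b] = np.inf
    return ends
```

A full system would also recover each warping path by backtracking through M, since the path and its length are needed for the unit selection of Section 2.3 and the score normalization of Section 2.4.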
2.3 Phoneme Unit Selection
A technique to select the most relevant phonemes among the phonetic units of the different decoders was used in the primary system. Given the best alignment path P(Q,D) of length K between a query and a matching document, the correlation and the cost at each step of the path can be decomposed so that there is a different term for each phonetic unit u:

  r(q_i, d_j, u) = \frac{U\,q_{i,u}\,d_{j,u} - \frac{1}{U}\|q_i\|\,\|d_j\|}{\sqrt{\left(U\|q_i^2\| - \|q_i\|^2\right)\left(U\|d_j^2\| - \|d_j\|^2\right)}}    (4)

so that the terms r(q_i, d_j, u) sum to r(q_i, d_j) of Eq. (3) over the U units. In this way, the cost accumulated by each phonetic unit along the best alignment path can be computed:

  R(P(Q,D), u) = \frac{1}{K} \sum_{k=1}^{K} c(q_{i_k}, d_{j_k}, u)    (5)

This value R(P(Q,D), u) can be considered the relevance of the phonetic unit u: the lower its contribution to the cost, the more relevant the unit. Hence, the phonetic units can be sorted from most to least relevant in order to keep the most relevant ones and discard those that increase the cost of the best alignment path. Since a single alignment path may not provide a good estimate of the relevance of the phonetic units, the relevances obtained for the different query-matching document pairs in the development set were accumulated in order to estimate them robustly. The number of phonetic units to keep was empirically selected for each system.
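The following sketch shows how Eqs. (4)-(5) can be turned into a selection procedure. It is our illustration, under one assumption the paper does not spell out: the constant term of the (1 - r)/2 cost mapping is split equally across the U units, so that the per-unit costs sum to the frame-pair cost.

```python
import numpy as np

def per_unit_cost(q, d):
    """Decompose the cost of a frame pair into one term per phonetic unit.

    The correlation term follows Eq. (4); splitting the constant of the
    (1 - r)/2 mapping equally across units is an assumption made here so
    that the per-unit costs sum to c(q, d).
    """
    U = q.shape[0]
    sq, sd = q.sum(), d.sum()
    den = np.sqrt((U * (q ** 2).sum() - sq ** 2) *
                  (U * (d ** 2).sum() - sd ** 2))
    r_u = (U * q * d - (sq * sd) / U) / den   # Eq. (4), one term per unit u
    return (1.0 / U - r_u) / 2.0

def path_relevance(path_pairs):
    """Relevance R(P(Q,D), u) of Eq. (5): the per-unit cost averaged over the
    K steps of a best alignment path, given as (q_frame, d_frame) pairs."""
    return sum(per_unit_cost(q, d) for q, d in path_pairs) / len(path_pairs)

def select_units(relevances, n_keep):
    """Accumulate the relevances of all development query/matching-document
    pairs and keep the n_keep units with the lowest accumulated cost."""
    total = np.sum(list(relevances), axis=0)
    return np.argsort(total)[:n_keep]
```

Here n_keep plays the role of the per-system number of retained units, which the paper reports was tuned empirically on the development set.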
2.4 Normalization and fusion
Score normalization and fusion were performed following [12]. First, the scores were normalized by the length of the warping path. A binary logistic regression was used for fusion, as described in [1].
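A sketch of this stage might look as follows. It is illustrative only: scikit-learn's LogisticRegression is used as a stand-in for the calibration and fusion recipes of [12] and [1], and all names are ours. The z-norm variant corresponds to the "late" systems discussed in Section 3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def path_length_norm(raw_score, path_length):
    """Normalize a raw S-DTW match score by the length of its warping path."""
    return raw_score / path_length

def z_norm(query_scores):
    """Per-query z-normalization used in the 'late' systems: standardize the
    scores of one query across all documents."""
    s = np.asarray(query_scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-12)

def fuse_systems(dev_scores, dev_labels, test_scores):
    """Fuse per-system scores with a binary logistic regression.

    dev_scores:  (trials x 11) normalized scores of the individual systems
    dev_labels:  (trials,) 1 for true query/document matches, 0 otherwise
    test_scores: (trials x 11) scores to fuse
    Returns log-odds scores for the test trials.
    """
    clf = LogisticRegression()
    clf.fit(dev_scores, dev_labels)
    return clf.decision_function(test_scores)
```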
3. RESULTS AND DISCUSSION
Table 2 shows the results obtained on QUESST 2015 data with the submitted systems. The primary system, which features phoneme unit selection, clearly outperforms the contrastive system, suggesting that the proposed technique obtains the expected improvement. It can also be observed that Dev and Eval results are very similar, showing the generalization capability of the systems. The late systems feature z-norm normalization of the query scores, which yields an improvement with respect to the original submissions, where only path-length normalization was applied. Table 3 compares the actCnxe obtained with and without applying the phoneme unit selection approach in several individual systems.

Table 2: Performance of the systems submitted by the GTM-UVigo team.

                           Dev                        Eval                       Dev-late                   Eval-late
  System       Metric      All    T1     T2     T3    All    T1     T2     T3    All    T1     T2     T3    All    T1     T2     T3
  Primary      actCnxe     0.917  0.881  0.943  0.918 0.919  0.864  0.959  0.913 0.875  0.841  0.890  0.882 0.871  0.815  0.916  0.866
               minCnxe     0.905  0.861  0.928  0.904 0.905  0.844  0.946  0.882 0.847  0.788  0.865  0.860 0.838  0.758  0.895  0.824
               lowerbound  0.627  0.562  0.672  0.631 0.629  0.532  0.702  0.627 0.593  0.526  0.633  0.606 0.592  0.490  0.657  0.601
  Contrastive  actCnxe     0.998  0.998  0.997  1.000 0.999  0.999  0.997  1.000 0.907  0.897  0.916  0.904 0.898  0.852  0.933  0.896
               minCnxe     0.918  0.874  0.942  0.898 0.923  0.865  0.953  0.907 0.864  0.811  0.877  0.880 0.852  0.785  0.900  0.843
               lowerbound  0.635  0.588  0.681  0.627 0.633  0.555  0.693  0.624 0.618  0.559  0.655  0.633 0.613  0.521  0.669  0.622

Table 3: actCnxe of some individual systems with and without applying phoneme unit selection.

                      With                         Without
  System       Global  T1     T2     T3      Global  T1     T2     T3
  CZdnn        0.889   0.829  0.902  0.906   0.915   0.867  0.927  0.922
  CZlstm       0.902   0.864  0.922  0.901   0.907   0.864  0.932  0.904
  CZtraps      0.902   0.840  0.924  0.910   0.931   0.883  0.945  0.938
  HUtraps      0.903   0.856  0.926  0.899   0.934   0.894  0.950  0.936
  RUtraps      0.895   0.844  0.918  0.894   0.925   0.886  0.944  0.922

Table 4 shows the indexing speed factor (ISF), the searching speed factor (SSF), the peak memory usage for indexing (PMU_I) and searching (PMU_S), and the processing load (PL), computed as described in [9]. These values were measured on 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz (12 cores/24 threads) with 128 GB of RAM. ISF and PMU_I are rather high because, in the dnn systems, an automatic speech recognition (ASR) system is first applied in order to obtain the input features to the DNN: the peak memory usage is large due to the memory requirements of the language model, and the large computation time is caused by the two recognition passes that are performed to estimate the transformation matrix used to obtain the fMLLR features that are the input to the DNN. In future work, the ASR step of the dnn systems will be replaced with a phonetic network in order to avoid these time- and memory-consuming steps.

Table 4: Required amount of processing resources.

  ISF    SSF    PMU_I  PMU_S  PL
  12.1   0.09   6      0.014  7.3

4. ACKNOWLEDGEMENTS
This research was funded by the Spanish Government ('SpeechTech4All' project, TEC2012-38939-C03-01), by the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the 'AtlantTIC' project CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA.

5. REFERENCES
[1] M. Akbacak, L. Burget, W. Wang, and J. van Hout. Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams. In Proceedings of ICASSP, pages 8267-8271, 2013.
[2] L. Docio-Fernandez, A. Cardenal-Lopez, and C. Garcia-Mateo. TC-STAR 2006 automatic speech recognition evaluation: The UVIGO system. In TC-STAR Workshop on Speech-to-Speech Translation, 2006.
[3] C. Garcia-Mateo, J. Dieguez-Tirado, L. Docio-Fernandez, and A. Cardenal-Lopez. Transcrigal: A bilingual system for automatic indexing of broadcast news. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 2004.
[4] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Proceedings of ICASSP, pages 2494-2498, 2014.
[5] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP, pages 6645-6649, 2013.
[6] M. Korvas, O. Plátek, O. Dušek, L. Žilka, and F. Jurčíček. Vystadial 2013 - Czech data, 2014. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
[7] M. Müller. Information Retrieval for Music and Motion. Springer-Verlag, 2007.
[8] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In Proceedings of ICASSP, pages 99-105, 2015.
[9] L. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 spoken web search task: system performance measures. Technical report, Dept. of Electricity and Electronics, University of the Basque Country, 2013.
[10] L. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez. GTTS systems for the SWS task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, 2013.
[11] P. Schwarz. Phoneme Recognition based on Long Temporal Context. PhD thesis, Brno University of Technology, 2009.
[12] I. Szöke, L. Burget, F. Grézl, J. Černocký, and L. Ondel. Calibration and fusion of query-by-example systems - BUT SWS 2013. In Proceedings of ICASSP, pages 7899-7903, 2014.
[13] I. Szöke, L. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiong. Query by example search on speech at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, 2015.
[14] I. Szöke, M. Skácel, and L. Burget. BUT QUESST2014 system description. In Proceedings of the MediaEval 2014 Workshop, 2014.
[15] K. Veselý, A. Ghoshal, L. Burget, and D. Povey. Sequence-discriminative training of deep neural networks. In Proceedings of INTERSPEECH, pages 2345-2349, 2013.