GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015

Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo
AtlantTIC Research Center, E.E. Telecomunicación, Campus Universitario S/N, 36310 Vigo
{plopez,ldocio,carmen}@gts.uvigo.es

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we present the systems developed by the GTM-UVigo team for the query-by-example search on speech task (QUESST) at MediaEval 2015. The systems consist of a fusion of 11 dynamic time warping based systems that use phoneme posteriorgrams for speech representation; the primary system introduces a technique to select the most relevant phonetic units of each phoneme decoder, leading to an improvement of the search results.

1. INTRODUCTION
The query-by-example search on speech task (QUESST) aims at searching for audio content within audio content using an audio query [13], with a special focus on low-resource languages. This paper describes the systems developed by the GTM-UVigo team to address this task; the code of the GTM-UVigo systems will be released at https://github.com/gtm-uvigo/MediaEval_QUESST2015.

2. GTM-UVIGO SYSTEM DESCRIPTION
The GTM-UVigo systems consist of the fusion of 11 individual systems that represent the documents and queries by means of phoneme posteriorgrams; subsequence dynamic time warping (S-DTW) is then used to perform the search. The primary system features a phonetic unit selection strategy, which is briefly described in this Section.

2.1 Phoneme posteriorgrams
Three architectures were used to obtain phoneme posteriorgrams:

• lstm: a context-independent phone recognizer based on a long short-term memory (LSTM) neural network was trained using the KALDI toolkit [5]. A 2-layer LSTM was used; the input of the first layer consists of 40 log filter-bank energies augmented with 3 pitch-related features [4], and the output layer dimension was the number of context-independent phone units.

• dnn: a deep neural network (DNN)-based context-dependent phone recognizer was trained using the KALDI toolkit following Karel Veselý's DNN training implementation [15]. The network has 6 hidden layers, each with 2048 units, and it was trained on LDA-STC-fMLLR features obtained from auxiliary Gaussian mixture models (GMM) [15]. The dimension of the input layer was 440, and the output layer dimension was the number of context-dependent states.

• traps: the phone decoder based on long temporal context developed at the Brno University of Technology (BUT) was used [11].

11 models, summarized in Table 1, were trained using data in 6 languages: Galician (GA), Spanish (ES), English (EN), Czech (CZ), Hungarian (HU) and Russian (RU).

Table 1: Databases used to train the acoustic models. BUT models were used in the traps systems.

  System         Database            Duration (h)
  GAdnn, GAlstm  Transcrigal [3]     35
  ESdnn, ESlstm  TC-STAR [2]         78
  ENdnn, ENlstm  LibriSpeech [8]     100
  CZdnn, CZlstm  Vystadial 2013 [6]  15
  CZ/HU/RUtraps  SpeechDat           n/a

2.2 Dynamic Time Warping Strategy
The search of the spoken queries within the audio documents is performed by means of S-DTW [7]. First, a cost matrix M ∈ ℜ^{n×m} is defined, where the rows and the columns correspond to the frames of the query Q and the document D, respectively:

  M_{i,j} = \begin{cases}
    c(q_i, d_j)              & \text{if } i = 0 \\
    c(q_i, d_j) + M_{i-1,0}  & \text{if } i > 0,\ j = 0 \\
    c(q_i, d_j) + M^{*}(i,j) & \text{otherwise}
  \end{cases}    (1)

where c(q_i, d_j) represents the cost between query vector q_i and document vector d_j, both of dimension U, and

  M^{*}(i,j) = \min\left(M_{i-1,j},\ M_{i-1,j-1},\ M_{i,j-1}\right)    (2)

Pearson's correlation coefficient r is used as distance metric [14]:

  r(q_i, d_j) = \frac{U\,(q_i \cdot d_j) - \|q_i\|\,\|d_j\|}{\sqrt{\left(U\|q_i^2\| - \|q_i\|^2\right)\left(U\|d_j^2\| - \|d_j\|^2\right)}}    (3)

where \|x\| denotes the sum of the components of vector x and \|x^2\| the sum of its squared components. In order to use r as a cost function, it is linearly mapped to the range [0,1], where 0 corresponds to correlations equal to 1 and 1 corresponds to correlations equal to -1, i.e. c(q_i, d_j) = (1 - r(q_i, d_j))/2.

In order to detect n_c candidate matches of a query in a spoken document, every time a candidate match ending at frame b* is detected, M(n, b*) is set to ∞ so that this match is ignored when searching for the next candidate.
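To make the search procedure concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); it is a re-implementation for illustration, not the code released by the authors, and the function names and frame-per-row array layout are our own choices:

```python
import numpy as np

def pearson_cost(Q, D):
    """Frame-pairwise cost between query Q (n x U) and document D (m x U).

    Implements Eq. (3), with ||x|| the sum of the components of x, and then
    maps r in [-1, 1] linearly to a cost in [0, 1] (r = 1 -> 0, r = -1 -> 1).
    """
    U = Q.shape[1]
    sq, sd = Q.sum(axis=1), D.sum(axis=1)                  # ||q_i||, ||d_j||
    sq2, sd2 = (Q ** 2).sum(axis=1), (D ** 2).sum(axis=1)  # ||q_i^2||, ||d_j^2||
    r = (U * (Q @ D.T) - np.outer(sq, sd)) / np.sqrt(
        np.outer(U * sq2 - sq ** 2, U * sd2 - sd ** 2))
    return (1.0 - r) / 2.0

def sdtw(Q, D):
    """Accumulated cost matrix M of the S-DTW recursion, Eqs. (1)-(2)."""
    c = pearson_cost(Q, D)
    n, m = c.shape
    M = np.empty((n, m))
    M[0, :] = c[0, :]              # a match may start at any document frame
    for i in range(1, n):
        M[i, 0] = c[i, 0] + M[i - 1, 0]
        for j in range(1, m):
            M[i, j] = c[i, j] + min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])
    return M

def candidate_matches(M, n_c):
    """Extract n_c candidate match end frames: repeatedly take the best end
    frame b* in the last row, then set M[n-1, b*] to infinity to ignore it."""
    M = M.copy()
    ends = []
    for _ in range(n_c):
        b = int(np.argmin(M[-1, :]))
        ends.append((b, float(M[-1, b])))
        M[-1, b] = np.inf
    return ends
```

A full system would also recover each warping path by backtracking through M, since the path and its length are needed for the unit selection of Section 2.3 and the score normalization of Section 2.4.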
2.3 Phoneme Unit Selection
A technique to select the most relevant phonemes among the phonetic units of the different decoders was used in the primary system. Given the best alignment path P(Q,D) of length K between a query and a matching document, the correlation and the cost at each step of the path can be decomposed so that there is a different term for each phonetic unit u:

  r(q_i, d_j, u) = \frac{U\,q_{i,u}\,d_{j,u} - \frac{1}{U}\|q_i\|\,\|d_j\|}{\sqrt{\left(U\|q_i^2\| - \|q_i\|^2\right)\left(U\|d_j^2\| - \|d_j\|^2\right)}}    (4)

so that the terms r(q_i, d_j, u) sum to r(q_i, d_j) of Eq. (3) over the U units. In this way, the cost accumulated by each phonetic unit along the best alignment path can be computed:

  R(P(Q,D), u) = \frac{1}{K} \sum_{k=1}^{K} c(q_{i_k}, d_{j_k}, u)    (5)

This value R(P(Q,D), u) can be considered the relevance of the phonetic unit u: the lower its contribution to the cost, the more relevant the unit. Hence, the phonetic units can be sorted from most to least relevant in order to keep the most relevant ones and discard those that increase the cost of the best alignment path. Since a single alignment path may not provide a good estimate of the relevance of the phonetic units, the relevances obtained for the different query-matching document pairs in the development set were accumulated in order to estimate them robustly. The number of phonetic units to keep was empirically selected for each system.
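The following sketch shows how Eqs. (4)-(5) can be turned into a selection procedure. It is our illustration, under one assumption the paper does not spell out: the constant term of the (1 - r)/2 cost mapping is split equally across the U units, so that the per-unit costs sum to the frame-pair cost.

```python
import numpy as np

def per_unit_cost(q, d):
    """Decompose the cost of a frame pair into one term per phonetic unit.

    The correlation term follows Eq. (4); splitting the constant of the
    (1 - r)/2 mapping equally across units is an assumption made here so
    that the per-unit costs sum to c(q, d).
    """
    U = q.shape[0]
    sq, sd = q.sum(), d.sum()
    den = np.sqrt((U * (q ** 2).sum() - sq ** 2) *
                  (U * (d ** 2).sum() - sd ** 2))
    r_u = (U * q * d - (sq * sd) / U) / den   # Eq. (4), one term per unit u
    return (1.0 / U - r_u) / 2.0

def path_relevance(path_pairs):
    """Relevance R(P(Q,D), u) of Eq. (5): the per-unit cost averaged over the
    K steps of a best alignment path, given as (q_frame, d_frame) pairs."""
    return sum(per_unit_cost(q, d) for q, d in path_pairs) / len(path_pairs)

def select_units(relevances, n_keep):
    """Accumulate the relevances of all development query/matching-document
    pairs and keep the n_keep units with the lowest accumulated cost."""
    total = np.sum(list(relevances), axis=0)
    return np.argsort(total)[:n_keep]
```

Here n_keep plays the role of the per-system number of retained units, which the paper reports was tuned empirically on the development set.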
2.4 Normalization and fusion
Score normalization and fusion were performed following [12]. First, the scores were normalized by the length of the warping path. A binary logistic regression was used for fusion, as described in [1].
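A sketch of this stage might look as follows. It is illustrative only: scikit-learn's LogisticRegression is used as a stand-in for the calibration and fusion recipes of [12] and [1], and all names are ours. The z-norm variant corresponds to the "late" systems discussed in Section 3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def path_length_norm(raw_score, path_length):
    """Normalize a raw S-DTW match score by the length of its warping path."""
    return raw_score / path_length

def z_norm(query_scores):
    """Per-query z-normalization used in the 'late' systems: standardize the
    scores of one query across all documents."""
    s = np.asarray(query_scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-12)

def fuse_systems(dev_scores, dev_labels, test_scores):
    """Fuse per-system scores with a binary logistic regression.

    dev_scores:  (trials x 11) normalized scores of the individual systems
    dev_labels:  (trials,) 1 for true query/document matches, 0 otherwise
    test_scores: (trials x 11) scores to fuse
    Returns log-odds scores for the test trials.
    """
    clf = LogisticRegression()
    clf.fit(dev_scores, dev_labels)
    return clf.decision_function(test_scores)
```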
3. RESULTS AND DISCUSSION
Table 2 shows the results obtained on QUESST 2015 data with the submitted systems. The primary system, which features phoneme unit selection, clearly outperforms the contrastive system, suggesting that the proposed technique obtains the expected improvement. It can also be observed that Dev and Eval results are very similar, showing the generalization capability of the systems. The late systems feature z-norm normalization of the query scores, which yields an improvement with respect to the original submissions, where only path-length normalization was applied. Table 3 compares the actCnxe obtained with and without applying the phoneme unit selection approach in several individual systems.

Table 2: Performance of the systems submitted by the GTM-UVigo team.

                           Dev                        Eval                       Dev-late                   Eval-late
  System       Metric      All    T1     T2     T3    All    T1     T2     T3    All    T1     T2     T3    All    T1     T2     T3
  Primary      actCnxe     0.917  0.881  0.943  0.918 0.919  0.864  0.959  0.913 0.875  0.841  0.890  0.882 0.871  0.815  0.916  0.866
               minCnxe     0.905  0.861  0.928  0.904 0.905  0.844  0.946  0.882 0.847  0.788  0.865  0.860 0.838  0.758  0.895  0.824
               lowerbound  0.627  0.562  0.672  0.631 0.629  0.532  0.702  0.627 0.593  0.526  0.633  0.606 0.592  0.490  0.657  0.601
  Contrastive  actCnxe     0.998  0.998  0.997  1.000 0.999  0.999  0.997  1.000 0.907  0.897  0.916  0.904 0.898  0.852  0.933  0.896
               minCnxe     0.918  0.874  0.942  0.898 0.923  0.865  0.953  0.907 0.864  0.811  0.877  0.880 0.852  0.785  0.900  0.843
               lowerbound  0.635  0.588  0.681  0.627 0.633  0.555  0.693  0.624 0.618  0.559  0.655  0.633 0.613  0.521  0.669  0.622

Table 3: actCnxe of some individual systems with and without applying phoneme unit selection.

                      With                         Without
  System       Global  T1     T2     T3      Global  T1     T2     T3
  CZdnn        0.889   0.829  0.902  0.906   0.915   0.867  0.927  0.922
  CZlstm       0.902   0.864  0.922  0.901   0.907   0.864  0.932  0.904
  CZtraps      0.902   0.840  0.924  0.910   0.931   0.883  0.945  0.938
  HUtraps      0.903   0.856  0.926  0.899   0.934   0.894  0.950  0.936
  RUtraps      0.895   0.844  0.918  0.894   0.925   0.886  0.944  0.922

Table 4 shows the indexing speed factor (ISF), the searching speed factor (SSF), the peak memory usage for indexing (PMU_I) and searching (PMU_S), and the processing load (PL), computed as described in [9]. These values were measured on 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz (12 cores/24 threads) with 128 GB of RAM. ISF and PMU_I are rather high because, in the dnn systems, an automatic speech recognition (ASR) system is first applied in order to obtain the input features to the DNN: the peak memory usage is large due to the memory requirements of the language model, and the large computation time is caused by the two recognition passes that are performed to estimate the transformation matrix used to obtain the fMLLR features that are the input to the DNN. In future work, the ASR step of the dnn systems will be replaced with a phonetic network in order to avoid these time- and memory-consuming steps.

Table 4: Required amount of processing resources.

  ISF    SSF    PMU_I  PMU_S  PL
  12.1   0.09   6      0.014  7.3

4. ACKNOWLEDGEMENTS
This research was funded by the Spanish Government ('SpeechTech4All' project, TEC2012-38939-C03-01), by the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the 'AtlantTIC' project CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA.

5. REFERENCES
[1] M. Akbacak, L. Burget, W. Wang, and J. van Hout. Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams. In Proceedings of ICASSP, pages 8267-8271, 2013.
[2] L. Docio-Fernandez, A. Cardenal-Lopez, and C. Garcia-Mateo. TC-STAR 2006 automatic speech recognition evaluation: The UVIGO system. In TC-STAR Workshop on Speech-to-Speech Translation, 2006.
[3] C. Garcia-Mateo, J. Dieguez-Tirado, L. Docio-Fernandez, and A. Cardenal-Lopez. Transcrigal: A bilingual system for automatic indexing of broadcast news. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 2004.
[4] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Proceedings of ICASSP, pages 2494-2498, 2014.
[5] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP, pages 6645-6649, 2013.
[6] M. Korvas, O. Plátek, O. Dušek, L. Žilka, and F. Jurčíček. Vystadial 2013 - Czech data, 2014. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
[7] M. Müller. Information Retrieval for Music and Motion. Springer-Verlag, 2007.
[8] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In Proceedings of ICASSP, pages 99-105, 2015.
[9] L. Rodriguez-Fuentes and M. Penagarikano. MediaEval 2013 spoken web search task: system performance measures. Technical report, Dept. of Electricity and Electronics, University of the Basque Country, 2013.
[10] L. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez. GTTS systems for the SWS task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, 2013.
[11] P. Schwarz. Phoneme Recognition based on Long Temporal Context. PhD thesis, Brno University of Technology, 2009.
[12] I. Szöke, L. Burget, F. Grézl, J. Černocký, and L. Ondel. Calibration and fusion of query-by-example systems - BUT SWS 2013. In Proceedings of ICASSP, pages 7899-7903, 2014.
[13] I. Szöke, L. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiong. Query by example search on speech at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, 2015.
[14] I. Szöke, M. Skácel, and L. Burget. BUT QUESST2014 system description. In Proceedings of the MediaEval 2014 Workshop, 2014.
[15] K. Veselý, A. Ghoshal, L. Burget, and D. Povey. Sequence-discriminative training of deep neural networks. In Proceedings of INTERSPEECH, pages 2345-2349, 2013.