<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paula Lopez-Otero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Docio-Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Garcia-Mateo</string-name>
          <email>carmen@gts.uvigo.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AtlantTIC Research Center E.E. Telecomunicación</institution>
          ,
          <addr-line>Campus Universitario S/N, 36310 Vigo</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we present the systems developed by the GTM-UVigo team for the query-by-example search on speech task (QUESST) at MediaEval 2015. The systems consist of a fusion of 11 dynamic time warping based systems that use phoneme posteriorgrams for speech representation; the primary system introduces a technique to select the most relevant phonetic units of each phoneme decoder, leading to an improvement of the search results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The query-by-example search on speech task (QUESST) aims at searching for audio content within audio content using an audio query [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], with a special focus on low-resource languages. This paper describes the systems developed by the GTM-UVigo team to address this task; the code of the GTM-UVigo systems will be released at https://github.com/gtm-uvigo/MediaEval_QUESST2015.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2. GTM-UVIGO SYSTEM DESCRIPTION</title>
      <p>The GTM-UVigo systems consist of the fusion of 11 individual systems that represent the documents and queries by means of phoneme posteriorgrams; subsequence dynamic time warping (S-DTW) is then used to perform the search. The primary system features a phonetic unit selection strategy, which is briefly described in this section.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Phoneme Posteriorgrams</title>
      <p>
        Three architectures were used to obtain phoneme posteriorgrams (a minimal sketch of the resulting representation follows this list):
        • lstm: a context-independent phone recognizer based on a long short-term memory (LSTM) neural network was trained using the KALDI toolkit [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A 2-layer LSTM was used; the input of the first layer consists of 40 log filter-bank energies augmented with 3 pitch-related features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and the output layer dimension was the number of context-independent phone units.
        • dnn: a deep neural network (DNN)-based context-dependent phone recognizer was trained using the KALDI toolkit following Karel Veselý's DNN training implementation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The network has 6 hidden layers, each with 2048 units, and it was trained on LDA-STC-fMLLR features obtained from auxiliary Gaussian mixture models (GMM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The dimension of the input layer was 440 and the output layer dimension was the number of context-dependent states.
        • traps: the phone decoder based on long temporal context developed at the Brno University of Technology (BUT) was used [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
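      <p>As a minimal illustrative sketch (not the authors' released code), a phoneme posteriorgram is simply a frames-by-units matrix of per-frame phoneme posterior probabilities; the snippet below builds one from hypothetical raw decoder activations via a softmax:</p>
      <preformat>
import numpy as np

def to_posteriorgram(activations):
    """Convert raw per-frame decoder outputs (frames x U) to posteriors."""
    shifted = activations - activations.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)  # each row sums to 1

# Example: 100 frames, U = 40 phonetic units.
post = to_posteriorgram(np.random.randn(100, 40))
assert np.allclose(post.sum(axis=1), 1.0)
      </preformat>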
      <p>Eleven models, summarized in Table 1, were trained using data in six languages: Galician (GA), Spanish (ES), English (EN), Czech (CZ), Hungarian (HU) and Russian (RU).</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Databases used for training.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Database</th><th>Duration (h)</th></tr>
          </thead>
          <tbody>
            <tr><td>Transcrigal [<xref ref-type="bibr" rid="ref3">3</xref>]</td><td>35</td></tr>
            <tr><td>TC-STAR [<xref ref-type="bibr" rid="ref2">2</xref>]</td><td>78</td></tr>
            <tr><td>LibriSpeech [<xref ref-type="bibr" rid="ref8">8</xref>]</td><td>100</td></tr>
            <tr><td>Vystadial 2013 [<xref ref-type="bibr" rid="ref6">6</xref>]</td><td>15</td></tr>
            <tr><td>Speech-Dat</td><td>n/a</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-2b">
      <title>2.2 Dynamic Time Warping Strategy</title>
      <p>
        The search for the spoken queries within the audio documents is performed by means of S-DTW [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. First, a cost matrix M ∈ ℝ<sup>n×m</sup> is defined, where the rows and the columns correspond to the frames of the query Q and the document D, respectively:
      </p>
      <disp-formula id="eq1">
        <tex-math><![CDATA[
          M_{i,j} = \begin{cases}
            c(q_i, d_j) & \text{if } i = 0 \\
            c(q_i, d_j) + M_{i-1,0} & \text{if } i > 0,\, j = 0 \\
            c(q_i, d_j) + M^{*}(i,j) & \text{otherwise}
          \end{cases} \tag{1}
        ]]></tex-math>
      </disp-formula>
      <p>where c(q<sub>i</sub>, d<sub>j</sub>) represents the cost between query vector q<sub>i</sub> and document vector d<sub>j</sub>, both of dimension U, and</p>
      <disp-formula id="eq2">
        <tex-math><![CDATA[
          M^{*}(i,j) = \min\left(M_{i-1,j},\, M_{i-1,j-1},\, M_{i,j-1}\right) \tag{2}
        ]]></tex-math>
      </disp-formula>
      <p>
        Pearson's correlation coefficient r is used as the distance metric [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]:
      </p>
      <disp-formula id="eq3">
        <tex-math><![CDATA[
          r(q_i, d_j) = \frac{U (q_i \cdot d_j) - \|q_i\| \|d_j\|}
          {\sqrt{\left(U \|q_i^2\| - \|q_i\|^2\right)\left(U \|d_j^2\| - \|d_j\|^2\right)}} \tag{3}
        ]]></tex-math>
      </disp-formula>
      <p>In order to use r as a cost function, it is linearly mapped to the range [0, 1], where 0 corresponds to correlations equal to 1 and 1 corresponds to correlations equal to -1.</p>
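      <p>A sketch of this frame-level cost follows, assuming that ||·|| denotes the sum of the (non-negative) posterior components and ||q<sub>i</sub><sup>2</sup>|| the sum of their squares; the helper and its name are illustrative, not the authors' code:</p>
      <preformat>
import numpy as np

def correlation_cost(q, d):
    """Cost of Eq. (3): Pearson r mapped linearly so r = 1 -> 0 and r = -1 -> 1."""
    U = q.shape[0]
    num = U * np.dot(q, d) - q.sum() * d.sum()
    den = np.sqrt((U * np.dot(q, q) - q.sum() ** 2) *
                  (U * np.dot(d, d) - d.sum() ** 2))
    r = num / den
    return (1.0 - r) / 2.0
      </preformat>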
      <p>In order to detect n<sub>c</sub> candidate matches of a query in a spoken document, every time a candidate match ending at frame b* is detected, M(n, b*) is set to ∞ so that this match is ignored when searching for the next one.</p>
    </sec>
    <sec id="sec-3">
      <title>2.3 Phoneme Unit Selection</title>
      <p>A technique to select the most relevant phonemes among the phonetic units of the different decoders was used in the primary system. Given the best alignment path P(Q, D) of length K between a query and a matching document, the correlation and the cost at each step of the path can be decomposed so that there is a different term for each phonetic unit u:</p>
      <disp-formula id="eq4">
        <tex-math><![CDATA[
          r(q_i, d_j, u) = \frac{U\, q_{i,u}\, d_{j,u} - \frac{1}{U} \|q_i\| \|d_j\|}
          {\sqrt{\left(U \|q_i^2\| - \|q_i\|^2\right)\left(U \|d_j^2\| - \|d_j\|^2\right)}} \tag{4}
        ]]></tex-math>
      </disp-formula>
      <p>In this way, the cost accumulated by each phonetic unit along the best alignment path can be computed:</p>
      <disp-formula id="eq5">
        <tex-math><![CDATA[
          R(P(Q,D), u) = \frac{1}{K} \sum_{k=1}^{K} c(q_{i_k}, d_{j_k}, u) \tag{5}
        ]]></tex-math>
      </disp-formula>
      <p>This value R(P(Q, D), u) can be regarded as the relevance of the phonetic unit u (the lower its contribution to the cost, the more relevant the phonetic unit). Hence, the phonetic units can be sorted from most to least relevant in order to keep the most relevant ones and discard those that increase the cost of the best alignment path.</p>
      <p>Using only one alignment path may not provide a good estimate of the relevance of the phonetic units; hence, the relevances obtained for the different query-matching document pairs in the development set were accumulated in order to obtain a robust estimate, as sketched below. The number of relevant phonetic units was selected empirically for each system.</p>
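      <p>A sketch of the relevance accumulation follows; per_unit_cost is a hypothetical helper returning the per-unit decomposition of the cost derived from Eq. (4), and the rest of the names are likewise illustrative:</p>
      <preformat>
import numpy as np

def accumulate_relevance(dev_pairs, per_unit_cost, num_units):
    """Accumulate R(P(Q,D), u) of Eq. (5) over development query/document pairs."""
    R = np.zeros(num_units)
    for query, doc, path in dev_pairs:  # path: list of (i_k, j_k) frame pairs
        K = len(path)
        for i, j in path:
            R += per_unit_cost(query[i], doc[j]) / K  # hypothetical helper
    return R

def select_units(R, num_keep):
    """Keep the units with the lowest accumulated cost (i.e., the most relevant)."""
    return np.argsort(R)[:num_keep]
      </preformat>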
    </sec>
    <sec id="sec-4">
      <title>2.4 Normalization and Fusion</title>
      <p>
        Score normalization and fusion were performed following [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. First, the scores were normalized by the length of the warping path. A binary logistic regression was then used for fusion, as described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>Table 2 shows the results obtained on QUESST 2015 data using the submitted systems. The table shows that the primary system, which features phoneme unit selection, clearly outperforms the contrastive system, suggesting that the proposed technique achieves the expected improvement. It can also be observed that Dev and Eval results are very similar, showing the generalization capability of the systems. The late systems feature z-norm normalization of the query scores, obtaining an improvement with respect to the original submissions, where only path-length normalization was applied. In Table 3, the actCnxe values obtained with and without applying the phoneme unit selection approach in some individual systems are compared.</p>
      <p>
        Table 4 shows the indexing speed factor (ISF), searching speed factor (SSF), peak memory usage for indexing (PMUI) and searching (PMUS), and processing load (PL), computed as described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (these values were measured on 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 12 cores/24 threads, 128 GB RAM). ISF and PMUI are rather high because, in the dnn systems, an automatic speech recognition (ASR) system is first applied in order to obtain the input features to the DNN; hence, the peak memory usage is large due to the memory requirements of the language model, and the large computation time is caused by the two recognition steps that are performed to estimate the transformation matrix used to obtain the fMLLR features that are the input to the DNN. In future work, the ASR step of the dnn systems will be replaced with a phonetic network in order to avoid these time- and memory-consuming steps.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4. ACKNOWLEDGEMENTS</title>
      <p>This research was funded by the Spanish Government (‘SpeechTech4All Project’ TEC2012-38939-C03-01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the ‘AtlantTIC Project’ CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under the project TACTICA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akbacak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>van Hout</surname>
          </string-name>
          .
          <article-title>Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams</article-title>
          .
          <source>In Proceedings of ICASSP</source>
          , pages
          <fpage>8267</fpage>
          -
          <lpage>8271</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Docio-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardenal-Lopez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Mateo</surname>
          </string-name>
          .
          <article-title>TC-STAR 2006 automatic speech recognition evaluation: The UVIGO system</article-title>
          .
          <source>In TC-STAR Workshop on Speech-to-Speech Translation</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Mateo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dieguez-Tirado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Docio-Fernandez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardenal-Lopez</surname>
          </string-name>
          .
          <article-title>Transcrigal: A bilingual system for automatic indexing of broadcast news</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ghahremani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>BabaAli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Riedhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trmal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <article-title>A pitch extraction algorithm tuned for automatic speech recognition</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Florence, Italy, May 4-9, pages
          <fpage>2494</fpage>
          -
          <lpage>2498</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Vancouver, BC, Canada, May 26-31, pages
          <fpage>6645</fpage>
          -
          <lpage>6649</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Korvas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Plátek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dušek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Žilka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Jurčíček</surname>
          </string-name>
          .
          <source>Vystadial 2013 - Czech data</source>
          ,
          <year>2014</year>
          . LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <source>Information Retrieval for Music and Motion</source>
          . Springer-Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <article-title>LibriSpeech: an ASR corpus based on public domain audio books</article-title>
          .
          <source>In Proceedings of ICASSP</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          .
          <article-title>MediaEval 2013 spoken web search task: system performance measures</article-title>
          .
          <source>Technical report</source>
          , Dept. Electricity and Electronics, University of the Basque Country,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Penagarikano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Diez</surname>
          </string-name>
          .
          <article-title>GTTS systems for the SWS task at MediaEval 2013</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Phoneme Recognition based on Long Temporal Context</article-title>
          .
          <source>PhD thesis</source>
          , Brno University of Technology,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grézl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Černocký</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ondel</surname>
          </string-name>
          .
          <article-title>Calibration and fusion of query-by-example systems - BUT SWS 2013</article-title>
          .
          <source>In Proceedings of ICASSP 2014</source>
          , pages
          <fpage>7899</fpage>
          -
          <lpage>7903</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lojka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <article-title>Query by example search on speech at MediaEval 2015</article-title>
          .
          <source>In Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szöke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skácel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          .
          <article-title>BUT QUESST2014 system description</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Veselý</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghoshal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          .
          <article-title>Sequence-discriminative training of deep neural networks</article-title>
          .
          <source>In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association</source>
          , Lyon, France, August 25-29
          , pages
          <fpage>2345</fpage>
          -
          <lpage>2349</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>