=Paper=
{{Paper
|id=Vol-1263/paper43
|storemode=property
|title=ELiRF at MediaEval 2014: Query by Example Search on Speech Task (QUESST)
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_43.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/CalvoGHAG14
}}
==ELiRF at MediaEval 2014: Query by Example Search on Speech Task (QUESST)==
Marcos Calvo, Mayte Giménez, Lluís-F. Hurtado, Emilio Sanchis, Jon A. Gómez
Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València
Camí de Vera s/n, 46020, València, Spain
{mcalvo, mgimenez, lhurtado, esanchis, jon}@dsic.upv.es

ABSTRACT
In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) has submitted to the MediaEval 2014 Query by Example Search on Speech task. All of them are based on a Subsequence Dynamic Time Warping algorithm and do not use any information from outside the task (zero-resource systems).

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. INTRODUCTION
In this paper, we present the systems that we have submitted to the MediaEval 2014 Query by Example Search on Speech task. The goal of the task is to identify the audio documents that match a spoken query. This match can be either exact (the same term in both the query and the document) or with variations [2].

The two systems we have submitted are based on a Subsequence Dynamic Time Warping (S-DTW) algorithm [1]. However, the systems differ in the way the audio files are preprocessed, which makes the feature vectors different for each system. It is worth noting that this approach does not use any external information, which makes our systems zero-resource systems. In the following sections, we explain how the feature vectors are computed for each system, the search algorithm, and the results obtained in this evaluation.

2. OVERVIEW OF THE SYSTEMS
Both of our systems follow the same philosophy. The first step is to preprocess all the audio files, both spoken documents and queries. This way we obtain a sequence of feature vectors as the representation of each audio file. Then, we take each possible (document, query) pair and run an S-DTW algorithm on it. This provides the bounds of a possible detection of the query within the document, and a score for this detection. Finally, a decision-making module establishes a threshold based on the scores of all the possible detections. This is necessary in order to report only the detections with the highest confidences.

3. PARAMETRIZATION
We used Mel Frequency Cepstral Coefficients (MFCC) plus energy as the features for representing speech samples (audio documents and queries) as sequences of feature vectors. Each feature vector contains 39 values: the energy and the first 12 MFCCs, plus the first and second time derivatives of these original 13 features.

We worked with two parametrizations. One is the standard feature extraction without applying any filter for noise reduction (labeled non-filtered), and the other consists of compensating the MFCCs to improve recognition on noisy speech (labeled Choi) [3].

In both cases, a preemphasis filter H(z) = 1 − 0.95 z⁻¹ was applied to the signal, the subsampling frequency was 100 Hz (i.e., one feature vector every 10 ms), and a Hamming window of 25 ms was applied before computing the FFT. The first difference is found when computing the Mel filterbank. In the case of the non-filtered parametrization, each filterbank output is computed as:

    m_j = log(Y_j)

where Y_j is the output magnitude of the j-th Mel filterbank. In the case of the approach proposed by Choi [3]:

    m_j = α_j log{1 + β_j max(Y_j − N̂_j, γ_j Y_j)}

where β_j = 0.001 and γ_j = 0.4 ∀j in our implementation, N̂_j is the noise magnitude estimate of the j-th Mel-filterbank output, and

    α_j = Y_j log(1 + N̂_j) / Σ_{k=1..M} Y_k log(1 + N̂_k)

where M is the total number of Mel filters. The α values are computed for each feature vector.

The next step of the parametrization is to apply the standard Discrete Cosine Transform to the Mel-filterbank outputs, from which the first 12 MFCCs are obtained. In the case of the filtered parametrization, however, a transformation of the energy and of each MFCC component is then performed based on the Cumulative Distribution Mapping (CDM) technique [3], which is based on the histogram equalization originally developed for image processing [4]. The last step of the parametrization is the computation of the first and second time derivatives.

It is worth noting that most queries contain leading and trailing silences. Therefore, we trimmed the sequence of feature vectors representing each query by means of a voice activity detection procedure, in order to help the search algorithm.

4. SEARCH ALGORITHM
Finding spoken queries within a set of audio files is a complex task, hence we used a Dynamic Programming (DP) technique to face this problem. In particular, we used S-DTW, which is a DP technique for comparing two sequences of objects. In our case, one of the sequences corresponds to the feature vectors of one of the audio documents, and the other one to those of the query. The S-DTW method thus finds multiple local alignments of the query within the audio documents, by allowing an alignment to start at any position of the audio document.

Equation 1 shows the generic formulation of S-DTW:

    M(i, j) = +∞                                                  if i < 0 or j < 0
            = 0                                                   if j = 0
            = min_{(x,y)∈S} M(i − x, j − y) + D(A(i), B(j))       if j ≥ 1        (1)

where M is the DP matrix; S is the set of allowed movements, represented as pairs (x, y) of horizontal and vertical increments; A(i) and B(j) are the objects at the i-th and j-th positions of their respective sequences; and D is a function that computes some distance or dissimilarity between two objects.

In our implementation, the set of allowed movements S is {(1, 2), (1, 1), (2, 1)}. This set of movements guarantees that the size of any detection will be between 0.5 and 2 times the size of the query.

5. EXPERIMENTS AND RESULTS
We performed several preliminary experiments in order to find the best configuration for our systems. We evaluated different distance functions and parametrizations. One of them was the Kullback-Leibler divergence on sequences of vectors of probabilities as the representation of the audio files. The probabilities were obtained from a GMM estimated by means of the EM algorithm over all the audio documents in the corpus, and different numbers of GMM components were tried. We also tried the cosine distance with the Mel-filterbank parametrization. However, we finally used the cosine distance with the MFCCs, since it provided the best results on the development set.

For this MediaEval 2014 Query by Example Search on Speech evaluation, we submitted one run for each of the two systems described above. The results we obtained are shown in Tables 1 and 2. The measure to be optimized for this evaluation was the cross-entropy score (Cnxe). However, other measures, such as the Maximum and the Actual Term Weighted Values (MTWV and ATWV, respectively), were considered as secondary metrics, as they are very widely used in this kind of task.

Table 1: Results obtained for the development set.
    System        Cnxe    MinCnxe  MTWV    ATWV
    Choi          5.8940  0.9595   0.0692  0.0692
    non-filtered  6.0905  0.9571   0.0767  0.0768

Table 2: Results obtained for the test set.
    System        Cnxe    MinCnxe  MTWV    ATWV
    Choi          5.5369  0.9667   0.0558  0.0557
    non-filtered  5.9502  0.9648   0.0587  0.0587

The results shown in both tables reveal a poor performance of our systems (a high value of Cnxe). Nevertheless, given the difference in the sources of the audio documents and the audio queries, we expected a higher accuracy from the system that uses the Choi parametrization.

We ran our own multi-threaded implementation of the S-DTW algorithm on a standard PC with an i7 core and 16 GB of RAM, using 8 threads on a Linux operating system. At the parametrization stage, we achieved an indexing speed factor of 1.26·10⁻², and our memory peak was around 0.25 GB. At the search stage, our searching speed factor was 2.34·10⁻³, and we used around 16 GB of memory. Thus, our processing load for both systems was 3.40·10⁻².

6. CONCLUSIONS
In this paper, we have presented the systems we have submitted to the MediaEval 2014 Query by Example Search on Speech evaluation, as well as the results obtained. This was a very challenging task in which both exact and varied occurrences of the queries within the documents had to be found. Despite our preliminary attempts, our approach has proven not to be suitable for this task. One of the reasons is the nature of the S-DTW algorithm: it makes it impossible to find occurrences of queries where a reordering of words is needed. However, we would like to point out that significant improvements were observed when trimmed queries were used on the development set.

As future work, we would like to improve our system in order to use it in tasks like QUESST, where swaps in the order of the components of a query can happen. Facing this kind of word reordering would be possible if a higher level of knowledge were used, e.g., sequences of phonemes instead of only sequences of acoustic feature vectors. It is not possible to use words in a task where distinct languages may appear and no source other than the audio files is provided.

7. ACKNOWLEDGMENTS
Work funded by the Spanish Government and the E.U. under contract TIN2011-28169-C05 and FPU Grant AP2010-4193.

8. REFERENCES
[1] X. Anguera and M. Ferrarons. Memory efficient subsequence DTW for query-by-example spoken term detection. In 2013 IEEE International Conference on Multimedia and Expo. IEEE, 2013.
[2] X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, and F. Metze. Query by Example Search on Speech at MediaEval 2014. In MediaEval 2014 Workshop, 16-17 October 2014.
[3] E. H. C. Choi. On compensating the mel-frequency cepstral coefficients for noisy speech recognition. In Proceedings of the 29th Australasian Computer Science Conference - Volume 48, ACSC '06, pages 49–54, Darlinghurst, Australia, 2006. Australian Computer Society, Inc.
[4] J. C. Russ and R. P. Woods. The image processing handbook. Journal of Computer Assisted Tomography, 19(6):979–981, 1995.
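The front end of the parametrization described in Section 3 (preemphasis with H(z) = 1 − 0.95 z⁻¹, then 25 ms Hamming windows every 10 ms before the FFT) can be sketched as follows. This is a generic NumPy illustration of those settings, not the authors' code; the function and argument names are ours.

```python
import numpy as np

def preemphasize_and_frame(signal, sample_rate, coef=0.95,
                           win_ms=25.0, hop_ms=10.0):
    """Preemphasis H(z) = 1 - coef * z^-1 followed by Hamming-windowed
    framing (25 ms window, 10 ms hop, i.e. one frame every 10 ms)."""
    signal = np.asarray(signal, dtype=float)
    # y[n] = x[n] - coef * x[n-1]; the first sample is kept as-is
    emph = np.append(signal[0], signal[1:] - coef * signal[:-1])
    win = int(round(sample_rate * win_ms / 1000.0))   # samples per window
    hop = int(round(sample_rate * hop_ms / 1000.0))   # samples per hop
    n_frames = 1 + max(0, (len(emph) - win) // hop)
    frames = np.stack([emph[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)   # windowed frames, ready for the FFT step
```

Each row of the returned matrix is one windowed frame; applying an FFT and a Mel filterbank to each row yields the Y_j magnitudes used in the formulas of Section 3.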
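The Choi filterbank compensation of Section 3 can be written down directly from its two formulas (the weights α_j and the compensated outputs m_j). A minimal sketch follows; how N̂_j is estimated is not specified in this paper, so the caller must supply it (averaging leading silence frames is one common choice, but that is our assumption, not the authors' method).

```python
import numpy as np

def choi_compensated_logfbank(Y, N_hat, beta=0.001, gamma=0.4):
    """Noise-compensated log Mel-filterbank outputs (Choi [3], Section 3).

    Y     : magnitudes of the M Mel-filterbank outputs for one frame
    N_hat : per-filter noise magnitude estimates N̂_j (supplied by caller)
    """
    Y = np.asarray(Y, dtype=float)
    N_hat = np.asarray(N_hat, dtype=float)
    # alpha_j = Y_j log(1 + N_j) / sum_k Y_k log(1 + N_k),
    # recomputed for every feature vector
    w = Y * np.log1p(N_hat)
    alpha = w / w.sum()
    # m_j = alpha_j * log(1 + beta * max(Y_j - N_j, gamma * Y_j))
    return alpha * np.log1p(beta * np.maximum(Y - N_hat, gamma * Y))
```

The γ Y_j floor inside max(·) keeps the argument positive when the noise estimate exceeds the observed magnitude, so the logarithm is always well defined.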
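The S-DTW recursion of Equation 1, with the movement set S = {(1, 2), (1, 1), (2, 1)} and a cosine dissimilarity as D, can be sketched as below. This is a minimal single-threaded illustration of the formulation, not the authors' multi-threaded implementation, and it only returns the score and end frame of the best detection (recovering the start bound would additionally require backtracking).

```python
import numpy as np

def cosine_dissimilarity(a, b):
    """D(a, b) = 1 - cosine similarity between two feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def sdtw(doc, query, moves=((1, 2), (1, 1), (2, 1))):
    """Subsequence DTW: best local alignment of `query` inside `doc`.

    Rows of M index document frames (i), columns index query frames (j).
    M[i, 0] = 0 implements the j = 0 case of Eq. 1, letting a detection
    start at any document frame; out-of-range predecessors act as +inf."""
    n, m = len(doc), len(query)
    M = np.full((n, m + 1), np.inf)
    M[:, 0] = 0.0                          # free start along the document
    for i in range(n):
        for j in range(1, m + 1):
            d = cosine_dissimilarity(doc[i], query[j - 1])
            for x, y in moves:             # (x, y) = (document, query) steps
                if i - x >= 0 and j - y >= 0:
                    M[i, j] = min(M[i, j], M[i - x, j - y] + d)
    end = int(np.argmin(M[:, m]))          # document frame where the match ends
    return M[end, m], end
```

Because every move advances the query index by 1 or 2 while the document index advances by 1 or 2, any detection spans between 0.5 and 2 times the query length, as stated in Section 4.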