=Paper=
{{Paper
|id=Vol-1263/paper43
|storemode=property
|title=ELiRF at MediaEval 2014: Query by Example Search on Speech Task (QUESST)
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_43.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/CalvoGHAG14
}}
==ELiRF at MediaEval 2014: Query by Example Search on Speech Task (QUESST)==
Marcos Calvo, Mayte Giménez, Lluís-F. Hurtado, Emilio Sanchis, Jon A. Gómez
Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València
Camí de Vera s/n, 46020, València, Spain
{mcalvo, mgimenez, lhurtado, esanchis, jon}@dsic.upv.es

ABSTRACT
In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) has submitted to the MediaEval 2014 Query by Example Search on Speech task. All of them are based on a Subsequence Dynamic Time Warping algorithm and do not use any information from outside the task (zero-resource systems).

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. INTRODUCTION
In this paper, we present the systems that we have submitted to the MediaEval 2014 Query by Example Search on Speech task. The goal of the task is to identify the audio documents that match a spoken query. This match can be either exact (the same term in both the query and the document) or with variations [2].

The two systems we have submitted are based on a Subsequence Dynamic Time Warping (S-DTW) algorithm [1]. However, the systems differ in the way the audio files are preprocessed, which makes the feature vectors different for each system. It is worth noting that this approach does not use any external information, which makes our systems zero-resource systems. In the following sections, we explain how the feature vectors are computed for each system, the search algorithm, and the results obtained in this evaluation.

2. OVERVIEW OF THE SYSTEMS
Both of our systems follow the same philosophy. The first step is to preprocess all the audio files, both spoken documents and queries. This way we obtain a sequence of feature vectors as the representation of each audio file. Then, we take each possible (document, query) pair and run an S-DTW algorithm on it. This provides the bounds of a possible detection of the query within the document, and a score for this detection. Finally, a decision-making module establishes a threshold based on the scores of all the possible detections. This is necessary in order to report only the detections with the highest confidences.

3. PARAMETRIZATION
We used Mel Frequency Cepstral Coefficients (MFCC) plus energy as the features for representing speech samples (audio documents and queries) as sequences of feature vectors. Each feature vector contains 39 values: the energy and the first 12 MFCCs, plus the first and second time derivatives of these original 13 features.

We worked with two parametrizations. One is the standard feature extraction without applying any filter for noise reduction (labeled non-filtered), and the other consists of compensating the MFCCs to improve recognition on noisy speech (labeled Choi) [3].

In both cases, a preemphasis filter H(z) = 1 − 0.95 z⁻¹ was applied to the signal, the subsampling frequency was 100 Hz (i.e., one feature vector every 10 ms), and a Hamming window of 25 ms was applied before computing the FFT. The first difference is found when computing the Mel filterbank. In the case of the non-filtered parametrization, each filterbank output is computed as:

    m_j = log(Y_j)

where Y_j is the output magnitude of the j-th Mel filterbank. In the case of the approach proposed by Choi [3]:

    m_j = α_j log{1 + β_j max(Y_j − N̂_j, γ_j Y_j)}

where β_j = 0.001 and γ_j = 0.4 ∀j in our implementation, N̂_j is the noise magnitude estimate of the j-th Mel-filterbank output, and

    α_j = Y_j log(1 + N̂_j) / Σ_{k=1..M} Y_k log(1 + N̂_k)

where M is the total number of Mel filters. The α values are computed for each feature vector.

The next step of the parametrization is to apply the standard Discrete Cosine Transform to the Mel-filterbank outputs, from which the first 12 MFCCs are obtained. In the case of the filtered parametrization, however, a transformation of the energy and of each MFCC component is then performed based on the Cumulative Distribution Mapping (CDM) technique [3], which is based on the histogram equalization originally developed for image processing [4]. The last step of the parametrization is the computation of the first and second time derivatives.

It is worth noting that most queries contain leading and trailing silences. Therefore, we trimmed the sequence of feature vectors representing each query by means of a voice activity detection procedure, in order to help the search algorithm.

4. SEARCH ALGORITHM
Finding spoken queries within a set of audio files is a complex task, hence we used a Dynamic Programming (DP) technique to face this problem. In particular, we used S-DTW, which is a DP technique for comparing two sequences of objects. In our case, one of the sequences corresponds to the feature vectors of one of the audio documents, and the other one to those of the query. The S-DTW method thus finds multiple local alignments of the query within the audio documents, by allowing an alignment to start at any position of the audio document.

Equation 1 shows the generic formulation of S-DTW:

    M(i, j) = +∞                                                  if i < 0 or j < 0
            = 0                                                   if j = 0
            = min_{(x,y)∈S} M(i − x, j − y) + D(A(i), B(j))       if j ≥ 1        (1)

where M is the DP matrix; S is the set of allowed movements, represented as pairs (x, y) of horizontal and vertical increments; A(i) and B(j) are the objects at the i-th and j-th positions of their respective sequences; and D is a function that computes some distance or dissimilarity between two objects.

In our implementation, the set of allowed movements S is {(1, 2), (1, 1), (2, 1)}. This set of movements guarantees that the size of any detection will be between 0.5 and 2 times the size of the query.

5. EXPERIMENTS AND RESULTS
We performed several preliminary experiments in order to find the best configuration for our systems. We evaluated different distance functions and parametrizations. One of them was the Kullback-Leibler divergence on sequences of vectors of probabilities as the representation of the audio files. The probabilities were obtained from a GMM estimated by means of the EM algorithm over all the audio documents in the corpus, and different numbers of GMM components were tried. We also tried the cosine distance with the Mel-filterbank parametrization. However, we finally used the cosine distance with the MFCCs, since it provided the best results on the development set.

For this MediaEval 2014 Query by Example Search on Speech evaluation, we submitted one run for each of the two systems described above. The results we obtained are shown in Tables 1 and 2. The measure to be optimized for this evaluation was the cross-entropy score (Cnxe). However, other measures, such as the Maximum and the Actual Term Weighted Values (MTWV and ATWV, respectively), were considered as secondary metrics, as they are very widely used in this kind of task.

Table 1: Results obtained for the development set.
    System        Cnxe    MinCnxe  MTWV    ATWV
    Choi          5.8940  0.9595   0.0692  0.0692
    non-filtered  6.0905  0.9571   0.0767  0.0768

Table 2: Results obtained for the test set.
    System        Cnxe    MinCnxe  MTWV    ATWV
    Choi          5.5369  0.9667   0.0558  0.0557
    non-filtered  5.9502  0.9648   0.0587  0.0587

The results shown in both tables reveal a poor performance of our systems (a high value of Cnxe). Nevertheless, given the difference in the sources of the audio documents and the audio queries, we expected a higher accuracy from the system that uses the Choi parametrization.

We ran our own multi-threaded implementation of the S-DTW algorithm on a standard PC with an i7 core and 16 GB of RAM, using 8 threads on a Linux operating system. At the parametrization stage, we achieved an indexing speed factor of 1.26·10⁻², and our memory peak was around 0.25 GB. At the search stage, our searching speed factor was 2.34·10⁻³, and we used around 16 GB of memory. Thus, our processing load for both systems was 3.40·10⁻².

6. CONCLUSIONS
In this paper, we have presented the systems we have submitted to the MediaEval 2014 Query by Example Search on Speech evaluation, as well as the results obtained. This was a very challenging task in which both exact and varied occurrences of the queries within the documents had to be found. Despite our preliminary attempts, our approach has proven not to be suitable for this task. One of the reasons is the nature of the S-DTW algorithm: it makes it impossible to find occurrences of queries where a reordering of words is needed. However, we would like to point out that significant improvements were observed when trimmed queries were used on the development set.

As future work, we would like to improve our system in order to use it in tasks like QUESST, where swaps in the order of the components of a query can happen. Facing this kind of word reordering would be possible if a higher level of knowledge were used, e.g., sequences of phonemes instead of only sequences of acoustic feature vectors. It is not possible to use words in a task where distinct languages may appear and no source other than the audio files is provided.

7. ACKNOWLEDGMENTS
Work funded by the Spanish Government and the E.U. under contract TIN2011-28169-C05 and FPU Grant AP2010-4193.

8. REFERENCES
[1] X. Anguera and M. Ferrarons. Memory efficient subsequence DTW for query-by-example spoken term detection. In 2013 IEEE International Conference on Multimedia and Expo. IEEE, 2013.
[2] X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, and F. Metze. Query by Example Search on Speech at MediaEval 2014. In MediaEval 2014 Workshop, 16-17 October 2014.
[3] E. H. C. Choi. On compensating the mel-frequency cepstral coefficients for noisy speech recognition. In Proceedings of the 29th Australasian Computer Science Conference - Volume 48, ACSC '06, pages 49–54, Darlinghurst, Australia, 2006. Australian Computer Society, Inc.
[4] J. C. Russ and R. P. Woods. The image processing handbook. Journal of Computer Assisted Tomography, 19(6):979–981, 1995.
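The front end of the parametrization described in Section 3 (preemphasis with H(z) = 1 − 0.95 z⁻¹, then 25 ms Hamming windows every 10 ms before the FFT) can be sketched as follows. This is a generic NumPy illustration of those settings, not the authors' code; the function and argument names are ours.

```python
import numpy as np

def preemphasize_and_frame(signal, sample_rate, coef=0.95,
                           win_ms=25.0, hop_ms=10.0):
    """Preemphasis H(z) = 1 - coef * z^-1 followed by Hamming-windowed
    framing (25 ms window, 10 ms hop, i.e. one frame every 10 ms)."""
    signal = np.asarray(signal, dtype=float)
    # y[n] = x[n] - coef * x[n-1]; the first sample is kept as-is
    emph = np.append(signal[0], signal[1:] - coef * signal[:-1])
    win = int(round(sample_rate * win_ms / 1000.0))   # samples per window
    hop = int(round(sample_rate * hop_ms / 1000.0))   # samples per hop
    n_frames = 1 + max(0, (len(emph) - win) // hop)
    frames = np.stack([emph[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)   # windowed frames, ready for the FFT step
```

Each row of the returned matrix is one windowed frame; applying an FFT and a Mel filterbank to each row yields the Y_j magnitudes used in the formulas of Section 3.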
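The Choi filterbank compensation of Section 3 can be written down directly from its two formulas (the weights α_j and the compensated outputs m_j). A minimal sketch follows; how N̂_j is estimated is not specified in this paper, so the caller must supply it (averaging leading silence frames is one common choice, but that is our assumption, not the authors' method).

```python
import numpy as np

def choi_compensated_logfbank(Y, N_hat, beta=0.001, gamma=0.4):
    """Noise-compensated log Mel-filterbank outputs (Choi [3], Section 3).

    Y     : magnitudes of the M Mel-filterbank outputs for one frame
    N_hat : per-filter noise magnitude estimates N̂_j (supplied by caller)
    """
    Y = np.asarray(Y, dtype=float)
    N_hat = np.asarray(N_hat, dtype=float)
    # alpha_j = Y_j log(1 + N_j) / sum_k Y_k log(1 + N_k),
    # recomputed for every feature vector
    w = Y * np.log1p(N_hat)
    alpha = w / w.sum()
    # m_j = alpha_j * log(1 + beta * max(Y_j - N_j, gamma * Y_j))
    return alpha * np.log1p(beta * np.maximum(Y - N_hat, gamma * Y))
```

The γ Y_j floor inside max(·) keeps the argument positive when the noise estimate exceeds the observed magnitude, so the logarithm is always well defined.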
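The S-DTW recursion of Equation 1, with the movement set S = {(1, 2), (1, 1), (2, 1)} and a cosine dissimilarity as D, can be sketched as below. This is a minimal single-threaded illustration of the formulation, not the authors' multi-threaded implementation, and it only returns the score and end frame of the best detection (recovering the start bound would additionally require backtracking).

```python
import numpy as np

def cosine_dissimilarity(a, b):
    """D(a, b) = 1 - cosine similarity between two feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def sdtw(doc, query, moves=((1, 2), (1, 1), (2, 1))):
    """Subsequence DTW: best local alignment of `query` inside `doc`.

    Rows of M index document frames (i), columns index query frames (j).
    M[i, 0] = 0 implements the j = 0 case of Eq. 1, letting a detection
    start at any document frame; out-of-range predecessors act as +inf."""
    n, m = len(doc), len(query)
    M = np.full((n, m + 1), np.inf)
    M[:, 0] = 0.0                          # free start along the document
    for i in range(n):
        for j in range(1, m + 1):
            d = cosine_dissimilarity(doc[i], query[j - 1])
            for x, y in moves:             # (x, y) = (document, query) steps
                if i - x >= 0 and j - y >= 0:
                    M[i, j] = min(M[i, j], M[i - x, j - y] + d)
    end = int(np.argmin(M[:, m]))          # document frame where the match ends
    return M[end, m], end
```

Because every move advances the query index by 1 or 2 while the document index advances by 1 or 2, any detection spans between 0.5 and 2 times the query length, as stated in Section 4.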