     ELiRF at MediaEval 2014: Query by Example Search on
                    Speech Task (QUESST)

          Marcos Calvo, Mayte Giménez, Lluís-F. Hurtado, Emilio Sanchis, Jon A. Gómez
                                   Departament de Sistemes Informàtics i Computació
                                          Universitat Politècnica de València
                                       Camí de Vera s/n, 46020, València, Spain
                         {mcalvo, mgimenez, lhurtado, esanchis, jon}@dsic.upv.es

ABSTRACT
In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) has submitted to the MediaEval 2014 Query by Example Search on Speech task. All of them are based on a Subsequence Dynamic Time Warping algorithm and do not use any information from outside the task (zero-resources systems).

1.   INTRODUCTION
   In this paper, we present the systems that we have submitted to the MediaEval 2014 Query by Example Search on Speech task. The goal of the task is to identify the audio documents that match a spoken query. This match can be either exact (the same term in both the query and the document) or with variations [2].
   The two systems we have submitted are based on a Subsequence Dynamic Time Warping (S-DTW) algorithm [1]. However, the systems differ in the way the audio files are preprocessed, which makes the feature vectors different for each system. It is worth noting that this approach does not use any external information, which makes our systems zero-resources systems. In the following sections, we explain how the feature vectors are computed for each system, the search algorithm, and the results obtained in this evaluation.

2.   OVERVIEW OF THE SYSTEMS
   Both of our systems followed the same philosophy. The first step was preprocessing all the audio files, both spoken documents and queries. This way we obtained a sequence of feature vectors as a representation of each audio file. Then, we took each possible pair (document, query) and ran an S-DTW algorithm on it. This provided the bounds of a possible detection of the query within the document, and a score for this detection. Finally, a decision-making module established a threshold based on the scores of all the possible detections. This was necessary to provide the detections with the highest confidences.

3.   PARAMETRIZATION
   We used Mel Frequency Cepstral Coefficients (MFCC) plus energy as the features for representing speech samples (audio documents and queries) as sequences of feature vectors. Each feature vector contains 39 values: the energy and the first 12 MFCCs, plus the first and second time derivatives of these original 13 features.
   We worked with two parametrizations. One is the standard feature extraction without applying any filter for noise reduction (labeled as non-filtered), and the other consists of compensating the MFCC to improve recognition on noisy speech (labeled as Choi) [3].
   In both cases, a preemphasis filter with H(z) = 1 − 0.95z^-1 was applied to the signal, the subsampling frequency was 100 Hz (i.e., one feature vector every 10 ms), and a Hamming window of 25 ms was applied before computing the FFT. The first difference appears when computing the Mel-filterbank. In the case of the non-filtered parametrization, each filterbank output is computed as follows:

        m_j = log(Y_j)

where Y_j is the output magnitude of the j-th Mel-filter. In the case of the approach proposed by Choi [3]:

        m_j = α_j log{1 + β_j max(Y_j − N̂_j, γ_j Y_j)}

where β_j = 0.001 and γ_j = 0.4 ∀j in our implementation, N̂_j is the noise magnitude estimation of the j-th Mel-filterbank output, and

        α_j = log(1 + Y_j / N̂_j) / Σ_{k=1}^{M} log(1 + Y_k / N̂_k)

where M is the total number of Mel-filters. The α values are computed for each feature vector.
   The next step in the parametrization is to apply the standard Discrete Cosine Transform to the Mel-filterbank outputs, from which the first 12 MFCCs are obtained. In the case of the filtered parametrization, however, a transformation of the energy and each MFCC component is performed based on the Cumulative Distribution Mapping (CDM) technique [3], which relies on the histogram equalization originally developed for image processing [4]. The last step of the parametrization was the computation of the first and second time derivatives.
   It is worth noting that most of the queries contain leading and trailing silences. Therefore, we trimmed the sequence of feature vectors representing each query by means of a voice activity detection procedure, in order to help the search algorithm.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
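To make the two filterbank variants concrete, the following is a minimal sketch of the per-frame computation, assuming the M filterbank magnitudes Y_j and a noise estimate N̂_j per filter are already available (the noise estimation procedure itself is not described here, so `N_hat` is taken as given; function names are illustrative, not from the original implementation).

```python
import numpy as np

def filterbank_log(Y):
    """Non-filtered parametrization: plain log of the filterbank magnitudes."""
    return np.log(np.asarray(Y, dtype=float))

def choi_compensated_log(Y, N_hat, beta=0.001, gamma=0.4):
    """Choi-style noise-compensated log filterbank outputs for one frame.

    Y     : magnitudes of the M Mel-filterbank outputs
    N_hat : estimated noise magnitude per filter (assumed given)
    Implements m_j = alpha_j * log{1 + beta * max(Y_j - N_hat_j, gamma * Y_j)}.
    """
    Y = np.asarray(Y, dtype=float)
    N_hat = np.asarray(N_hat, dtype=float)
    # Per-filter weights alpha_j = log(1 + Y_j/N_hat_j), normalized over all M filters
    w = np.log1p(Y / N_hat)
    alpha = w / w.sum()
    # Spectral-subtraction-style floor: max(Y_j - N_hat_j, gamma * Y_j)
    floored = np.maximum(Y - N_hat, gamma * Y)
    return alpha * np.log1p(beta * floored)
```

The DCT to obtain the 12 MFCCs, the CDM transformation, and the time derivatives would then be applied on top of these log filterbank outputs.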
4.   SEARCH ALGORITHM
   Finding spoken queries within a set of audio documents is a complex task, hence we used a Dynamic Programming (DP) technique to face this problem. In particular, we used S-DTW, a DP technique for comparing two sequences of objects. In our case, one of the sequences corresponds to the feature vectors of one of the audio documents, and the other one belongs to the query. The S-DTW method finds multiple local alignments of the query within the audio documents, by allowing the alignment to start at any position of the audio document.
   Equation 1 shows the generic formulation of S-DTW:

        M(i, j) = +∞                                              if i < 0
                  +∞                                              if j < 0
                  0                                               if j = 0
                  min_{(x,y)∈S} { M(i−x, j−y) + D(A(i), B(j)) }   if j ≥ 1    (1)

where M is the DP matrix; S is the set of allowed movements, represented as pairs (x, y) of horizontal and vertical increments; A(i) and B(j) are the objects at the i-th and j-th positions of their respective sequences; and D is a function that computes some distance or dissimilarity between two objects.
   In our implementation, the set of allowed movements S is {(1, 2), (1, 1), (2, 1)}. This set of movements guarantees that the size of any detection will be between 0.5 and 2 times the size of the query.

5.   EXPERIMENTS AND RESULTS
   We performed several preliminary experiments in order to find the best configuration for our systems.
   We evaluated different distance functions and parametrizations. One of them was the Kullback-Leibler divergence on sequences of vectors of probabilities as the representation of the audio files. The probabilities were obtained from a GMM estimated by means of the EM algorithm with all the audio documents in the corpus. Different numbers of components in the GMM were tried. We also tried the cosine distance with the Mel-filterbank parametrization. However, we finally used the cosine distance with the MFCC, since it provided the best results for the development set.
   For the MediaEval 2014 Query by Example Search on Speech Evaluation, we submitted one run for each of the two systems described above. The results we obtained are shown in Tables 1 and 2. The measure to be optimized for this Evaluation was the cross-entropy score (Cnxe). However, other measures such as the Mean and the Actual Term Weighted Values (MTWV and ATWV, respectively) were considered as secondary metrics, as they are widely used in this kind of task.

        Table 1: Results obtained for the development set.
        System        Cnxe     MinCnxe   MTWV     ATWV
        Choi          5.8940   0.9595    0.0692   0.0692
        non-filtered  6.0905   0.9571    0.0767   0.0768

        Table 2: Results obtained for the test set.
        System        Cnxe     MinCnxe   MTWV     ATWV
        Choi          5.5369   0.9667    0.0558   0.0557
        non-filtered  5.9502   0.9648    0.0587   0.0587

   The results shown in both tables reveal a poor performance of our systems (a high value of Cnxe). Nevertheless, given the difference in the sources of the audio documents and the audio queries, we expected a higher accuracy from the system that uses the Choi parametrization.
   We ran our own multi-threaded implementation of the S-DTW algorithm on a standard PC with an i7 core and 16 GB of RAM, using 8 threads on a Linux operating system. At the parametrization stage, we achieved an indexing speed factor of 1.26·10^-2, and our memory peak was around 0.25 GB. At the search stage, our searching speed factor was 2.34·10^-3 and we used around 16 GB of memory. Thus, our processing load for both systems was 3.40·10^-2.

6.   CONCLUSIONS
   In this paper, we have presented the systems we have submitted to the MediaEval 2014 Query by Example Search on Speech Evaluation, as well as the results obtained. This was a very challenging task in which both exact and varied occurrences of the queries within the documents had to be found. Despite our preliminary experiments, our approach has proven not to be suitable for this task. One of the reasons is the nature of the S-DTW algorithm: it makes it impossible to find occurrences of queries where a reordering of words is needed. However, we would like to point out that significant improvements were observed when trimmed queries were used for the development set.
   As future work, we would like to improve our system in order to use it in tasks like QUESST, where swaps in the order of the components of a query can happen. Facing this kind of word reordering would be possible if a higher level of knowledge were used, e.g. sequences of phonemes instead of only sequences of acoustic feature vectors. It is not possible to use words in a task where distinct languages may appear and no source other than the audio files is provided.

7.   ACKNOWLEDGMENTS
   Work funded by the Spanish Government and the E.U. under contract TIN2011-28169-C05 and FPU Grant AP2010-4193.

8.   REFERENCES
[1] X. Anguera and M. Ferrarons. Memory efficient subsequence DTW for Query-by-Example spoken term detection. In 2013 IEEE International Conference on Multimedia and Expo. IEEE, 2013.
[2] X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, and F. Metze. Query by Example Search on Speech at Mediaeval 2014. In MediaEval 2014 Workshop, 16-17 October 2014.
[3] E. H. C. Choi. On compensating the mel-frequency cepstral coefficients for noisy speech recognition. In Proceedings of the 29th Australasian Computer Science Conference - Volume 48, ACSC '06, pages 49-54, Darlinghurst, Australia, 2006. Australian Computer Society, Inc.
[4] J. C. Russ and R. P. Woods. The image processing handbook. Journal of Computer Assisted Tomography, 19(6):979-981, 1995.
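As an illustrative sketch, the S-DTW recurrence of Eq. 1, with the movement set S = {(1, 2), (1, 1), (2, 1)} and the cosine dissimilarity used in the experiments, could be implemented as follows. This is a minimal example under those assumptions, not the submitted multi-threaded implementation: score normalization, thresholding, and backtracking of the detection bounds are omitted.

```python
import numpy as np

def cosine_distance(a, b):
    """Dissimilarity D between two feature vectors: 1 - cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def s_dtw(document, query, moves=((1, 2), (1, 1), (2, 1))):
    """Subsequence DTW (Eq. 1): align `query` anywhere inside `document`.

    document : sequence of feature vectors (length n)
    query    : sequence of feature vectors (length m)
    Returns the DP matrix M and the best end position in the document
    (1-based). M[i, 0] = 0 lets a match start at any document position;
    cells outside the matrix behave as +inf, matching the i < 0 and
    j < 0 cases of the recurrence.
    """
    n, m = len(document), len(query)
    M = np.full((n + 1, m + 1), np.inf)
    M[:, 0] = 0.0  # j = 0: a detection may start for free anywhere
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(document[i - 1], query[j - 1])
            # min over allowed movements (x, y), skipping out-of-range cells
            best = min(M[i - x, j - y] for x, y in moves
                       if i - x >= 0 and j - y >= 0)
            M[i, j] = best + d
    end = int(np.argmin(M[1:, m])) + 1  # cheapest full-query alignment end
    return M, end
```

Because the movement set contains (1, 2) and (2, 1), any path consuming the whole query spans between m/2 and 2m document frames, which is the 0.5x-2x detection-length bound stated in Section 4.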