<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF at MediaEval 2014: Query by Example Search on Speech Task (QUESST)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcos Calvo</string-name>
          <email>mcalvo@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mayte Giménez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-F. Hurtado</string-name>
          <email>lhurtado@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilio Sanchis</string-name>
          <email>esanchis@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jon A. Gómez</string-name>
          <email>jon@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departament de Sistemes Informàtics i Computació Universitat Politècnica de València Camí de Vera</institution>
          <addr-line>s/n, 46020, València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) has submitted to the MediaEval 2014 Query by Example Search on Speech task. All of them are based on a Subsequence Dynamic Time Warping algorithm and do not use any information from outside the task (zero-resource systems).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In this paper, we present the systems that we have
submitted to the MediaEval 2014 Query by Example Search on
Speech task. The goal of the task is to identify the audio
documents that match a spoken query. This match can be
either exact (the same term appearing both in the query and in
the document) or with variations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The two systems we have submitted are based on a
Subsequence Dynamic Time Warping (S-DTW) algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
However, the systems differ in the way the audio files are
preprocessed, which makes the feature vectors different for
each system. It is worth noting that this approach does not
use any external information, which makes ours zero-resource
systems. In the following sections, we explain the differences
in how the feature vectors are computed for each system, the
search algorithm, and the results obtained in this evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. OVERVIEW OF THE SYSTEMS</title>
      <p>Both of our systems followed the same philosophy. The first step
was to preprocess all the audio files, both spoken documents
and queries. This way we obtained a sequence of feature
vectors as a representation of each audio file. Then, we took
each possible pair (document, query) and ran an S-DTW
algorithm on them. This provided the bounds of a possible
detection of the query within the document, and a score
for this detection. Finally, a decision-making module
established a threshold based on the scores of all the possible
detections, in order to output only the detections with
the highest confidences.</p>
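<p>As an illustration only, the two-stage philosophy above can be sketched as follows; the function names <monospace>parametrize</monospace> and <monospace>sdtw_search</monospace> are hypothetical placeholders, not the authors' actual implementation:</p>
<p>
```python
def run_search(documents, queries, parametrize, sdtw_search, threshold):
    """Score every (document, query) pair with S-DTW and keep the
    detections whose score passes the global threshold (here lower
    scores are assumed to mean more confident detections)."""
    # Step 1: preprocess all audio files into feature-vector sequences.
    doc_feats = {d: parametrize(d) for d in documents}
    query_feats = {q: parametrize(q) for q in queries}
    # Step 2: run S-DTW on each possible (document, query) pair.
    detections = []
    for d, df in doc_feats.items():
        for q, qf in query_feats.items():
            start, end, score = sdtw_search(df, qf)
            detections.append((d, q, start, end, score))
    # Step 3: decision-making module keeps only confident detections.
    return [det for det in detections if det[4] <= threshold]
```
</p>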
    </sec>
    <sec id="sec-3">
      <title>3. PARAMETRIZATION</title>
      <p>
        In the standard parametrization, the log Mel-filterbank outputs are m_j = log Y_j, where Y_j is the output magnitude of the j-th Mel-filterbank. In the case of using the approach proposed by Choi [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
m_j = α_j · log{1 + β_j · max(Y_j − N̂_j, γ_j · Y_j)}
where β_j = 0.001 and γ_j = 0.4 ∀j in our implementation, N̂_j
is the noise magnitude estimation of the j-th Mel-filterbank
output, and
      </p>
      <p>α_j = log(1 + Y_j/N̂_j) / Σ_{k=1}^{M} log(1 + Y_k/N̂_k)</p>
      <p>where M is the total number of Mel filters. The α_j values are
computed for each feature vector.</p>
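<p>A minimal sketch of this noise compensation, assuming the formulas above; the function name and the NumPy vectorization are ours, not Choi's reference implementation:</p>
<p>
```python
import numpy as np

def choi_compensation(Y, N_hat, beta=0.001, gamma=0.4):
    """Noise-compensated log Mel-filterbank outputs: a spectral-
    subtraction-style floor (gamma * Y) inside the log, weighted by
    the per-filter SNR term alpha."""
    Y = np.asarray(Y, dtype=float)
    N_hat = np.asarray(N_hat, dtype=float)
    # alpha_j = log(1 + Y_j / N_hat_j) / sum_k log(1 + Y_k / N_hat_k)
    snr_term = np.log(1.0 + Y / N_hat)
    alpha = snr_term / snr_term.sum()
    # m_j = alpha_j * log(1 + beta * max(Y_j - N_hat_j, gamma * Y_j))
    floored = np.maximum(Y - N_hat, gamma * Y)
    return alpha * np.log(1.0 + beta * floored)
```
</p>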
      <p>
        The next step in the parametrization is to apply the
standard Discrete Cosine Transform to the Mel-filterbank outputs. The
first 12 MFCCs are obtained. But in the case of the filtered
parametrization, a transformation of the energy and each MFCC
component is performed based on the Cumulative
Distribution Mapping (CDM) technique [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is based on the
use of histogram equalization, originally developed for image
processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The last step of the parametrization was the
computation of the first and second time derivatives.
      </p>
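<p>As a sketch of the CDM idea: one common form of histogram equalization maps each coefficient through its empirical CDF and then through the inverse CDF of a reference distribution (here a standard normal). This is an illustrative stand-in, not necessarily the exact variant used by the authors:</p>
<p>
```python
import numpy as np
from statistics import NormalDist

def cdm_equalize(x):
    """Cumulative Distribution Mapping of one coefficient track:
    replace each value by the standard-normal quantile of its
    empirical CDF rank."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.argsort(np.argsort(x))   # rank of each value, 0 .. n-1
    cdf = (ranks + 0.5) / n             # midpoint ranks avoid 0 and 1
    return np.array([NormalDist().inv_cdf(p) for p in cdf])
```
</p>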
      <p>It is worth noting that most queries contain leading
and trailing silences. Therefore, we trimmed the sequence of
feature vectors representing each query by means of a voice
activity detection procedure, in order to help the search
algorithm.</p>
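<p>A crude, purely illustrative energy-based trimming could look like the following; the actual voice activity detection procedure used by the authors is not specified beyond this paragraph:</p>
<p>
```python
import numpy as np

def trim_silence(frames, energy_index=0, ratio=0.1):
    """Drop leading/trailing frames whose energy (assumed here to be
    one component of each feature vector) falls below a fraction of
    the maximum energy. An illustrative stand-in for a real VAD."""
    frames = np.asarray(frames, dtype=float)
    energy = frames[:, energy_index]
    voiced = np.where(energy >= ratio * energy.max())[0]
    if len(voiced) == 0:
        return frames
    # keep everything between the first and last voiced frame
    return frames[voiced[0]:voiced[-1] + 1]
```
</p>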
    </sec>
    <sec id="sec-4">
      <title>4. SEARCH ALGORITHM</title>
      <p>Finding spoken queries within a set of audios is a
complex task, hence we used a Dynamic Programming (DP)
technique in order to face this problem. In particular, we
used S-DTW, that is a DP technique for comparing two
sequences of objects. In our case, one of the sequences
corresponds to feature vectors of one of the audio documents, and
the other one belongs to the query. Therefore, the S-DTW
method nds multiple local alignments of the query within
audio documents, by allowing it to start at any position of
the audio document.</p>
      <p>Equation 1 shows the generic formulation of S-DTW:</p>
      <p>
M(i, j) =
  +∞                                           if i &lt; 0
  +∞                                           if j &lt; 0
  0                                            if j = 0
  min_{(x,y)∈S} M(i−x, j−y) + D(A(i), B(j))    if j ≥ 1
(1)
      </p>
      <p>where M is the DP matrix; S is the set of allowed
movements, represented as pairs (x, y) of horizontal and vertical
increments; A(i) and B(j) are the objects at the
i-th and j-th positions of their respective sequences; and D
is a function that computes some distance or dissimilarity
between two objects.</p>
      <p>In our implementation the set of allowed movements S
is {(1, 2), (1, 1), (2, 1)}. This set of movements guarantees
that the size of any detection will be between 0.5 and 2 times
the size of the query.</p>
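<p>A straightforward (unoptimized) rendering of the recurrence with this movement set might look as follows; the indexing convention and the returned end-point are our assumptions, not the authors' code:</p>
<p>
```python
import numpy as np

def sdtw(doc, query, dist):
    """Subsequence DTW following Equation 1, with the movement set
    S = {(1, 2), (1, 1), (2, 1)}. doc and query are sequences of
    feature vectors, dist is a frame-level distance function.
    Returns (end, cost): the 1-based document position where the best
    detection ends, and its accumulated cost."""
    n, m = len(doc), len(query)
    INF = float("inf")
    # M[i][j]: best cost of aligning query[0..j-1] ending at doc[i-1];
    # the j = 0 column is 0, so a match may start anywhere in doc.
    M = np.full((n + 1, m + 1), INF)
    M[:, 0] = 0.0
    moves = [(1, 2), (1, 1), (2, 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(doc[i - 1], query[j - 1])
            best = INF
            for x, y in moves:
                if i - x >= 0 and j - y >= 0:
                    best = min(best, M[i - x, j - y])
            M[i, j] = best + d
    end = int(np.argmin(M[1:, m])) + 1
    return end, float(M[end, m])
```
</p>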
    </sec>
    <sec id="sec-5">
      <title>5. EXPERIMENTS AND RESULTS</title>
      <p>We performed several preliminary experiments in order to
find the best configuration for our systems.</p>
      <p>We evaluated different distance functions and
parametrizations. One of them was the Kullback-Leibler divergence on
sequences of vectors of probabilities as the representation of the
audio files. The probabilities were obtained from a GMM
estimated by means of the EM algorithm with all the audio
documents in the corpus. Different numbers of components
in the GMM were tried. We also tried the cosine distance
with the Mel-filterbank parametrization. However, we
finally used the cosine distance with the MFCCs, since it provided
the best results for the development set.</p>
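<p>For reference, the cosine distance between two feature vectors, as commonly defined (1 minus the cosine similarity):</p>
<p>
```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two feature vectors: 1 - cos(a, b),
    so parallel vectors give 0 and opposite vectors give 2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
</p>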
      <p>For this MediaEval 2014 Query by Example Search on
Speech Evaluation, we submitted one run for each of the two systems
described above. The results we obtained are shown in
Tables 1 and 2. The measure to be optimized for this
Evaluation was the normalized cross entropy cost (Cnxe). However, other
measures such as the Maximum and the Actual Term Weighted
Values (MTWV and ATWV, respectively) were considered
as secondary metrics, as they are very widely used in this
kind of task.</p>
      <p>Results shown in both tables reveal a poor performance of
our systems (a high value of Cnxe). Nevertheless, given the
difference in the sources of the audio documents and audio
queries, we expected a higher accuracy for our system that
uses the Choi parametrization.</p>
      <p>We ran our own multi-threaded implementation of the S-DTW
algorithm on a standard PC with an i7 processor and 16 GB
of RAM, using 8 threads on a Linux operating system. At the
parametrization stage, we achieved an indexing speed factor
of 1.26×10⁻², and our memory peak was around 0.25 GB. At
the search stage, our searching speed factor was 2.34×10⁻³.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS</title>
      <p>In this paper, we have presented the systems we
submitted to the MediaEval 2014 Query by Example Search on
Speech Evaluation, as well as the results obtained. This
was a very challenging task in which both exact and varied
occurrences of the queries within the documents had to be
found. Despite our preliminary attempts, our approach
has proven not to be suitable for this task. One of the
reasons is the nature of the S-DTW algorithm: its
use makes it impossible to find occurrences of queries where
a reordering of words is needed. However, we would like to
point out that significant improvements were observed when
trimmed queries were used for the development set.</p>
      <p>As future work, we would like to improve our system in
order to use it in tasks like QUESST, where swaps in the
order of the components of a query can happen. Facing this
kind of word reordering would be possible if a higher level
of knowledge were used, e.g. sequences of phonemes instead
of only sequences of acoustic feature vectors. It is not
possible to use words in a task where distinct languages may
appear and no source other than the audio files is provided.</p>
    </sec>
    <sec id="sec-7">
      <title>7. ACKNOWLEDGMENTS</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrarons</surname>
          </string-name>
          .
          <article-title>Memory efficient subsequence DTW for Query-by-Example spoken term detection</article-title>
          .
          <source>In 2013 IEEE International Conference on Multimedia and Expo. IEEE</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
          . In MediaEval 2014 Workshop, 16-17 October
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E. H. C.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>On compensating the mel-frequency cepstral coefficients for noisy speech recognition</article-title>
          .
          <source>In Proceedings of the 29th Australasian Computer Science Conference - Volume 48, ACSC '06</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>54</lpage>
          , Darlinghurst, Australia,
          <year>2006</year>
          . Australian Computer Society, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Russ</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Woods</surname>
          </string-name>
          .
          <article-title>The image processing handbook</article-title>
          .
          <source>Journal of Computer Assisted Tomography</source>
          ,
          <volume>19</volume>
          (
          <issue>6</issue>
          ):
          <fpage>979</fpage>
          -
          <lpage>981</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>