=Paper= {{Paper |id=None |storemode=property |title=ELiRF at MediaEval 2013: Spoken Web Search Task |pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_80.pdf |volume=Vol-1043 |dblpUrl=https://dblp.org/rec/conf/mediaeval/GomezHCS13 }} ==ELiRF at MediaEval 2013: Spoken Web Search Task== https://ceur-ws.org/Vol-1043/mediaeval2013_submission_80.pdf

ELiRF at MediaEval 2013: Spoken Web Search Task

Jon A. Gómez, Lluís-F. Hurtado, Marcos Calvo, Emilio Sanchis
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
Camí de Vera s/n, 46020, València, Spain
{jon, lhurtado, mcalvo, esanchis}@dsic.upv.es

ABSTRACT tion that computes some distance or dissimilarity between
In this paper, we present the systems that the Natural Lan- two objects.
guage Engineering and Pattern Recognition group (ELiRF) In this task the sequences of objects to be aligned are
has submitted to the MediaEval 2013 Spoken Web Search the sequences of feature vectors obtained from the audio
task. All of them are based on a Subsequence Dynamic Time files corresponding to the documents and the queries. This
Warping algorithm and are zero-resources systems. approach allows us to find the best alignment of the query
in each document taking as the starting point every frame
of the document. For this work we have used the cosine
1. INTRODUCTION distance in all our systems, since it provided the best results
In this paper, we present the systems that we have sumit- for the development set.
ted to the MediaEval 2013 Spoken Web Search task [2]. This Each of the computed alignments should be considered
task can be placed in the framework of Query-by-Example a candidate detection. Hence, this strategy provides a too
Spoken Term Detection (QbE-STD) tasks, where a set of large number of candidates. This way, it is necessary to find
documents and queries are provided, and the goal of the a criterion to find the set of definitive detections among the
task is to find all the occurrences of each query within each elements of this set.
document in the collection. In this particular case, a variety Another common step for all our systems is that, as part
of languages and acoustic conditions are represented, but no of the preprocessing, we deleted the leading and trailing si-
information about them is provided to the participants. lences of the queries by using a Voice Activity Detection
All the systems we have submitted to this MediaEval strategy based on a Smith trigger. This led our systems to
2013 Evaluation are based on a Subsequence Dynamic Time a better performance.
Warping (S-DTW) algorithm [1], but using different dis- Thus, our systems differ basically on three different as-
tances, sets of possible movements, and feature vectors. Also, pects: how the feature vectors are obtained, how to deter-
all our systems are zero-resources systems, that is, they do mine which of the candidate detections are considered as
not use any external information, but just the one provided definitive, and which are the allowed movements in the Dy-
by the task. namic Programming algorithm.

2. DESCRIPTION OF THE SYSTEMS 2.1 System 1
For this task, we have submitted four different systems, In this system, the acoustic signal is parametrized using
all of them based on the S-DTW algorithm. S-DTW is a the energy, the first twelve cepstral coeficients and their first
Dynamic Programming (DP) algorithm which aim is to find and second derivatives, using a sampling period of 10 ms.
multiple local alignments of two input sequences of objects Thus, we represent each frame as a 39-dimensional vector.
using a set of allowed movements, but allowing one of the Then, we perform the S-DTW step, using a particular set of
sequences to start at any position of the other. Equation 1 movements: {(1,2), (1,1), (2,1)}. Also, in each step of the
shows the generic formulation of the S-DTW algorithm. S-DTW algorithm, we have kept and maximized the accu-
 mulated distance normalized by the number of operations

 +∞ i<0 carried out until that point. The set of candidate detections
are all the hypotheses that arrived to any cell corresponding

+∞

j<0
M (i, j) = 0 to the last frame of the query in the DP matrix. Further-
j=0

 more, the set of movements used guarantees the size of any
 min M (i − x, j − y) + D(A(i), B(j)) j ≥ 1


∀(x,y)∈S detection will be between 0.5 and 2 times the size of the
(1) query. These candidates are filtered using Algorithm 1. The
where M is the DP matrix; S is the set of allowed move- idea of this algorithm is to find all the local minima that do
ments, represented as pairs (x, y) of horizontal and vertical not overlap any other local minimum with a better score,
increments; A(i), B(j) are the objects representing the po- and then fix a threshold according to a linear combination
sitions i and j of their respective sequences; and D is a func- of the average and standard deviation of the scores of the
“cleaned” set of local minima (the parameter λ of this lin-
ear combination is empirically adjusted). Also, a maximum
Copyright is held by the author/owner(s). number of filtered detections for each query d is allowed. In
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain this system, we have adjusted the parameters in order to
obtain just a few definitive detections per query. 2.4 System 4
This system is similar to System 3, but the way of obtain-
Algorithm 1 Algorithm to filter a list of candidate detec- ing the feature vectors varies. The features are here obtained
tions by using a Dissimilarity Space. 300 frames are selected from
Require: A list of candidate detections CD, the development set applying the Katsavounidis criterion
a maximum number of filtered detections d, with the cosine distance as metric [3]. Then each frame is
a coefficient λ moved into the dissimilarity space, where each component
Ensure: A list of filtered detections F D of the new feature vectors is computed as the distance from
1: SCD = sort the hypothesis in CD by their score the sample to each one of the 300 taken as references. Thus,
2: F D2 = empty list in this system the feature vectors have 300 dimensions. All
3: while SCD is not empty do the frames from both the documents and the queries are
4: h = first element of SCD converted to this Dissimilarity Space, and the S-DTW is
5: Move h to F D2 performed using these vectors.
6: Delete from SCD all the detections h0 such that
timespan(h0 )∩timespan(h) 6= ∅ 3. EXPERIMENTS AND RESULTS
7: end while For this MediaEval 2013 Spoken Web Search Evaluation,
8: t = avg + λ · sd, where avg and sd represent the average we submitted one run for each of the four systems described
and the standard deviation of the elements in FD2 above. The results we obtained are shown in Tables 1 and 2,
9: F D = first d elements of F D2 with a score ≥ t where P stands for Precision and R means Recall. Table 2
10: return F D also shows the Real Time factor (RT) obtained for the test
set. Its value for the development set is very similar.

2.2 System 2
Table 1: Results obtained for the development set.
This system is very similar to System 1, but the thresh-
olds were adjusted in a less restrictive way. The number of System MTWV ATWV P(%) R(%) Cnxe
hypotheses provided by this system is much larger than for Sys. 1 0.1699 0.1697 3.47 15.69 2.45
System 1. Sys. 2 0.1296 0.1291 2.21 16.71 3.91
Sys. 3 0.1480 0.1478 3.18 14.37 1.03
2.3 System 3 Sys. 4 0.1463 0.1461 2.55 15.76 1.00
This system uses the same parametrization as Systems 1
and 2. However, the allowed movements for the S-DTW are
{(0,1), (1,0), (1,1)}. Also the algorithm to filter the candi-
Table 2: Results obtained for the test set.
date detections is a bit different (see Algorithm 2). In this Sys. MTWV ATWV P(%) R(%) Cnxe RT
algorithm the condition for local minima not to be pruned
is that: (i) they have a value larger than a threshold and (ii) S. 1 0.1593 0.1591 3.29 14.89 2.53 3·10−3
there is not any other detection with a better score within S. 2 0.1016 0.1016 1.99 12.44 4.83 3·10−3
a window of 2 seconds. Finally, at most n occurrences per S. 3 0.1481 0.1475 3.03 13.66 1.03 5·10−4
query and k detections per document are allowed. S. 4 0.1462 0.1457 2.47 15.08 1.00 2·10−3

Algorithm 2 Another way of filtering a list of candidate All the software of the systems presented here was com-
detections pletely developed in our research group. Also, all these sys-
Require: A list of candidate detections CD, tems were run on a standard PC with an i7 processor and
a maximum number of occurrences per query n, 32 GB of RAM, using 8 threads. The memory peaks for
a maximum number of detections per document k systems 1 and 2 were around 12 GB, and for systems 3 and
Ensure: A list of filtered detections F D 4 were around 1 GB.
1: SCD = empty list
2: for all Query q do 4. ACKNOWLEDGEMENTS
3: for all Document d do Work funded by the Spanish Government and the E.U. un-
4: m = minimum score of a detection of q within d der contract TIN2011-28169-C05 and FPU Grant AP2010-
5: M = maximum score of a detection of q within d 4193.
6: t = m + 0.1(M − m)
7: Add to SCD all the hypotheses from CD with a 5. REFERENCES
score larger than t and that do not overlap a better [1] X. Anguera and M. Ferrarons. Memory efficient
detection within a window of 2 seconds. subsequence DTW for Query-by-Example spoken term
8: end for detection. In 2013 IEEE International Conference on
9: end for Multimedia and Expo.
10: F DP = For each query, keep the at most n best occur- [2] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J.
rences in SCD Rodriguez-Fuentes. The Spoken Web Search Task. In
11: F D = For each document, keep the at most k best de- MediaEval 2013 Workshop, 18-19 October 2013.
tections in F DP [3] I. Katsavounidis, C.-C. Jay Kuo, and Z. Zhang. A new
12: return F D initialization technique for generalized Lloyd iteration.
Signal Processing Letters, IEEE, 1(10):144–146, 1994.