=Paper= {{Paper |id=None |storemode=property |title=ELiRF at MediaEval 2013: Spoken Web Search Task |pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_80.pdf |volume=Vol-1043 |dblpUrl=https://dblp.org/rec/conf/mediaeval/GomezHCS13 }} ==ELiRF at MediaEval 2013: Spoken Web Search Task== https://ceur-ws.org/Vol-1043/mediaeval2013_submission_80.pdf
        ELiRF at MediaEval 2013: Spoken Web Search Task

                       Jon A. Gómez, Lluís-F. Hurtado, Marcos Calvo, Emilio Sanchis
                                    Departament de Sistemes Informàtics i Computació
                                           Universitat Politècnica de València
                                        Camí de Vera s/n, 46020, València, Spain
                                  {jon, lhurtado, mcalvo, esanchis}@dsic.upv.es


ABSTRACT                                                          tion that computes some distance or dissimilarity between
In this paper, we present the systems that the Natural Lan-       two objects.
guage Engineering and Pattern Recognition group (ELiRF)              In this task the sequences of objects to be aligned are
has submitted to the MediaEval 2013 Spoken Web Search             the sequences of feature vectors obtained from the audio
task. All of them are based on a Subsequence Dynamic Time         files corresponding to the documents and the queries. This
Warping algorithm and are zero-resources systems.                 approach allows us to find the best alignment of the query
                                                                  in each document taking as the starting point every frame
                                                                  of the document. For this work we have used the cosine
1.   INTRODUCTION                                                 distance in all our systems, since it provided the best results
   In this paper, we present the systems that we have sumit-      for the development set.
ted to the MediaEval 2013 Spoken Web Search task [2]. This           Each of the computed alignments should be considered
task can be placed in the framework of Query-by-Example           a candidate detection. Hence, this strategy provides a too
Spoken Term Detection (QbE-STD) tasks, where a set of             large number of candidates. This way, it is necessary to find
documents and queries are provided, and the goal of the           a criterion to find the set of definitive detections among the
task is to find all the occurrences of each query within each     elements of this set.
document in the collection. In this particular case, a variety       Another common step for all our systems is that, as part
of languages and acoustic conditions are represented, but no      of the preprocessing, we deleted the leading and trailing si-
information about them is provided to the participants.           lences of the queries by using a Voice Activity Detection
   All the systems we have submitted to this MediaEval            strategy based on a Smith trigger. This led our systems to
2013 Evaluation are based on a Subsequence Dynamic Time           a better performance.
Warping (S-DTW) algorithm [1], but using different dis-              Thus, our systems differ basically on three different as-
tances, sets of possible movements, and feature vectors. Also,    pects: how the feature vectors are obtained, how to deter-
all our systems are zero-resources systems, that is, they do      mine which of the candidate detections are considered as
not use any external information, but just the one provided       definitive, and which are the allowed movements in the Dy-
by the task.                                                      namic Programming algorithm.

2.   DESCRIPTION OF THE SYSTEMS                                   2.1    System 1
   For this task, we have submitted four different systems,          In this system, the acoustic signal is parametrized using
all of them based on the S-DTW algorithm. S-DTW is a              the energy, the first twelve cepstral coeficients and their first
Dynamic Programming (DP) algorithm which aim is to find           and second derivatives, using a sampling period of 10 ms.
multiple local alignments of two input sequences of objects       Thus, we represent each frame as a 39-dimensional vector.
using a set of allowed movements, but allowing one of the         Then, we perform the S-DTW step, using a particular set of
sequences to start at any position of the other. Equation 1       movements: {(1,2), (1,1), (2,1)}. Also, in each step of the
shows the generic formulation of the S-DTW algorithm.             S-DTW algorithm, we have kept and maximized the accu-
                                                                 mulated distance normalized by the number of operations
           
            +∞                                       i<0         carried out until that point. The set of candidate detections
                                                                  are all the hypotheses that arrived to any cell corresponding
           
           +∞
           
                                                      j<0
M (i, j) = 0                                                      to the last frame of the query in the DP matrix. Further-
                                                      j=0
           
                                                                 more, the set of movements used guarantees the size of any
            min M (i − x, j − y) + D(A(i), B(j)) j ≥ 1
           
           
            ∀(x,y)∈S                                              detection will be between 0.5 and 2 times the size of the
                                                            (1)   query. These candidates are filtered using Algorithm 1. The
where M is the DP matrix; S is the set of allowed move-           idea of this algorithm is to find all the local minima that do
ments, represented as pairs (x, y) of horizontal and vertical     not overlap any other local minimum with a better score,
increments; A(i), B(j) are the objects representing the po-       and then fix a threshold according to a linear combination
sitions i and j of their respective sequences; and D is a func-   of the average and standard deviation of the scores of the
                                                                  “cleaned” set of local minima (the parameter λ of this lin-
                                                                  ear combination is empirically adjusted). Also, a maximum
Copyright is held by the author/owner(s).                         number of filtered detections for each query d is allowed. In
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain    this system, we have adjusted the parameters in order to
obtain just a few definitive detections per query.                2.4      System 4
                                                                     This system is similar to System 3, but the way of obtain-
Algorithm 1 Algorithm to filter a list of candidate detec-        ing the feature vectors varies. The features are here obtained
tions                                                             by using a Dissimilarity Space. 300 frames are selected from
Require: A list of candidate detections CD,                       the development set applying the Katsavounidis criterion
    a maximum number of filtered detections d,                    with the cosine distance as metric [3]. Then each frame is
    a coefficient λ                                               moved into the dissimilarity space, where each component
Ensure: A list of filtered detections F D                         of the new feature vectors is computed as the distance from
 1: SCD = sort the hypothesis in CD by their score                the sample to each one of the 300 taken as references. Thus,
 2: F D2 = empty list                                             in this system the feature vectors have 300 dimensions. All
 3: while SCD is not empty do                                     the frames from both the documents and the queries are
 4:   h = first element of SCD                                    converted to this Dissimilarity Space, and the S-DTW is
 5:   Move h to F D2                                              performed using these vectors.
 6:   Delete from SCD all the detections h0 such that
      timespan(h0 )∩timespan(h) 6= ∅                              3.    EXPERIMENTS AND RESULTS
 7: end while                                                       For this MediaEval 2013 Spoken Web Search Evaluation,
 8: t = avg + λ · sd, where avg and sd represent the average      we submitted one run for each of the four systems described
    and the standard deviation of the elements in FD2             above. The results we obtained are shown in Tables 1 and 2,
 9: F D = first d elements of F D2 with a score ≥ t               where P stands for Precision and R means Recall. Table 2
10: return F D                                                    also shows the Real Time factor (RT) obtained for the test
                                                                  set. Its value for the development set is very similar.

2.2    System 2
                                                                  Table 1: Results obtained for the development set.
  This system is very similar to System 1, but the thresh-
olds were adjusted in a less restrictive way. The number of          System MTWV ATWV P(%) R(%) Cnxe
hypotheses provided by this system is much larger than for           Sys. 1   0.1699   0.1697   3.47 15.69 2.45
System 1.                                                            Sys. 2   0.1296   0.1291   2.21 16.71 3.91
                                                                     Sys. 3   0.1480   0.1478   3.18 14.37 1.03
2.3    System 3                                                      Sys. 4   0.1463   0.1461   2.55 15.76 1.00
   This system uses the same parametrization as Systems 1
and 2. However, the allowed movements for the S-DTW are
{(0,1), (1,0), (1,1)}. Also the algorithm to filter the candi-
                                                                          Table 2: Results obtained for the test set.
date detections is a bit different (see Algorithm 2). In this      Sys.     MTWV ATWV P(%) R(%) Cnxe                 RT
algorithm the condition for local minima not to be pruned
is that: (i) they have a value larger than a threshold and (ii)    S. 1     0.1593   0.1591   3.29 14.89 2.53 3·10−3
there is not any other detection with a better score within        S. 2     0.1016   0.1016   1.99 12.44 4.83 3·10−3
a window of 2 seconds. Finally, at most n occurrences per          S. 3     0.1481   0.1475   3.03 13.66 1.03 5·10−4
query and k detections per document are allowed.                   S. 4     0.1462   0.1457   2.47 15.08 1.00 2·10−3

Algorithm 2 Another way of filtering a list of candidate            All the software of the systems presented here was com-
detections                                                        pletely developed in our research group. Also, all these sys-
Require: A list of candidate detections CD,                       tems were run on a standard PC with an i7 processor and
    a maximum number of occurrences per query n,                  32 GB of RAM, using 8 threads. The memory peaks for
    a maximum number of detections per document k                 systems 1 and 2 were around 12 GB, and for systems 3 and
Ensure: A list of filtered detections F D                         4 were around 1 GB.
 1: SCD = empty list
 2: for all Query q do                                            4.    ACKNOWLEDGEMENTS
 3:   for all Document d do                                         Work funded by the Spanish Government and the E.U. un-
 4:      m = minimum score of a detection of q within d           der contract TIN2011-28169-C05 and FPU Grant AP2010-
 5:      M = maximum score of a detection of q within d           4193.
 6:      t = m + 0.1(M − m)
 7:      Add to SCD all the hypotheses from CD with a             5.    REFERENCES
         score larger than t and that do not overlap a better     [1] X. Anguera and M. Ferrarons. Memory efficient
         detection within a window of 2 seconds.                      subsequence DTW for Query-by-Example spoken term
 8:   end for                                                         detection. In 2013 IEEE International Conference on
 9: end for                                                           Multimedia and Expo.
10: F DP = For each query, keep the at most n best occur-         [2] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J.
    rences in SCD                                                     Rodriguez-Fuentes. The Spoken Web Search Task. In
11: F D = For each document, keep the at most k best de-              MediaEval 2013 Workshop, 18-19 October 2013.
    tections in F DP                                              [3] I. Katsavounidis, C.-C. Jay Kuo, and Z. Zhang. A new
12: return F D                                                        initialization technique for generalized Lloyd iteration.
                                                                      Signal Processing Letters, IEEE, 1(10):144–146, 1994.