<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF at MediaEval 2013: Spoken Web Search Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jon A. Gómez</string-name>
          <email>jon@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-F. Hurtado</string-name>
          <email>lhurtado@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcos Calvo</string-name>
          <email>mcalvo@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilio Sanchis</string-name>
          <email>esanchis@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departament de Sistemes Informàtics i Computació Universitat Politècnica de València Camí de Vera</institution>
          <addr-line>s/n, 46020, València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
<p>In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) has submitted to the MediaEval 2013 Spoken Web Search task. All of them are based on a Subsequence Dynamic Time Warping algorithm, and all are zero-resource systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
<p>In this paper, we present the systems that we have submitted to the MediaEval 2013 Spoken Web Search task [2]. This task can be placed in the framework of Query-by-Example Spoken Term Detection (QbE-STD) tasks, where a set of documents and queries is provided, and the goal of the task is to find all the occurrences of each query within each document in the collection. In this particular case, a variety of languages and acoustic conditions are represented, but no information about them is provided to the participants.</p>
<p>All the systems we have submitted to this MediaEval 2013 evaluation are based on a Subsequence Dynamic Time Warping (S-DTW) algorithm [1], but they use different distances, sets of allowed movements, and feature vectors. Also, all our systems are zero-resource systems, that is, they use no external information beyond that provided by the task.</p>
    </sec>
    <sec id="sec-2">
<title>2. DESCRIPTION OF THE SYSTEMS</title>
<p>For this task, we have submitted four different systems, all of them based on the S-DTW algorithm. S-DTW is a Dynamic Programming (DP) algorithm whose aim is to find multiple local alignments of two input sequences of objects using a set of allowed movements, while allowing one of the sequences to start at any position of the other. Equation 1 shows the generic formulation of the S-DTW algorithm:
M(i, j) =
  +∞  if i &lt; 0,
  +∞  if j &lt; 0,
  0   if j = 0,
  min_{(x, y) ∈ S} M(i − x, j − y) + D(A(i), B(j))  if j ≥ 1.   (1)</p>
      <p>Here M is the DP matrix; S is the set of allowed movements, represented as pairs (x, y) of horizontal and vertical increments; A(i) and B(j) are the objects at positions i and j of their respective sequences; and D is a function that computes some distance or dissimilarity between two objects.</p>
<p>In this task, the sequences of objects to be aligned are the sequences of feature vectors obtained from the audio files corresponding to the documents and the queries. This approach allows us to find the best alignment of the query in each document, taking every frame of the document as a possible starting point. For this work we have used the cosine distance in all our systems, since it provided the best results on the development set.</p>
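      <p>As an illustration, the recurrence of Equation 1 combined with the cosine distance can be sketched in Python. This is a minimal sketch written for this description, not the authors' software; all names are illustrative.</p>

```python
import math

def cosine_distance(u, v):
    # D(a, b) = 1 - cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def s_dtw(doc, query, moves=((1, 1), (1, 2), (2, 1)), dist=cosine_distance):
    """Fill the S-DTW matrix M of Eq. (1): M[i][j] is the best accumulated
    distance of an alignment ending at document frame i and query frame j.
    Row j = 0 is fixed to 0, so the query may start at any document frame."""
    n, m = len(doc), len(query)
    INF = float("inf")
    M = [[0.0 if j == 0 else INF for j in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # min over the allowed movements (x, y) in S
            best = INF
            for x, y in moves:
                if i - x >= 0 and j - y >= 0:
                    best = min(best, M[i - x][j - y])
            if best < INF:
                M[i][j] = best + dist(doc[i - 1], query[j - 1])
    # every cell in the last query row is a candidate detection end point
    return [M[i][m] for i in range(1, n + 1)]
```

      <p>Each entry of the returned list corresponds to an alignment of the whole query ending at that document frame, i.e., one candidate detection per ending frame.</p>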
<p>Each of the computed alignments should be considered a candidate detection. Hence, this strategy produces too many candidates, and a criterion is needed to find the set of definitive detections among them.</p>
<p>Another step common to all our systems is that, as part of the preprocessing, we deleted the leading and trailing silences of the queries by using a Voice Activity Detection strategy based on a Schmitt trigger. This improved the performance of our systems.</p>
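      <p>A Schmitt-trigger VAD applies two energy thresholds with hysteresis: speech turns on when the frame energy rises above the high threshold and turns off only when it falls below the low one, which avoids chattering around a single threshold. A hedged sketch, with illustrative names and thresholds chosen only for the example:</p>

```python
def trim_silence(energies, high, low):
    """Return the (first, last) indices of speech frames found by a
    Schmitt-trigger decision over per-frame energies, or None if no
    speech is detected. `high` and `low` are the hysteresis thresholds."""
    active = False
    speech = []
    for i, e in enumerate(energies):
        if not active and e > high:
            active = True          # rising edge: energy crossed the high threshold
        elif active and e < low:
            active = False         # falling edge: energy dropped below the low threshold
        if active:
            speech.append(i)
    if not speech:
        return None
    return speech[0], speech[-1]
```

      <p>Keeping only the frames between the returned indices removes the leading and trailing silences of a query.</p>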
<p>Thus, our systems differ basically in three aspects: how the feature vectors are obtained, how to determine which of the candidate detections are considered definitive, and which movements are allowed in the Dynamic Programming algorithm.</p>
    </sec>
    <sec id="sec-3">
<title>2.1 System 1</title>
      <p>In this system, the acoustic signal is parametrized using the energy, the first twelve cepstral coefficients, and their first and second derivatives, with a sampling period of 10 ms. Thus, we represent each frame as a 39-dimensional vector. Then, we perform the S-DTW step using a particular set of movements: {(1,2), (1,1), (2,1)}. Also, in each step of the S-DTW algorithm, we have kept and minimized the accumulated distance normalized by the number of operations carried out up to that point. The set of candidate detections comprises all the hypotheses that reached any cell corresponding to the last frame of the query in the DP matrix. Furthermore, this set of movements guarantees that the size of any detection will be between 0.5 and 2 times the size of the query. These candidates are filtered using Algorithm 1. The idea of this algorithm is to find all the local minima that do not overlap any other local minimum with a better score, and then fix a threshold according to a linear combination of the average and standard deviation of the scores of the "cleaned" set of local minima (the coefficient α of this linear combination is empirically adjusted). Also, a maximum number d of filtered detections per query is allowed. In this system, we have adjusted the parameters in order to obtain just a few definitive detections per query.</p>
      <p>Algorithm 1: Filtering a list of candidate detections.
Require: a list of candidate detections CD, a maximum number of filtered detections d, a coefficient α.
Ensure: a list of filtered detections FD.
1: SCD = sort the hypotheses in CD by their score
2: FD2 = empty list
3: while SCD is not empty do
4:   h = first element of SCD
5:   Move h to FD2
6:   Delete from SCD all the detections h′ such that timespan(h′) ∩ timespan(h) ≠ ∅
7: end while
8: t = avg + α·sd, where avg and sd represent the average and the standard deviation of the scores of the elements in FD2
9: FD = first d elements of FD2 with a score ≤ t
10: return FD</p>
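      <p>Algorithm 1 can be sketched in Python as follows. This is an illustration written for this description, not the authors' code; detections are modeled as (score, start, end) tuples, and a lower score (accumulated distance) is assumed to be better.</p>

```python
def filter_detections(cands, d, alpha):
    """Filter candidate detections as in Algorithm 1.
    cands: list of (score, start, end) tuples, lower score = better.
    d: maximum number of filtered detections per query.
    alpha: coefficient of the mean/std-dev threshold combination."""
    scd = sorted(cands)                      # step 1: best score first
    fd2 = []
    while scd:
        h = scd.pop(0)                       # best remaining local minimum
        fd2.append(h)
        # step 6: drop every candidate whose time span overlaps h's
        scd = [c for c in scd if c[2] <= h[1] or c[1] >= h[2]]
    scores = [s for s, _, _ in fd2]
    avg = sum(scores) / len(scores)
    sd = (sum((s - avg) ** 2 for s in scores) / len(scores)) ** 0.5
    t = avg + alpha * sd                     # step 8: empirical threshold
    # step 9: first d elements of FD2 with a score below the threshold
    return [c for c in fd2 if c[0] <= t][:d]
```

      <p>Since FD2 is built in score order, filtering by the threshold and truncating to d elements yields the d best non-overlapping detections.</p>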
    </sec>
    <sec id="sec-4">
<title>2.2 System 2</title>
<p>This system is very similar to System 1, but the thresholds were adjusted in a less restrictive way. The number of hypotheses provided by this system is much larger than for System 1.</p>
    </sec>
    <sec id="sec-5">
<title>2.3 System 3</title>
      <p>This system uses the same parametrization as Systems 1 and 2. However, the allowed movements for the S-DTW are {(0,1), (1,0), (1,1)}. Also, the algorithm to filter the candidate detections is slightly different (see Algorithm 2). In this algorithm, the conditions for local minima not to be pruned are that (i) they have a value larger than a threshold, and (ii) there is no other detection with a better score within a window of 2 seconds. Finally, at most n occurrences per query and k detections per document are allowed.</p>
      <p>Algorithm 2: Another way of filtering a list of candidate detections.
Require: a list of candidate detections CD, a maximum number of occurrences per query n, a maximum number of detections per document k.
Ensure: a list of filtered detections FD.
1: SCD = empty list
2: for all queries q do
3:   for all documents d do
4:     m = minimum score of a detection of q within d
5:     M = maximum score of a detection of q within d
6:     t = m + 0.1(M − m)
7:     Add to SCD all the hypotheses from CD with a score larger than t that do not overlap a better detection within a window of 2 seconds
8:   end for
9: end for
10: FDP = for each query, keep at most the n best occurrences in SCD
11: FD = for each document, keep at most the k best detections in FDP
12: return FD</p>
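      <p>A sketch of Algorithm 2 follows. It is a hypothetical illustration, not the authors' code: detections are modeled as dicts with query, doc, score, and time fields, and, following the wording of the algorithm, a larger score is assumed to be better.</p>

```python
def filter_algorithm2(cands, n, k, window=2.0):
    """Filter candidate detections as in Algorithm 2.
    cands: list of dicts with keys 'query', 'doc', 'score', 'time';
    larger score assumed better. n: max occurrences per query;
    k: max detections per document; window: pruning window in seconds."""
    scd = []
    queries = {c["query"] for c in cands}
    docs = {c["doc"] for c in cands}
    for q in queries:
        for d in docs:
            group = [c for c in cands if c["query"] == q and c["doc"] == d]
            if not group:
                continue
            m = min(c["score"] for c in group)
            M = max(c["score"] for c in group)
            t = m + 0.1 * (M - m)            # step 6: per-pair threshold
            for c in group:
                # step 7: prune if a better detection lies within the window
                better_nearby = any(
                    o is not c and o["score"] > c["score"]
                    and abs(o["time"] - c["time"]) < window
                    for o in group
                )
                if c["score"] > t and not better_nearby:
                    scd.append(c)
    # step 10: at most the n best occurrences per query
    fdp = []
    for q in queries:
        qs = sorted((c for c in scd if c["query"] == q),
                    key=lambda c: -c["score"])
        fdp.extend(qs[:n])
    # step 11: at most the k best detections per document
    fd = []
    for d in docs:
        ds = sorted((c for c in fdp if c["doc"] == d),
                    key=lambda c: -c["score"])
        fd.extend(ds[:k])
    return fd
```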
    </sec>
    <sec id="sec-6">
<title>2.4 System 4</title>
<p>This system is similar to System 3, but the way of obtaining the feature vectors varies. The features are obtained here by using a dissimilarity space: 300 frames are selected from the development set by applying the Katsavounidis criterion with the cosine distance as metric [3]. Then each frame is mapped into the dissimilarity space, where each component of the new feature vector is computed as the distance from the sample to one of the 300 frames taken as references. Thus, in this system the feature vectors have 300 dimensions. All the frames from both the documents and the queries are converted to this dissimilarity space, and the S-DTW is performed using these vectors.</p>
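      <p>The two steps of this mapping can be sketched as follows. The selection here is a generic farthest-point variant in the spirit of the Katsavounidis initialization (the original starts from the highest-norm vector); names, the starting frame, and the distance are illustrative assumptions.</p>

```python
import math

def cos_dist(u, v):
    # cosine distance, as used as the metric in this system
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u))
                        * math.sqrt(sum(b * b for b in v)))

def katsavounidis_select(frames, k, dist=cos_dist):
    """Select k reference frames by repeatedly adding the frame whose
    minimum distance to the already-selected references is largest
    (farthest-point selection; a sketch, starting from the first frame)."""
    refs = [frames[0]]
    while len(refs) < k:
        far = max(frames, key=lambda f: min(dist(f, r) for r in refs))
        refs.append(far)
    return refs

def to_dissimilarity_space(frames, refs, dist=cos_dist):
    # each frame becomes the vector of its distances to the references,
    # so the new feature vectors have len(refs) dimensions (300 in the paper)
    return [[dist(f, r) for r in refs] for f in frames]
```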
    </sec>
    <sec id="sec-7">
<title>3. EXPERIMENTS AND RESULTS</title>
      <p>For this MediaEval 2013 Spoken Web Search Evaluation,
we submitted one run for each of the four systems described
above. The results we obtained are shown in Tables 1 and 2,
where P stands for Precision and R means Recall. Table 2
also shows the Real Time factor (RT) obtained for the test
set. Its value for the development set is very similar.</p>
      <p>All the software of the systems presented here was
completely developed in our research group. Also, all these
systems were run on a standard PC with an i7 processor and
32 GB of RAM, using 8 threads. The memory peaks for
systems 1 and 2 were around 12 GB, and for systems 3 and
4 were around 1 GB.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrarons</surname>
          </string-name>
          .
<article-title>Memory efficient subsequence DTW for Query-by-Example spoken term detection</article-title>
          .
          <source>In 2013 IEEE International Conference on Multimedia and Expo.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          ,
<string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          .
          <article-title>The Spoken Web Search Task</article-title>
          . In MediaEval 2013 Workshop,
18-19 October
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Katsavounidis</surname>
          </string-name>
,
          <string-name>
            <given-names>C.-C. J.</given-names>
            <surname>Kuo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>A new initialization technique for generalized Lloyd iteration</article-title>
          .
<source>IEEE Signal Processing Letters</source>
          ,
          <volume>1</volume>
          (
          <issue>10</issue>
          ):
<fpage>144</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>