<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF at MediaEval 2014: Query by Example Search on Speech Task (QUESST)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcos Calvo</string-name>
          <email>mcalvo@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mayte Giménez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-F. Hurtado</string-name>
          <email>lhurtado@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilio Sanchis</string-name>
          <email>esanchis@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jon A. Gómez</string-name>
          <email>jon@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departament de Sistemes Informàtics i Computació Universitat Politècnica de València Camí de Vera</institution>
          <addr-line>s/n, 46020, València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>In this paper, we present the systems that the Natural Language Engineering and Pattern Recognition group (ELiRF) has submitted to the MediaEval 2014 Query by Example Search on Speech task. All of them are based on a Subsequence Dynamic Time Warping algorithm and do not use any information from outside the task (zero-resource systems).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In this paper, we present the systems that we have
submitted to the MediaEval 2014 Query by Example Search on
Speech task. The goal of the task is to identify the audio
documents that match a spoken query. This match can be
either exact (the same term appearing both in the query and in
the document) or with variations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The two systems we have submitted are based on a
Subsequence Dynamic Time Warping (S-DTW) algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
However, the systems differ in the way the audio files are
preprocessed, which makes the feature vectors different for
each system. It is worth noting that this approach does not
use any external information, which makes ours zero-resource
systems. In the following sections, we explain the differences
in how the feature vectors are computed for each system, the
search algorithm, and the results obtained in this evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. OVERVIEW OF THE SYSTEMS</title>
      <p>Both of our systems followed the same philosophy. The first step
was to preprocess all the audio files, both spoken documents
and queries. This way we obtained a sequence of feature
vectors as a representation of each audio file. Then, we took
each possible pair (document, query) and ran an S-DTW
algorithm on them. This provided the bounds of a possible
detection of the query within the document, and a score
for this detection. Finally, a decision-making module
established a threshold based on the scores of all the possible
detections, in order to output only the detections with
the highest confidences.</p>
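<p>As an illustration only, the two-stage philosophy above can be sketched as follows; the function names <monospace>parametrize</monospace> and <monospace>sdtw_search</monospace> are hypothetical placeholders, not the authors' actual implementation:</p>
<p>
```python
def run_search(documents, queries, parametrize, sdtw_search, threshold):
    """Score every (document, query) pair with S-DTW and keep the
    detections whose score passes the global threshold (here lower
    scores are assumed to mean more confident detections)."""
    # Step 1: preprocess all audio files into feature-vector sequences.
    doc_feats = {d: parametrize(d) for d in documents}
    query_feats = {q: parametrize(q) for q in queries}
    # Step 2: run S-DTW on each possible (document, query) pair.
    detections = []
    for d, df in doc_feats.items():
        for q, qf in query_feats.items():
            start, end, score = sdtw_search(df, qf)
            detections.append((d, q, start, end, score))
    # Step 3: decision-making module keeps only confident detections.
    return [det for det in detections if det[4] <= threshold]
```
</p>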
    </sec>
    <sec id="sec-3">
      <title>3. PARAMETRIZATION</title>
      <p>
        In the standard parametrization, the log Mel-filterbank outputs are m_j = log Y_j, where Y_j is the output magnitude of the j-th Mel-filterbank. In the case of using the approach proposed by Choi [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
m_j = α_j · log{1 + β_j · max(Y_j − N̂_j, γ_j · Y_j)}
where β_j = 0.001 and γ_j = 0.4 ∀j in our implementation, N̂_j
is the noise magnitude estimation of the j-th Mel-filterbank
output, and
      </p>
      <p>α_j = log(1 + Y_j/N̂_j) / Σ_{k=1}^{M} log(1 + Y_k/N̂_k)</p>
      <p>where M is the total number of Mel filters. The α_j values are
computed for each feature vector.</p>
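<p>A minimal sketch of this noise compensation, assuming the formulas above; the function name and the NumPy vectorization are ours, not Choi's reference implementation:</p>
<p>
```python
import numpy as np

def choi_compensation(Y, N_hat, beta=0.001, gamma=0.4):
    """Noise-compensated log Mel-filterbank outputs: a spectral-
    subtraction-style floor (gamma * Y) inside the log, weighted by
    the per-filter SNR term alpha."""
    Y = np.asarray(Y, dtype=float)
    N_hat = np.asarray(N_hat, dtype=float)
    # alpha_j = log(1 + Y_j / N_hat_j) / sum_k log(1 + Y_k / N_hat_k)
    snr_term = np.log(1.0 + Y / N_hat)
    alpha = snr_term / snr_term.sum()
    # m_j = alpha_j * log(1 + beta * max(Y_j - N_hat_j, gamma * Y_j))
    floored = np.maximum(Y - N_hat, gamma * Y)
    return alpha * np.log(1.0 + beta * floored)
```
</p>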
      <p>
        The next step in the parametrization is to apply the
standard Discrete Cosine Transform to the Mel-filterbank outputs. The
first 12 MFCCs are obtained. But in the case of the filtered
parametrization, a transformation of the energy and each MFCC
component is performed based on the Cumulative
Distribution Mapping (CDM) technique [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is based on the
use of histogram equalization, originally developed for image
processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The last step of the parametrization was the
computation of the first and second time derivatives.
      </p>
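<p>As a sketch of the CDM idea: one common form of histogram equalization maps each coefficient through its empirical CDF and then through the inverse CDF of a reference distribution (here a standard normal). This is an illustrative stand-in, not necessarily the exact variant used by the authors:</p>
<p>
```python
import numpy as np
from statistics import NormalDist

def cdm_equalize(x):
    """Cumulative Distribution Mapping of one coefficient track:
    replace each value by the standard-normal quantile of its
    empirical CDF rank."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.argsort(np.argsort(x))   # rank of each value, 0 .. n-1
    cdf = (ranks + 0.5) / n             # midpoint ranks avoid 0 and 1
    return np.array([NormalDist().inv_cdf(p) for p in cdf])
```
</p>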
      <p>It is worth noting that most queries contain leading
and trailing silences. Therefore, we trimmed the sequence of
feature vectors representing each query by means of a voice
activity detection procedure, in order to help the search
algorithm.</p>
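<p>A crude, purely illustrative energy-based trimming could look like the following; the actual voice activity detection procedure used by the authors is not specified beyond this paragraph:</p>
<p>
```python
import numpy as np

def trim_silence(frames, energy_index=0, ratio=0.1):
    """Drop leading/trailing frames whose energy (assumed here to be
    one component of each feature vector) falls below a fraction of
    the maximum energy. An illustrative stand-in for a real VAD."""
    frames = np.asarray(frames, dtype=float)
    energy = frames[:, energy_index]
    voiced = np.where(energy >= ratio * energy.max())[0]
    if len(voiced) == 0:
        return frames
    # keep everything between the first and last voiced frame
    return frames[voiced[0]:voiced[-1] + 1]
```
</p>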
    </sec>
    <sec id="sec-4">
      <title>4. SEARCH ALGORITHM</title>
      <p>Finding spoken queries within a set of audios is a
complex task, hence we used a Dynamic Programming (DP)
technique in order to face this problem. In particular, we
used S-DTW, that is a DP technique for comparing two
sequences of objects. In our case, one of the sequences
corresponds to feature vectors of one of the audio documents, and
the other one belongs to the query. Therefore, the S-DTW
method nds multiple local alignments of the query within
audio documents, by allowing it to start at any position of
the audio document.</p>
      <p>Equation 1 shows the generic formulation of S-DTW:</p>
      <p>
M(i, j) =
  +∞                                           if i &lt; 0
  +∞                                           if j &lt; 0
  0                                            if j = 0
  min_{(x,y)∈S} M(i−x, j−y) + D(A(i), B(j))    if j ≥ 1
(1)
      </p>
      <p>where M is the DP matrix; S is the set of allowed
movements, represented as pairs (x, y) of horizontal and vertical
increments; A(i) and B(j) are the objects at the
i-th and j-th positions of their respective sequences; and D
is a function that computes some distance or dissimilarity
between two objects.</p>
      <p>In our implementation the set of allowed movements S
is {(1, 2), (1, 1), (2, 1)}. This set of movements guarantees
that the size of any detection will be between 0.5 and 2 times
the size of the query.</p>
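<p>A straightforward (unoptimized) rendering of the recurrence with this movement set might look as follows; the indexing convention and the returned end-point are our assumptions, not the authors' code:</p>
<p>
```python
import numpy as np

def sdtw(doc, query, dist):
    """Subsequence DTW following Equation 1, with the movement set
    S = {(1, 2), (1, 1), (2, 1)}. doc and query are sequences of
    feature vectors, dist is a frame-level distance function.
    Returns (end, cost): the 1-based document position where the best
    detection ends, and its accumulated cost."""
    n, m = len(doc), len(query)
    INF = float("inf")
    # M[i][j]: best cost of aligning query[0..j-1] ending at doc[i-1];
    # the j = 0 column is 0, so a match may start anywhere in doc.
    M = np.full((n + 1, m + 1), INF)
    M[:, 0] = 0.0
    moves = [(1, 2), (1, 1), (2, 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(doc[i - 1], query[j - 1])
            best = INF
            for x, y in moves:
                if i - x >= 0 and j - y >= 0:
                    best = min(best, M[i - x, j - y])
            M[i, j] = best + d
    end = int(np.argmin(M[1:, m])) + 1
    return end, float(M[end, m])
```
</p>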
    </sec>
    <sec id="sec-5">
      <title>5. EXPERIMENTS AND RESULTS</title>
      <p>We performed several preliminary experiments in order to
find the best configuration for our systems.</p>
      <p>We evaluated different distance functions and
parametrizations. One of them was the Kullback-Leibler divergence on
sequences of vectors of probabilities as the representation of the
audio files. The probabilities were obtained from a GMM
estimated by means of the EM algorithm with all the audio
documents in the corpus. Different numbers of components
in the GMM were tried. We also tried the cosine distance
with the Mel-filterbank parametrization. However, we
finally used the cosine distance with the MFCCs, since it provided
the best results for the development set.</p>
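<p>For reference, the cosine distance between two feature vectors, as commonly defined (1 minus the cosine similarity):</p>
<p>
```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two feature vectors: 1 - cos(a, b),
    so parallel vectors give 0 and opposite vectors give 2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
</p>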
      <p>For this MediaEval 2014 Query by Example Search on
Speech Evaluation, we submitted one run for each of the two systems
described above. The results we obtained are shown in
Tables 1 and 2. The measure to be optimized for this
Evaluation was the normalized cross entropy cost (Cnxe). However, other
measures such as the Maximum and the Actual Term Weighted
Values (MTWV and ATWV, respectively) were considered
as secondary metrics, as they are very widely used in this
kind of task.</p>
      <p>Results shown in both tables reveal a poor performance of
our systems (a high value of Cnxe). Nevertheless, given the
difference in the sources of the audio documents and audio
queries, we expected a higher accuracy for our system that
uses the Choi parametrization.</p>
      <p>We ran our own multi-threaded implementation of the S-DTW
algorithm on a standard PC with an i7 processor and 16 GB
of RAM, using 8 threads on a Linux operating system. At the
parametrization stage, we achieved an indexing speed factor
of 1.26×10⁻², and our memory peak was around 0.25 GB. At
the search stage, our searching speed factor was 2.34×10⁻³.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS</title>
      <p>In this paper, we have presented the systems we
submitted to the MediaEval 2014 Query by Example Search on
Speech Evaluation, as well as the results obtained. This
was a very challenging task in which both exact and varied
occurrences of the queries within the documents had to be
found. Despite our preliminary attempts, our approach
has proven not to be suitable for this task. One of the
reasons is the nature of the S-DTW algorithm: its
use makes it impossible to find occurrences of queries where
a reordering of words is needed. However, we would like to
point out that significant improvements were observed when
trimmed queries were used for the development set.</p>
      <p>As future work, we would like to improve our system in
order to use it in tasks like QUESST, where swaps in the
order of the components of a query can happen. Facing this
kind of word reordering would be possible if a higher level
of knowledge were used, e.g. sequences of phonemes instead
of only sequences of acoustic feature vectors. It is not
possible to use words in a task where distinct languages may
appear and no source other than the audio files is provided.</p>
    </sec>
    <sec id="sec-7">
      <title>7. ACKNOWLEDGMENTS</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrarons</surname>
          </string-name>
          .
          <article-title>Memory efficient subsequence DTW for Query-by-Example spoken term detection</article-title>
          .
          <source>In 2013 IEEE International Conference on Multimedia and Expo. IEEE</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rodriguez-Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szoke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <article-title>Query by Example Search on Speech at Mediaeval 2014</article-title>
          . In MediaEval 2014 Workshop, 16-17 October
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E. H. C.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>On compensating the mel-frequency cepstral coefficients for noisy speech recognition</article-title>
          .
          <source>In Proceedings of the 29th Australasian Computer Science Conference - Volume 48, ACSC '06</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>54</lpage>
          , Darlinghurst, Australia,
          <year>2006</year>
          . Australian Computer Society, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Russ</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Woods</surname>
          </string-name>
          .
          <article-title>The image processing handbook</article-title>
          .
          <source>Journal of Computer Assisted Tomography</source>
          ,
          <volume>19</volume>
          (
          <issue>6</issue>
          ):
          <fpage>979</fpage>
          -
          <lpage>981</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>