<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Petra Galuščáková, Pavel Pecina</string-name>
          <email>{galuscakova,pecina}@ufal.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Kruliš, Jakub Lokoč</string-name>
          <email>{krulis,lokoc}@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Department of Software Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this report, we present our experiments performed for the Hyperlinking part of the Search and Hyperlinking Task in the MediaEval Benchmark 2014. Our system successfully combines features from multiple modalities (textual, visual, and prosodic) and confirms the positive effect of our previously proposed segmentation method based on Decision Trees.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The main aim of the Hyperlinking sub-task is to find
segments similar to a given (query) segment in a collection of
audio-visual recordings. The created hyperlinks enable users to
browse the collection, thus improving its exploratory search
capability and adding entertainment value [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The data consists of 1335 hours of BBC broadcast
recordings available for training and 2686 hours available for
testing. In our experiments, we exploit subtitles, automatic
speech recognition transcripts by LIMSI [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], LIUM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and
NST-Sheffield [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], visual features (shots and keyframes),
and prosodic features extracted with openSMILE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], all available for the task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SEARCH SYSTEM</title>
      <p>
        Our search system for the Hyperlinking sub-task is
identical to the system used in the Search sub-task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We apply
the same retrieval model with the same settings and
segmentation methods: the fixed-length segmentation and the
segmentation employing Decision Trees (DT). The segment
length used in the Hyperlinking was tuned on the
training data and set to 50 seconds. Similarly to the Search
sub-task, we also exploit metadata by appending the metadata
of each recording to the text (subtitles/transcripts) of each
of its segments, and we post-filter retrieved segments
which partially overlap with another, higher-ranked segment.
In addition, we also remove all retrieved segments which
partially overlap with the query segment.
      </p>
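      <p>
        The two post-filtering steps can be illustrated by the following
minimal Python sketch. This is our own reconstruction, not the actual
system code: the tuple-based segment representation and the helper names
are illustrative assumptions.
      </p>
      <preformat>
# Minimal sketch of the overlap post-filtering (illustrative assumption,
# not the actual system code). A result is a (video_id, start, end) tuple;
# 'results' is assumed to be sorted by decreasing retrieval score.

def overlaps(a, b):
    """True if two segments come from the same video and share time."""
    return a[0] == b[0] and a[1] &lt; b[2] and b[1] &lt; a[2]

def post_filter(results, query_segment):
    """Drop results overlapping the query segment or a higher-ranked result."""
    kept = []
    for seg in results:
        if overlaps(seg, query_segment):
            continue
        if any(overlaps(seg, better) for better in kept):
            continue
        kept.append(seg)
    return kept
      </preformat>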
    </sec>
    <sec id="sec-3">
      <title>3. HYPERLINKING</title>
      <p>In the Hyperlinking sub-task, we first transformed the
query segment into a textual query consisting of all the
words of the subtitles lying within the segment boundary.
Then, we extended the segment boundary by including the
context surrounding the query segment. The optimal length
of the surrounding context was tuned on the training data;
we used a 200-second-long passage before and after each
segment.</p>
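      <p>
        The following sketch illustrates this query construction step. It is a
minimal reconstruction under assumptions of ours: the (start, end, text)
subtitle layout and the function name are illustrative, not the actual
implementation.
      </p>
      <preformat>
# Illustrative sketch of textual query construction (assumed data layout).
# Subtitles are (start, end, text) tuples with times in seconds.

CONTEXT = 200.0  # tuned on the training data: 200 s before and after

def build_query(subtitles, seg_start, seg_end, context=CONTEXT):
    """Concatenate subtitle words lying within the extended boundary."""
    lo, hi = seg_start - context, seg_end + context
    words = []
    for start, end, text in subtitles:
        # keep subtitle lines overlapping the extended interval
        if start &lt; hi and lo &lt; end:
            words.extend(text.split())
    return " ".join(words)
      </preformat>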
    </sec>
    <sec id="sec-4">
      <title>3.1 Visual Similarity</title>
      <p>
        The visual modality was employed in the following way.
First, we calculated the distance between each keyframe in the
collection and each query segment keyframe using the
Signature Quadratic Form Distance [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ] and Feature
Signatures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (the parameter of the method was tuned on the
training data). Then, we calculated the VisualSimilarity
of each query/segment pair as the maximal similarity
(1 − distance) between the keyframes in the query and the keyframes
in the segment. The calculated VisualSimilarity was used
to modify the final score of the segment in the retrieval for a
particular query segment as follows, where the Weight parameter
was tuned on the training data and Score(segment, query)
is the output of the retrieval on the subtitles/transcripts:
FinalScore(segment, query) = Score(segment, query)
+ Weight × VisualSimilarity(segment, query).
      </p>
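      <p>
        A minimal sketch of this scoring, assuming a distance function sqfd
that implements the Signature Quadratic Form Distance over feature
signatures (not reproduced here):
      </p>
      <preformat>
# Sketch of the visual score combination (illustrative; 'sqfd' stands in
# for the Signature Quadratic Form Distance over feature signatures).

def visual_similarity(query_keyframes, segment_keyframes, sqfd):
    """Maximal (1 - distance) over all query/segment keyframe pairs."""
    return max(1.0 - sqfd(q, s)
               for q in query_keyframes
               for s in segment_keyframes)

def final_score(text_score, vis_sim, weight):
    """FinalScore = Score + Weight * VisualSimilarity."""
    return text_score + weight * vis_sim
      </preformat>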
    </sec>
    <sec id="sec-5">
      <title>3.2 Prosodic Similarity</title>
      <p>The eight prosodic features provided in the data (energy,
loudness, voice probability, pitch, pitch direction, direction
score, voice quality, and harmonics-to-noise ratio) were used
to construct an 8-dimensional prosodic vector for every 10 ms of
the recordings. We took the overlapping sequences of 10 vectors
appearing up to 1 second from the beginning of the query
segment and found the most similar sequence of vectors
in each segment.</p>
      <p>The similarity between two vector sequences was calculated as
the sum of the differences between the corresponding vectors of
the sequences. These differences were calculated as the sum
of the absolute values of the differences between the
corresponding items of the prosodic vectors. To ensure that all
prosodic features have equal weights, each
item of the prosodic vector was normalized to
values between 0 and 1. Due to the computational
costs, we only took into account the vector sequences lying
at most 1 second from the beginning of the segment. The
final score of each segment was calculated in the same way
as the final score for the visual similarity; the Weight for
the prosodic similarity was tuned on the training set.</p>
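      <p>
        The matching can be sketched as follows. This is our reconstruction
under stated assumptions: vectors come every 10 ms (so a 1-second window
holds 100 of them), components are already normalized to [0, 1], and the
negated distance serves as the similarity that enters the final score.
      </p>
      <preformat>
# Sketch of the prosodic sequence matching (illustrative reconstruction).
# 'query_vectors' and 'segment_vectors' hold one 8-dimensional vector per
# 10 ms, with every component already normalized to the range [0, 1].

SEQ_LEN = 10   # sequences of 10 vectors (100 ms)
WINDOW = 100   # only sequences starting within 1 s of the beginning

def seq_distance(a, b):
    """Sum of L1 distances between corresponding vectors of two sequences."""
    return sum(abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v))

def prosodic_similarity(query_vectors, segment_vectors):
    """Negated distance of the best-matching pair of initial sequences."""
    best = float("inf")
    for i in range(min(WINDOW, len(query_vectors) - SEQ_LEN + 1)):
        q = query_vectors[i:i + SEQ_LEN]
        for j in range(min(WINDOW, len(segment_vectors) - SEQ_LEN + 1)):
            s = segment_vectors[j:j + SEQ_LEN]
            best = min(best, seq_distance(q, s))
    return -best  # smaller distance means higher similarity
      </preformat>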
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS</title>
      <p>
        The results of the Hyperlinking sub-task are displayed in
Table 1. We report the following evaluation measures: Mean
Average Precision (MAP), Precision at 5 (P5), Precision at
10 (P10), Precision at 20 (P20), Binned Relevance
(MAP-bin), and Tolerance to Irrelevance (MAP-tol) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Not surprisingly, the highest scores in MAP, MAP-bin, and the
precision-based measures are reached when
overlapping segments are preserved in the results [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Unlike in the Search sub-task, the segmentation employing
Decision Trees outperforms the fixed-length
segmentation on most of the measures. There is a consistent
improvement when the visual weights are used,
and a small but promising improvement in the MAP, P20, and
MAP-tol measures when the prosodic features are
used. The concatenation of the context and metadata
also proves to be beneficial: the improvement appears on all
transcripts, and the MAP score rises more than five-fold on the
LIUM transcripts when metadata and context are used.
      </p>
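      <p>
        For reference, the standard definitions of the precision-based and
MAP measures are sketched below; the segment-oriented adaptations MAP-bin
and MAP-tol are defined in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and are not reproduced here.
      </p>
      <preformat>
# Standard P@k and MAP over binary relevance judgements (best rank first);
# the segment-based variants MAP-bin and MAP-tol follow [1].

def precision_at_k(rels, k):
    """Fraction of relevant items among the top k results."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Mean of P@rank taken at the rank of each relevant result."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    # divide by the number of relevant items (assumed all present in 'rels')
    return score / max(1, sum(rels))

def mean_average_precision(rankings):
    return sum(map(average_precision, rankings)) / len(rankings)
      </preformat>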
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This research is supported by the Czech Science
Foundation (grant number P103/12/G084), the Charles University
Grant Agency GA UK (grant number 920913), and the SVV
project number 260 104.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Adapting Binary Information Retrieval Evaluation Metrics for Segment-based Retrieval Tasks</article-title>
          . CoRR, abs/1312.1913,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J. F.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Linking Inside a Video Collection: What and How to Measure?</article-title>
          <source>In Proc. of WWW</source>
          , pages
          <fpage>457</fpage>
          -
          <lpage>460</lpage>
          , Rio de Janeiro, Brazil,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Beecks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Uysal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Seidl</surname>
          </string-name>
          .
          <article-title>Signature Quadratic Form Distance</article-title>
          .
          <source>In Proc. of CIVR</source>
          , pages
          <fpage>438</fpage>
          -
          <lpage>445</lpage>
          , Xi'an, China,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking Task at MediaEval 2014</article-title>
          .
          <source>In Proc. of MediaEval</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor</article-title>
          .
          <source>In Proc. of ACMMM</source>
          , pages
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          , Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          .
          <article-title>CUNI at MediaEval 2014 Search and Hyperlinking Task: Search Task Experiments</article-title>
          .
          <source>In Proc. of MediaEval</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kruliš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lokoč</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Skopal</surname>
          </string-name>
          .
          <article-title>Efficient Extraction of Feature Signatures Using Multi-GPU Architecture</article-title>
          .
          <source>In MMM (2)</source>
          , volume
          <volume>7733</volume>
          <source>of LNCS</source>
          , pages
          <fpage>446</fpage>
          -
          <lpage>456</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kruliš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Skopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lokoč</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Beecks</surname>
          </string-name>
          .
          <article-title>Combining CPU and GPU Architectures for Fast Similarity Search</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          ,
          <volume>30</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>179</fpage>
          -
          <lpage>207</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          .
          <article-title>Speech Processing for Audio Indexing</article-title>
          .
          <source>In Proc. of GoTAL</source>
          , pages
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
          , Gothenburg, Sweden,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quinnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Seigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Swietojanski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Woodland</surname>
          </string-name>
          .
          <article-title>Automatic Transcription of Multi-genre Media Archives</article-title>
          .
          <source>In Proc. of SLAM Workshop</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          , Marseille, France,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deléglise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Estève</surname>
          </string-name>
          .
          <article-title>Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks</article-title>
          .
          <source>In Proc. of LREC</source>
          , pages
          <fpage>3935</fpage>
          -
          <lpage>3939</lpage>
          , Reykjavik, Iceland,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>