<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TOSCA-MP at Search and Hyperlinking of Television Content Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michał Lokaj</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Stiegler</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Austria werner.bailer@joanneum.at</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper describes the work done by the TOSCA-MP team for the linking subtask. We submitted three sets of runs: text-only with fixed segments, text-only aligned with shot boundaries, and text and visual with fixed segments. Each of these sets consists of six runs, using combinations of three different types of text resources, and for each using only the anchor segment or the anchor plus context as input. The results show significant improvements from taking the context of the anchor into account, and smaller improvements when additionally using visual features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The MediaEval 2013 Search and Hyperlinking of
Television Content Task addresses the scenario of performing
text-based known-item search in a video collection (search
subtask) and subsequent exploration of related video segments
(hyperlinking subtask). Such a scenario is well aligned with
the goals of the TOSCA-MP project, which aims at developing
task-adaptive analysis and search tools for professional
media production. Content needs in media production
are not always sharply defined: in some cases
comprehensive coverage of a topic may be needed, or media creators
aim at finding more diverse, less well-known material. All
these cases fit the pattern of performing a first search
task and using the result set for further interactive
exploration of the video collection.</p>
      <p>
        This paper describes the work done by the TOSCA-MP
team for the linking subtask. Details on the task and the
data set can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. LINKING SUBTASK</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Approach</title>
      <p>For the linking subtask, we combine textual/metadata
similarity and visual similarity. The textual/metadata
similarity is based on matching terms and named entities, and
provides a basic set of result segments. In some runs, visual
similarity based on local descriptors is used for reranking.
The textual/metadata-based approach uses the ASR
transcript or subtitles, the metadata about the broadcast
(synopsis) and the text of the query related to the anchor as
inputs. All these textual resources are preprocessed by
removing punctuation, normalizing capitalization and
removing stop words and very short words (less than three
characters). We then select a basic set of terms T = T<sub>a</sub> ∪ T<sub>q</sub> ∪ T<sub>m</sub>,
which are the words from the three cleaned text resources
(anchor, query, metadata) that are found in DBpedia (dbpedia.org). For
the ASR transcript or subtitles, we then broaden the set
of terms and select specific classes. As a first step, we add
synonyms for the terms in T from WordNet (wordnet.princeton.edu),
obtaining a set S<sub>T</sub>. We then select a set of connected entities
C<sub>T</sub> for the terms in T from FreeBase (www.freebase.com). For the
subset of terms T<sub>g</sub> ⊆ T, which FreeBase identifies as related to a
geographic location, we also add the set of connected geographic
entities G<sub>Tg</sub> from GeoNames (www.geonames.org). Thus the extended
set of terms used for matching is T′ = T ∪ S<sub>T</sub> ∪ C<sub>T</sub> ∪ G<sub>Tg</sub>.</p>
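      <p>As an illustration, the following sketch assembles such a term set. The lookups against DBpedia, WordNet, FreeBase and GeoNames are stubbed out as hypothetical helper callables (they are not part of the paper); only the combination logic follows the description above.</p>
      <preformat>
import re

STOP_WORDS = {"the", "and", "for", "that", "with"}  # abbreviated stop list

def clean(text):
    """Remove punctuation, lowercase, drop stop words and words under 3 characters."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return {w for w in words if len(w) >= 3 and w not in STOP_WORDS}

def build_term_sets(anchor, query, metadata,
                    in_dbpedia, synonyms, connected, is_geo, geo_entities):
    """Assemble T, S_T, C_T, G_Tg and the extended set T' as described above.
    The five lookup callables are hypothetical stand-ins for the
    DBpedia / WordNet / FreeBase / GeoNames services."""
    # T = T_a ∪ T_q ∪ T_m, keeping only words found in DBpedia
    T = {w for w in clean(anchor) | clean(query) | clean(metadata) if in_dbpedia(w)}
    S_T = {s for t in T for s in synonyms(t)}          # WordNet synonyms
    C_T = {c for t in T for c in connected(t)}         # FreeBase connected entities
    T_g = {t for t in T if is_geo(t)}                  # geographic subset of T
    G_Tg = {g for t in T_g for g in geo_entities(t)}   # GeoNames entities
    return T, S_T, C_T, G_Tg, T | S_T | C_T | G_Tg    # last item is T'
</preformat>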
      <p>For matching two segments, we match the terms related
to these segments with different weights:
w(t) = w<sub>o</sub> if t ∈ T; w(t) = w<sub>g</sub> if t ∈ G<sub>Tg</sub>;
w(t) = w<sub>s</sub> if t ∈ S<sub>T</sub> ∪ C<sub>T</sub>; with w<sub>s</sub> &lt; w<sub>g</sub> &lt; w<sub>o</sub>. (1)
For K multiple occurrences of a term in a segment, the weight of
each occurrence decreases, with the total weight defined as
ŵ(t) = Σ<sub>k=1</sub><sup>K</sup> (1/k) w(t). For a pair of video segments (v<sub>1</sub>, v<sub>2</sub>),
the similarity is determined as Σ<sub>t ∈ T′(v1) ∩ T′(v2)</sub> ŵ(t), with
T′(v<sub>i</sub>) being the extended set of terms of segment v<sub>i</sub>.</p>
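      <p>A minimal sketch of this weighting and similarity computation; the concrete weight values are illustrative rather than taken from the paper, and the term sets are assumed to come from the construction above.</p>
      <preformat>
W_O, W_G, W_S = 1.0, 0.6, 0.3  # illustrative values obeying w_s &lt; w_g &lt; w_o

def base_weight(term, T, S_T, C_T, G_Tg):
    """Per-term weight w(t) according to Eq. (1)."""
    if term in T:
        return W_O
    if term in G_Tg:
        return W_G
    if term in S_T or term in C_T:
        return W_S
    return 0.0

def total_weight(term, occurrences, T, S_T, C_T, G_Tg):
    """Discounted total weight for K occurrences: sum over k of (1/k) * w(t)."""
    w = base_weight(term, T, S_T, C_T, G_Tg)
    return sum(w / k for k in range(1, occurrences + 1))

def similarity(terms_v1, terms_v2, T, S_T, C_T, G_Tg):
    """Segment similarity: sum of weights over terms shared by both segments.
    terms_v1 / terms_v2 are dicts mapping each extended term to its count."""
    shared = set(terms_v1) & set(terms_v2)
    return sum(total_weight(t, terms_v1[t], T, S_T, C_T, G_Tg) for t in shared)
</preformat>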
      <p>For initial text-based matching, the videos have been
segmented into segments of equal length (20 seconds). In
the experiments, we cut the result lists at a normalized similarity
score of 0.35, keeping at least 75 items. On these raw
results, reranking based on visual features or alignment with
shot boundaries is applied.</p>
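      <p>A sketch of this list cut-off; the paper does not state how the scores are normalized, so normalization by the top score is an assumption here.</p>
      <preformat>
def cut_result_list(results, threshold=0.35, min_items=75):
    """Keep items whose normalized score stays above the threshold,
    but never fewer than min_items. `results` is a list of
    (segment_id, score) pairs sorted by score in descending order."""
    if not results:
        return results
    top = results[0][1] or 1e-9  # assumption: normalize by the top score
    kept = [(seg, s) for seg, s in results if s / top >= threshold]
    return kept if len(kept) >= min_items else results[:min_items]
</preformat>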
      <p>
        The visual matching approach is based on the well-known
SIFT descriptor [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], extracted from DoG interest points in
the video. Only one field of each input image is used in order
to avoid possible side effects of interlaced content.
Descriptors are extracted from every fifth frame (i.e., every tenth
field), detecting several hundred keypoints (the number depends
on the resolution and the structure of the content itself). We have
performed complete pairwise matching of the set of candidate link
segments (the result of textual matching). Both descriptor extraction
and matching are implemented on the GPU using NVIDIA CUDA.
      </p>
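      <p>An illustrative CPU analogue of this extraction step using OpenCV's SIFT implementation (the paper's GPU/CUDA implementation is not reproduced here); keeping every second image row approximates using a single field of interlaced material.</p>
      <preformat>
import cv2

def extract_descriptors(video_path, frame_step=5, max_keypoints=500):
    """Extract SIFT descriptors from every frame_step-th frame of a video,
    using only one field (every second row) to sidestep interlacing."""
    sift = cv2.SIFT_create(nfeatures=max_keypoints)  # several hundred keypoints
    cap = cv2.VideoCapture(video_path)
    descriptors = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:
            field = frame[::2]  # keep every second row: one field only
            gray = cv2.cvtColor(field, cv2.COLOR_BGR2GRAY)
            _, desc = sift.detectAndCompute(gray, None)
            if desc is not None:
                descriptors.append(desc)
        index += 1
    cap.release()
    return descriptors
</preformat>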
      <p>In order to align candidate segments with shot
boundaries, all adjacent candidate segments have been matched
to the respective shots. In order to avoid a bias for text
matching from the shots, only scores within 30 seconds have
been counted; however, the entire shot has been reported
as the result. Following the task guidelines, the result segments
have been cut to at most 120 seconds, even if some result shots
exceed this length.</p>
      <p>[Figure 1: results per anchor; legend: MAP, MAP (visual), P@5 (visual), P@10 (visual)]</p>
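      <p>A simplified sketch of this alignment step, under the assumption that shots are given as (start, end) pairs in seconds and candidates as (start, end, score) triples; the exact aggregation in the paper may differ.</p>
      <preformat>
def align_to_shots(candidates, shots, score_window=30.0, max_length=120.0):
    """Aggregate 20 s candidate segments into their enclosing shots.
    Only candidates starting within score_window seconds of the shot start
    contribute their score (avoiding a text-matching bias from long shots);
    the entire shot is reported, cut to at most max_length seconds."""
    results = []
    for shot_start, shot_end in shots:
        score = sum(s for (c_start, c_end, s) in candidates
                    if c_start >= shot_start
                    and shot_end > c_start
                    and score_window > c_start - shot_start)
        if score > 0:
            reported_end = min(shot_end, shot_start + max_length)
            results.append((shot_start, reported_end, score))
    return sorted(results, key=lambda r: r[2], reverse=True)
</preformat>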
    </sec>
    <sec id="sec-4">
      <title>2.2 Experiments and Results</title>
      <p>We generated three sets of runs: text-only with fixed
segments, text-only aligned with shot boundaries, and text and
visual with fixed segments. Each of these sets consists of six
runs, using each of the three different types of text resources
(two ASR transcripts and the subtitles), and for
each using only the anchor segment or the anchor plus context
as input. For the runs using visual features, a higher weight
for visual features was used (0.7 vs. 0.3 for text features),
in order to make the results different from the text-only runs.
This may not be the optimal weight combination.</p>
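      <p>The paper does not spell out the combination formula; a linear blend of the two (normalized) scores is one plausible reading of the 0.7/0.3 weighting, sketched below.</p>
      <preformat>
def combined_score(text_score, visual_score, w_visual=0.7, w_text=0.3):
    """Rerank by a weighted combination of text and visual similarity;
    the linear form is an assumption, while the 0.7/0.3 split is the
    weighting stated for the visual runs."""
    return w_visual * visual_score + w_text * text_score
</preformat>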
      <p>The best runs reached up to 0.21 mean average precision
(MAP), median average precision is 0.21 as well. The results
are quite consistent across the anchors, with the median AP
being very similar to the mean AP. The results for the top
ranks are much better, with mean precision at rank 5 up
to 0.83 and at rank 10 up to 0.7. At the top 5 ranks, the
median precision is even higher, reaching 1.00 for three of
the runs. Using the context of the video around the anchor
had a very strong impact on the results. In terms of MAP
the increase is about 0.10, i.e. MAP roughly doubles.</p>
      <p>Reranking using visual features slightly improves the
results in all cases, with the most improvement at the top 5 ranks.
The use of the visual features has quite different effects
on the different types of anchors, causing an increase for
some and a decrease for others (see Figure 1), depending on
whether a query focuses more on topical/textual content or
is more visual, be it because the query is more descriptive or
because the scene or dominant objects are shared by all relevant
segments. Using a shot-based segmentation did not in general
improve the results.</p>
      <p>
        Looking at the result segments, we did not expect
significant differences between the runs with the same
configuration but different types of transcripts. The
distribution between the types of terms is also quite similar for
all text resources (about 45% named entities, another
45% words and synonyms, and the other types 1-2% each).
Only LIMSI has a slightly lower fraction of matching query
terms and metadata than the others. The results for the
different textual resources are quite similar, with the
LIUM-based [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] runs performing slightly better than those using
LIMSI/Vocapia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] transcripts or manual subtitles. There
is one outlier run based on subtitles, which performs
significantly worse when shot boundaries are used, but only in
the non-context case. This seems to be a particular issue of
the alignment of the transcript with the shot boundaries for the
segments involved, rather than a general pattern.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. CONCLUSION</title>
      <p>The results are quite encouraging, and show that the
proposed method yields useful results for the linking subtask,
especially at the top five to ten ranks. Taking the context of
the anchor into account provides a significant improvement
of the results. The use of visual reranking provides
small but consistent improvements. The use of shots
as result segments does not generally improve the results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research leading to these results has received funding
from the European Union's Seventh Framework Programme
(FP7/2007-2013) under grant agreement no. 287532,
"TOSCA-MP - Task-oriented search and content annotation for media
production" (http://www.tosca-mp.eu).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Maria</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shu</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robin</given-names>
            <surname>Aly</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Roeland</given-names>
            <surname>Ordelman</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking Task at MediaEval 2013</article-title>
          . In
          <source>MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Lori</given-names>
            <surname>Lamel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jean-Luc</given-names>
            <surname>Gauvain</surname>
          </string-name>
          .
          <article-title>Speech processing for audio indexing</article-title>
          .
          In
          <source>Advances in Natural Language Processing (LNCS 5221)</source>
          , pages
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ):
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deléglise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Estève</surname>
          </string-name>
          .
          <article-title>LIUM's systems for the IWSLT 2011 Speech Translation Tasks</article-title>
          .
          In
          <source>Proceedings of IWSLT 2011</source>
          , San Francisco, USA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>