<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IRISA at MediaEval 2015: Search and Anchoring in Video Archives Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anca-Roxana Şimon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascale Sébillot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRISA, INRIA Rennes, Univ. Rennes 1</institution>
          <email>firstname.lastname@irisa.fr</email>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents our approach and results for the Search and Anchoring in Video Archives task at MediaEval 2015. The search part aims at returning a ranked list of video segments that are relevant to a textual user query. The anchoring part focuses on the automatic selection of video segments, from a list of videos, that can be used as anchors to encourage further exploration within the archive. Our approach consists in structuring each video into a hierarchy of topically focused fragments, so as to extract salient segments at different levels of detail with precise jump-in points. These segments are leveraged both to answer the queries and to create anchor segments, relying on content-based analysis and comparisons. The algorithm deriving the hierarchical structure relies on the burstiness phenomenon in word occurrences, which gives it an advantage over the classical bag-of-words model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This paper presents the participation of IRISA at the
MediaEval 2015 Search and Anchoring in Video Archives
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The first part of the task is a search scenario in a
video collection. Starting from a two-field query, where one
field refers to the spoken content and the other to the
visual content, the goal is to retrieve parts of the archived
videos that contain the requested information (audio or
visual). The second part consists in automatically selecting
video segments, called anchors, for which users could
require additional information to explore the archive. The
solutions should help users find relevant information in
the archive and also improve the browsing and navigation
experience.
      </p>
      <p>
        Our approach consists in structuring each video as a
hierarchy of topically focused segments using the algorithm
proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This structure helps to extract segments
with precise jump-in points and at various levels of detail.
Once extracted, the segments are used to select anchor
segments and to answer the queries issued by users. For this
we rely on content-based analysis and comparisons. The
segment extraction step is usually done using fixed-length
segmentation [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ] or linear topic segmentation strategies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The advantage of using the hierarchy of topically focused
segments over traditional solutions is that it helps to
identify the salient (i.e., important) information in the videos,
skipping irrelevant information. Indeed, some fragments
of the data bear important ideas while others are simple
fillers, i.e., they do not bring additional important
information. Moreover, with a hierarchical representation, the
segments we provide as results can be at different
granularities, i.e., more specific or more general, offering different
levels of detail. Anchors that cover a more general topic
or different points of view on some topic can be selected,
while for the search part the results retrieved can offer a
general perspective or a more focused one. The algorithm is
built upon the burstiness phenomenon in word occurrences:
in practice, words tend to appear in bursts (i.e., if a word
appears once it is more likely to appear again) rather than
independently [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Several studies of statistical laws in
language have proposed burst detection models that analyze the
distributional patterns of words [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. We believe that such
an approach brings more focus to what is extracted from the
videos, and not only to the content-based comparison and
analysis step.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>
        The aim of our approach is first to find precise jump-in
points to the salient segments in the videos, at various
levels of detail. These segments are obtained by applying the
algorithm proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (denoted HTFF), which outputs
a hierarchy of topically focused fragments for each video.
HTFF relies on text-like data. Therefore, we exploit
spoken data obtained from automatic transcripts and manual
subtitles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and visual concepts detected for each video [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
More details about the data can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. After
obtaining the topically focused fragments, we perform content
analysis and comparisons to propose the top segments for
the two sub-tasks.
      </p>
      <p>Subsections 2.1–2.3 detail the following: 2.1, the
generation of potential anchor and query-response segments; 2.2,
the selection of the top 20 segments for each query; and 2.3, the
ranking of the anchor segments.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Hierarchy of topically focused segments</title>
      <p>Each video in the test collection is partitioned into a
hierarchy of topically focused fragments with the automatic
segmentation algorithm HTFF, which is domain
independent, needs no a priori information and has proven to offer a
good representation of the information contained in videos.
It can be applied to any text-like data. For the search
sub-task we are also provided with a visual query, i.e., visual
concept words; here, LIMSI transcripts
and the visual concepts detected for each keyframe in the
video were used. For the anchor detection sub-task,
LIMSI and manual transcripts were used. Before applying the
algorithm, the data in the transcripts are first lemmatized and
only nouns, non-modal verbs and adjectives are kept.</p>
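      <p>As an illustration, this filtering step could look as follows, assuming a tagger/lemmatizer has already produced (lemma, POS) pairs; the tag set and the modal-verb list are assumptions for the sketch, not the exact tools used here.</p>
      <preformat>
```python
# Hypothetical preprocessing: keep lemmas of nouns, non-modal verbs and adjectives.
MODALS = {"can", "could", "may", "might", "shall", "should", "will", "would", "must"}
KEPT_POS = {"NOUN", "VERB", "ADJ"}

def filter_tokens(tagged):
    """`tagged` is a list of (lemma, pos) pairs from any POS tagger/lemmatizer."""
    return [lemma for lemma, pos in tagged
            if pos in KEPT_POS and not (pos == "VERB" and lemma in MODALS)]

tokens = filter_tokens([("election", "NOUN"), ("be", "VERB"),
                        ("can", "VERB"), ("new", "ADJ"), ("the", "DET")])
# "can" is dropped as a modal verb, "the" as a determiner.
```
      </preformat>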
      <p>
        The core of HTFF is Kleinberg's algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used to
identify word bursts, together with the intervals where they
occurred. A burst interval corresponds to a period where
the word occurs with increased frequency with respect to the
normal behavior. Kleinberg's algorithm outputs a hierarchy
of burst intervals for each word, taking one word at a time
(for more details see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). The HTFF algorithm generates a
hierarchy of salient topics using an agglomerative clustering
of burst intervals found with Kleinberg's algorithm. The
result is a set of nested topically focused fragments which
are hierarchically organized. Next, we describe how the best
segments are proposed for each sub-task.
      </p>
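      <p>Kleinberg's algorithm models bursts with a two-state automaton and outputs nested burst intervals per word. As a rough illustration of the underlying idea only (not the actual algorithm), the sketch below flags windows where a word's local occurrence rate clearly exceeds its global rate; the window size and threshold factor are arbitrary choices for the example.</p>
      <preformat>
```python
def burst_intervals(positions, n_tokens, window=50, factor=2.0):
    """Toy burst detector: flag fixed-size windows where a word's local rate
    exceeds `factor` times its global rate, then merge contiguous flagged
    windows into intervals. A crude stand-in for Kleinberg's state machine.
    `positions` are the token indices at which the word occurs."""
    global_rate = len(positions) / n_tokens
    flagged = []
    for start in range(0, n_tokens, window):
        hits = sum(start <= p < start + window for p in positions)
        if hits / window > factor * global_rate:
            flagged.append((start, min(start + window, n_tokens)))
    merged = []
    for s, e in flagged:  # merge adjacent flagged windows into one interval
        if merged and merged[-1][1] == s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# A word occurring 6 times in the first 26 tokens of a 500-token transcript
# yields a single burst interval covering the first window.
intervals = burst_intervals([0, 5, 10, 15, 20, 25], n_tokens=500)
```
      </preformat>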
    </sec>
    <sec id="sec-4">
      <title>2.2 Search sub-task</title>
      <p>A cosine similarity measure is computed between each
query and the content of the segments previously extracted.
This measure is computed for segments from all levels of
the hierarchy, and segments with higher similarity are ranked
higher. In this setting, short, focused and highly similar
segments are favored. This procedure is applied to the textual
and visual queries independently.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Anchor selection sub-task</title>
      <p>
        Given the list of salient segments for every video
from which anchors need to be extracted, we compute a
cohesion measure to rank these fragments. The measure is
a probabilistic one, where the lexical cohesion of a segment Si
is computed using a Laplace law as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], i.e.,
C(Si) = log ∏_{j=1}^{ni} (fi(wji) + 1) / (ni + k),
      </p>
      <p>where ni is the number of word occurrences in Si, fi(wji) is
the number of occurrences of the word wji in segment Si and
k is the number of words in the vocabulary V. The quantity C(Si) increases
when words are repeated and decreases when
they are different. Using HTFF for anchor detection does
not guarantee any particular number of anchor segments
for a video. Therefore, some videos might have more or fewer
anchors proposed than others. This is realistic, since the
number of anchors that can be found in a video depends on
the information it contains.</p>
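      <p>A minimal sketch of this cohesion score, assuming tokenized segments and a known vocabulary size k; the log of the product is computed as a sum of logs for numerical stability. The toy segments are invented for the example.</p>
      <preformat>
```python
from collections import Counter
from math import log

def cohesion(tokens, vocab_size):
    """Laplace-smoothed lexical cohesion of a segment:
    C(Si) = log prod_j (fi(wj) + 1) / (ni + k), computed as a sum of logs."""
    n = len(tokens)          # ni: number of word occurrences in the segment
    freqs = Counter(tokens)  # fi(w): occurrences of each word in the segment
    return sum(log((freqs[w] + 1) / (n + vocab_size)) for w in tokens)

# A repetitive segment scores higher than one whose words are all distinct.
k = 1000
repetitive = cohesion(["budget"] * 4, k)
diverse = cohesion(["budget", "rain", "vote", "goal"], k)
```
      </preformat>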
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS</title>
      <p>For the search sub-task, 30 test set queries were defined.
The top 10 results for each query were evaluated for each
method, using crowd-sourcing technologies. Our results for
the search sub-task are reported in Table 1. LIMSI denotes
the system using LIMSI automatic transcripts and the textual
query, while Visual denotes the system relying on visual
concepts and the visual query. The best results are obtained with
the LIMSI system. Analyzing the list of all the segments
proposed by participants, it can be observed that the
segments proposed with our approach are shorter in duration.
Figure 1 illustrates the duration of the segments that were
judged relevant or not with both our systems (LIMSI and
Visual) compared to those proposed by other participants.</p>
      <p>[Table 1. Search sub-task results; P@5: LIMSI 0.34, Visual 0.12.]</p>
      <p>
The segments we proposed are on average less than half the
length of the segments proposed by other participants. This
was detrimental to our approach: some of the short
segments proposed with our methods were judged not relevant,
while longer segments covering these short segments were
judged relevant. However, many of our short segments do
not overlap with longer segments proposed by others, so in
the end they remain judged as not relevant.</p>
      <p>For the anchor sub-task, a list of 33 videos was defined, for
which anchors had to be proposed. The top-25 ranks for each
video and each method were judged by crowd-sourcing, using
Amazon Mechanical Turk workers who gave their opinion on
these segments taken from the context of the videos.
Precision, recall and Mean Reciprocal Rank (MRR) measures
were used. The results obtained for both our systems,
LIMSI (using LIMSI automatic transcripts) and Manual
(using manual subtitles), are reported in Table 2. The best results were
obtained when relying on automatic transcripts.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>The results obtained on both sub-tasks show that while
short segments are a good idea for anchor detection, for
the search sub-task assessors seem to need more context to
judge a segment relevant. For future work on the search
sub-task we consider selecting larger segments from a higher level
in the hierarchy (i.e., coarse grain). Additionally, combining
visual and textual bursts could improve the results. For the
anchor detection task, different ways to rank the segments
could be considered, favoring segments which contain named
entities or visual bursts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pappas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Habibi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu-Belis</surname>
          </string-name>
          . IDIAP at MediaEval 2013:
          <article-title>Search and hyperlinking task</article-title>
          .
          <source>In Proceedings of the MediaEval Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          . SAVA at MediaEval 2015:
          <article-title>Search and anchoring in video archives</article-title>
          .
          <source>In Proceedings of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          . CUNI at
          <article-title>MediaEval 2014 search and hyperlinking task: Search task experiments</article-title>
          .
          <source>In Proceedings of the MediaEval Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Adda</surname>
          </string-name>
          .
          <article-title>The LIMSI broadcast news transcription system</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>37</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>89</fpage>
          –
          <lpage>108</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Bursty and hierarchical structure in streams</article-title>
          .
          <source>In 8th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>91</fpage>
          –
          <lpage>101</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Madsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kauchak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          .
          <article-title>Modeling word burstiness using the Dirichlet distribution</article-title>
          .
          <source>In 22nd International Conference on Machine Learning, ICML '05</source>
          , pages
          <fpage>545</fpage>
          –
          <lpage>552</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>DCU search runs at MediaEval 2014 search and hyperlinking</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Garthwaite</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>De Roeck</surname>
          </string-name>
          .
          <article-title>A Bayesian mixture model for term re-occurrence and burstiness</article-title>
          .
          <source>In 9th Conference on Computational Natural Language Learning</source>
          ,
          <source>CONLL '05</source>
          , pages
          <fpage>48</fpage>
          –
          <lpage>55</lpage>
          . Association for Computational Linguistics,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.-R.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sebillot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          .
          <article-title>Hierarchical topic structuring: from dense segmentation to topically focused fragments via burst analysis</article-title>
          .
          <source>In Recent Advances in NLP, Hissar, Bulgaria</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tommasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B. N.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatfield</surname>
          </string-name>
          , et al.
          <article-title>Beyond metadata: searching your archive based on its audio-visual content</article-title>
          .
          <source>In International Broadcasting Convention</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Utiyama</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          .
          <article-title>A statistical model for domain-independent text segmentation</article-title>
          .
          <source>In Association for Computational Linguistics</source>
          , Toulouse, France,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>