IRISA at MediaEval 2015: Search and Anchoring in Video Archives Task

Anca-Roxana Şimon*, Guillaume Gravier**, Pascale Sébillot***
IRISA & INRIA Rennes
Univ. Rennes 1*, CNRS**, INSA***
35042 Rennes Cedex, France
firstname.lastname@irisa.fr

ABSTRACT
This paper presents our approach and results in the Search and Anchoring
in Video Archives task at MediaEval 2015. The Search part aims at
returning a ranked list of video segments that are relevant to a textual
user query. The Anchoring part focuses on the automatic selection of video
segments, from a list of videos, that can be used as anchors to encourage
further exploration within the archive. Our approach consists in
structuring each video into a hierarchy of topically focused fragments, to
extract salient segments in the videos at different levels of detail with
precise jump-in points. These segments are leveraged both to answer the
queries and to create anchor segments, relying on content-based analysis
and comparisons. The algorithm deriving the hierarchical structure relies
on the burstiness phenomenon in word occurrences, which gives an advantage
over the classical bag-of-words model.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1.   INTRODUCTION
   This paper presents the participation of IRISA at the MediaEval 2015
Search and Anchoring in Video Archives task [2]. The first part of the
task is a search scenario in a video collection. Starting from a two-field
query, where one field refers to the spoken content and the other refers
to the visual content, the goal is to retrieve parts of the archived
videos that contain the requested information (audio or visual). The
second part consists in automatically selecting video segments, called
anchors, for which users could require additional information to explore
the archive. The solutions should help users find relevant information in
the archive and also improve the browsing and navigation experience.
   Our approach consists in structuring each video as a hierarchy of
topically focused segments using the algorithm proposed in [9]. This
structure helps to extract segments with precise jump-in points and at
various levels of detail. Once extracted, the segments are used to select
anchor segments and to answer the queries issued by users. For this we
rely on content-based analysis and comparisons. The segment extraction
step is usually done using fixed-length segmentation [3, 7] or linear
topic segmentation strategies [1]. The advantage of using the hierarchy of
topically focused segments over traditional solutions is that it helps to
identify the salient (i.e., important) information in the videos, skipping
irrelevant information. Indeed, some fragments of the data bear important
ideas while others are simple fillers, i.e., they do not bring additional
important information. Moreover, thanks to the hierarchical
representation, the segments we provide as results can be at different
granularities, i.e., more specific or more general, offering different
levels of detail. Anchors that cover a more general topic or different
points of view on some topic can be selected, while for the search part
the retrieved results can offer either a general perspective or a more
focused one. The algorithm is built upon the burstiness phenomenon in word
occurrences. In practice words tend to appear in bursts (i.e., if a word
appears once it is more likely to appear again) instead of independently
[6]. Several studies of statistical laws in language have proposed burst
detection models that analyze the distributional patterns of words [8, 6].
We believe that such an approach brings more focus to what is extracted
from the videos, and not only to the content-based comparisons and
analysis part.

2.    SYSTEM OVERVIEW
   The aim of our approach is first to find precise jump-in points to the
salient segments in the videos, at various levels of detail. These
segments are obtained by applying the algorithm proposed in [9] (denoted
HTFF hereafter), which outputs a hierarchy of topically focused fragments
for each video. HTFF relies on text-like data. Therefore, we exploit
spoken data obtained from automatic transcripts and manual subtitles [4]
and visual concepts detected for each video [10]. More details about the
data can be found in [2]. After obtaining the topically focused fragments
we perform content analysis and comparisons to propose the top segments
for the two sub-tasks.
   Subsections 2.1–2.3 detail the following: 2.1, the generation of
potential anchor and query-response segments; 2.2, the selection of the
top 20 segments for each query; 2.3, the ranking of the anchor segments.

2.1    Hierarchy of topically focused segments
   Each video in the test collection is partitioned into a hierarchy of
topically focused fragments with the automatic segmentation algorithm
HTFF, which is domain independent, needs no a priori information and has
proven to offer a good representation of the information contained in
videos. It can be applied on any text-like data. For the search sub-task
we are provided also with the visual query, i.e., visual
concept words. For the search sub-task, LIMSI transcripts and the visual
concepts detected for each keyframe in the video were used, while for the
anchor detection sub-task LIMSI and manual transcripts were used. Before
applying the algorithm, data in the transcripts are first lemmatized and
only nouns, non-modal verbs and adjectives are kept.

                    P@5     P@10    P@20
          LIMSI     0.34    0.31    0.19
          Visual    0.12    0.11    0.06

Table 1: Precision values obtained for all proposed methods for the search
sub-task.

   The core of HTFF is Kleinberg’s algorithm [5], used to identify word
bursts together with the intervals where they occur. A burst interval
corresponds to a period where the word occurs with increased frequency
with respect to its normal behavior. Kleinberg’s algorithm outputs a
hierarchy of burst intervals for each word, taking one word at a time (for
more details see [5]). The HTFF algorithm generates a hierarchy of salient
topics using an agglomerative clustering of the burst intervals found with
Kleinberg’s algorithm. The result is a set of nested, hierarchically
organized, topically focused fragments. Next, we describe how the best
segments are proposed for each sub-task.
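
To make the notion of a burst interval more concrete, the sketch below implements a simplified two-state variant of the automaton described in [5]. It is not the HTFF implementation: the full algorithm uses more states to obtain a hierarchy of bursts, and the function names, the use of token positions as time stamps, and the parameter values `s=2.0` and `gamma=1.0` are assumptions made for illustration. Positions are assumed to be strictly increasing.

```python
import math

def two_state_bursts(times, s=2.0, gamma=1.0):
    """Label each gap between consecutive occurrences: 0 = baseline, 1 = burst.

    Simplified two-state automaton in the spirit of [5] (illustrative only).
    """
    gaps = [b - a for a, b in zip(times, times[1:])]
    n = len(gaps)
    if n == 0:
        return []
    base_rate = n / sum(gaps)
    rates = [base_rate, s * base_rate]   # state 1 emits occurrences s times faster
    up_cost = gamma * math.log(n)        # cost of entering the burst state

    def emit(state, gap):
        # Negative log-likelihood of an exponentially distributed gap.
        return rates[state] * gap - math.log(rates[state])

    cost = [0.0, float("inf")]           # the automaton starts in the baseline state
    back = []                            # back[t][state] = best previous state
    for gap in gaps:
        new_cost, ptr = [], []
        for state in (0, 1):
            # Moving up costs up_cost; staying or moving down is free.
            cands = [cost[prev] + (up_cost if state > prev else 0.0)
                     for prev in (0, 1)]
            best_prev = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best_prev] + emit(state, gap))
            ptr.append(best_prev)
        cost = new_cost
        back.append(ptr)

    # Backtrack the cheapest state sequence.
    state = 0 if cost[0] <= cost[1] else 1
    labels = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        labels.append(state)
    labels.reverse()
    return labels

def burst_intervals(times, labels):
    """Turn per-gap labels into (start, end) burst intervals."""
    intervals, start = [], None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = times[i]
        if label == 0 and start is not None:
            intervals.append((start, times[i]))
            start = None
    if start is not None:
        intervals.append((start, times[-1]))
    return intervals

# Hypothetical token positions of one lemma in a transcript.
positions = [0, 100, 200, 300, 301, 302, 303, 304, 305, 400, 500, 600]
print(burst_intervals(positions, two_state_bursts(positions)))
```

In this toy input the dense run of occurrences between positions 300 and 305 is the only stretch where paying the cost of entering the burst state is worthwhile. HTFF would then cluster such intervals across words, a step this sketch does not cover.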

Figure 1: Boxplots showing the variation in duration of the segments
proposed by other participants and by our systems (i.e., LIMSI and Visual).

                    Precision    Recall    MRR
          LIMSI     0.557        0.435     0.773
          Manual    0.469        0.38      0.735

Table 2: Precision, recall and MRR values obtained for all proposed methods
for the anchor detection sub-task.

2.2    Search sub-task
   A cosine similarity measure is computed between each query and the
content of the segments previously extracted. This measure is computed for
segments from all levels of the hierarchy, and segments with higher
similarity are ranked higher. In this setting, short, focused and highly
similar segments are favored. This procedure is applied to the textual and
visual queries independently.
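
The ranking step of the search sub-task can be illustrated with a minimal sketch. It assumes raw term-frequency vectors over lemmas, since the term weighting is not specified here. The segment identifiers, the toy word lists and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(words_a, words_b):
    """Cosine similarity between two bags of words (raw term frequencies)."""
    va, vb = Counter(words_a), Counter(words_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def rank_segments(query_words, segments, top_k=20):
    """Rank segments from all hierarchy levels by similarity to the query."""
    scored = [(cosine(query_words, words), seg_id)
              for seg_id, words in segments.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(seg_id, score) for score, seg_id in scored[:top_k]]

# Hypothetical segments taken from two levels of one hierarchy.
query = ["wine", "harvest"]
segments = {
    "v1/coarse": ["wine", "harvest", "weather", "news", "sport", "music"],
    "v1/fine": ["wine", "harvest", "wine"],
    "v2/fine": ["football", "match"],
}
for seg_id, score in rank_segments(query, segments):
    print(seg_id, round(score, 3))
```

The short, focused segment scores higher than the coarse segment that contains it, because the extra words in the coarse segment enlarge its norm without adding to the dot product. This is the preference for short segments noted above.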
2.3    Anchor selection sub-task
   Once the list of salient segments is available for every video from
which anchors must be extracted, we compute a cohesion measure to rank
these fragments. The measure is probabilistic: the lexical cohesion of a
segment $S_i$ is computed using a Laplace law as in [11], i.e.,

   \[ C(S_i) = \log \prod_{j=1}^{n_i} \frac{f_i(w_j^i) + 1}{n_i + k}, \]

where $n_i$ is the number of word occurrences in $S_i$, $f_i(w_j^i)$ is the
number of occurrences of the word $w_j^i$ in segment $S_i$ and $k$ is the
number of words in the vocabulary $V$. The quantity $C(S_i)$ increases when
words are repeated and decreases consistently when they are different.
Using HTFF for anchor detection does not guarantee that any fixed number
of anchor segments is found for a video. Therefore, some videos might have
more or fewer anchors proposed than others. This is realistic, since the
number of anchors that can be found in a video depends on the information
it contains.

3.    RESULTS
   For the search sub-task, 30 test set queries were defined. The top 10
results for each query were evaluated for each method, using
crowd-sourcing technologies. Our results for the search sub-task are
reported in Table 1. LIMSI denotes the system using LIMSI automatic
transcripts and the textual query, while Visual denotes the system relying
on visual concepts and the visual query. The best results are obtained
with the LIMSI system. Analyzing the list of all the segments proposed by
participants, it can be observed that the segments proposed with our
approach are shorter in duration. Figure 1 illustrates the duration of the
segments that were judged relevant or not with both our systems (LIMSI and
Visual) compared to those proposed by other participants. The segments we
proposed are on average less than half the size of the segments proposed
by other participants. This was detrimental to our approach. Some of the
short segments proposed with our methods are judged not relevant, while
long segments which cover these short segments are judged relevant.
However, many of our short segments do not overlap with longer segments
proposed by others, so in the end they remain judged as not relevant.
   For the anchor sub-task, a list of 33 videos was defined, for which
anchors had to be proposed. The top-25 ranks for each video and each
method were judged by crowd-sourcing using Amazon Mechanical Turk workers
who gave their opinion on these segments taken from the context of the
videos. Precision, recall and Mean Reciprocal Rank (MRR) measures have
been used. The results obtained for both our systems, LIMSI (using LIMSI
automatic transcripts) and Manual (using subtitles), are reported in
Table 2. The best results were obtained when relying on automatic
transcripts.

4.   CONCLUSION
   The results obtained on both sub-tasks show that while for anchor
detection short segments are a good idea, for the search sub-task,
assessors seem to need more context to judge a segment relevant. For
future work on the search sub-task we consider selecting larger segments
from a higher level in the hierarchy (i.e., coarse grain). Additionally,
combining visual and textual bursts could improve the results. For the
anchor detection task, different ways to rank the segments could be
considered, favoring segments which contain named entities or visual
bursts.
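
As an illustration of the cohesion measure of Section 2.3, the following sketch evaluates the log of the product, over the word occurrences of a segment, of (count of the word in the segment + 1) / (segment length + vocabulary size). It is a minimal reading of the formula, not the code used for the submitted runs. The function name, the toy segments and the vocabulary size of 10 are assumptions.

```python
import math
from collections import Counter

def cohesion(segment_words, vocab_size):
    """Laplace-smoothed lexical cohesion of one segment.

    segment_words: list of lemmas in the segment (n_i = its length).
    vocab_size:    k, the number of distinct words in the vocabulary.
    """
    n = len(segment_words)
    freq = Counter(segment_words)
    # The log of a product is computed as a sum of logs for numerical stability.
    return sum(math.log((freq[w] + 1) / (n + vocab_size))
               for w in segment_words)

# Hypothetical segments of equal length: one repetitive, one varied.
repetitive = ["wine", "wine", "wine", "wine"]
varied = ["wine", "sport", "news", "music"]
print(cohesion(repetitive, vocab_size=10))
print(cohesion(varied, vocab_size=10))
```

For two segments of the same length, the one that repeats its words obtains the higher value, which matches the behavior described for the measure. Candidate anchors can then be sorted by decreasing cohesion.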

5.   REFERENCES
 [1] C. Bhatt, N. Pappas, M. Habibi, and
     A. Popescu-Belis. IDIAP at MediaEval 2013: Search
     and hyperlinking task. In Proceedings of the
     MediaEval Workshop, 2013.
 [2] M. Eskevich, R. Aly, R. Ordelman, D. N. Racca,
     S. Chen, and G. J. F. Jones. SAVA at MediaEval
     2015: Search and anchoring in video archives. In
     Proceedings of the MediaEval 2015 Workshop, Wurzen,
     Germany, 2015.
 [3] P. Galuščáková and P. Pecina. CUNI at MediaEval
     2014 search and hyperlinking task: Search task
     experiments. In Proceedings of the MediaEval
     Workshop, 2014.
 [4] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI
     broadcast news transcription system. Speech
     Communication, 37(1-2):89–108, 2002.
 [5] J. Kleinberg. Bursty and hierarchical structure in
     streams. In 8th ACM SIGKDD International Conf. on
     Knowledge Discovery and Data Mining, pages 91–101,
     2002.
 [6] R. E. Madsen, D. Kauchak, and C. Elkan. Modeling
     word burstiness using the Dirichlet distribution. In
     22nd International Conference on Machine Learning,
     ICML ’05, pages 545–552, New York, NY, USA, 2005.
     ACM.
 [7] D. N. Racca, M. Eskevich, and G. J. F. Jones. DCU
     search runs at MediaEval 2014 search and
     hyperlinking. In Proceedings of the MediaEval 2014
     Workshop, Barcelona, Spain, 2014.
 [8] A. Sarkar, P. H. Garthwaite, and A. De Roeck. A
     Bayesian mixture model for term re-occurrence and
     burstiness. In 9th Conference on Computational
     Natural Language Learning, CONLL ’05, pages 48–55.
     Association for Computational Linguistics, 2005.
 [9] A.-R. Simon, P. Sébillot, and G. Gravier. Hierarchical
     topic structuring: from dense segmentation to
     topically focused fragments via burst analysis. In
     Recent Advances in NLP, Hissar, Bulgaria, 2015.
[10] T. Tommasi, R. B. N. Aly, K. McGuinness,
     K. Chatfield, et al. Beyond metadata: searching
     your archive based on its audio-visual content. In
     International Broadcasting Convention, 2014.
[11] M. Utiyama and H. Isahara. A statistical model for
     domain-independent text segmentation. In Association
     for Computational Linguistics, Toulouse, France, 2001.