<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IRISA at MediaEval 2015: Search and Anchoring in Video Archives Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anca-Roxana Şimon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascale Sébillot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRISA, INRIA Rennes, Univ. Rennes 1</institution>
          <email>firstname.lastname@irisa.fr</email>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents our approach and results for the Search and Anchoring in Video Archives task at MediaEval 2015. The search part aims at returning a ranked list of video segments that are relevant to a textual user query. The anchoring part focuses on the automatic selection of video segments, from a list of videos, that can be used as anchors to encourage further exploration within the archive. Our approach consists in structuring each video into a hierarchy of topically focused fragments, so as to extract salient segments at different levels of detail with precise jump-in points. These segments are leveraged both to answer the queries and to create anchor segments, relying on content-based analysis and comparisons. The algorithm deriving the hierarchical structure relies on the burstiness phenomenon in word occurrences, which gives it an advantage over the classical bag-of-words model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This paper presents the participation of IRISA at the
MediaEval 2015 Search and Anchoring in Video Archives
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The first part of the task is a search scenario in a
video collection. Starting from a two-field query, where one
field refers to the spoken content and the other to the
visual content, the goal is to retrieve parts of the archived
videos that contain the requested information (audio or
visual). The second part consists in automatically selecting
video segments, called anchors, for which users could
require additional information to explore the archive. The
solutions should help users find relevant information in
the archive and also improve the browsing and navigation
experience.
      </p>
      <p>
        Our approach consists in structuring each video as a
hierarchy of topically focused segments using the algorithm
proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This structure helps to extract segments
with precise jump-in points and at various levels of detail.
Once extracted, the segments are used to select anchor
segments and to answer the queries issued by users. For this
we rely on content-based analysis and comparisons. The
segment extraction step is usually done using fixed-length
segmentation [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ] or linear topic segmentation strategies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The advantage of using the hierarchy of topically focused
segments over traditional solutions is that it helps to
identify the salient (i.e., important) information in the videos,
skipping irrelevant information. Indeed, some fragments
of the data bear important ideas while others are simple
fillers, i.e., they do not bring additional important
information. Moreover, with a hierarchical representation, the
segments we provide as results can be at different
granularities, i.e., more specific or more general, offering different
levels of detail. Anchors that cover a more general topic
or different points of view on some topic can be selected,
while for the search part the results retrieved can offer a
general perspective or a more focused one. The algorithm is
built upon the burstiness phenomenon in word occurrences:
in practice, words tend to appear in bursts (i.e., if a word
appears once it is more likely to appear again) rather than
independently [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Several studies of statistical laws in
language have proposed burst detection models that analyze the
distributional patterns of words [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. We believe that such
an approach brings more focus to what is extracted from the
videos, and not only to the content-based comparison and
analysis step.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>
        The aim of our approach is first to find precise jump-in
points to the salient segments in the videos, at various
levels of detail. These segments are obtained by applying the
algorithm proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (denoted HTFF), which outputs
a hierarchy of topically focused fragments for each video.
HTFF relies on text-like data. Therefore, we exploit
spoken data obtained from automatic transcripts and manual
subtitles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and visual concepts detected for each video [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
More details about the data can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. After
obtaining the topically focused fragments, we perform content
analysis and comparisons to propose the top segments for
the two sub-tasks.
      </p>
      <p>Subsections 2.1–2.3 detail the following: 2.1, the
generation of potential anchor and query-response segments; 2.2,
the selection of the top 20 segments for each query; and 2.3, the
ranking of the anchor segments.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Hierarchy of topically focused segments</title>
      <p>Each video in the test collection is partitioned into a
hierarchy of topically focused fragments with the automatic
segmentation algorithm HTFF, which is domain
independent, needs no a priori information and has proven to offer a
good representation of the information contained in videos.
It can be applied to any text-like data. For the search
sub-task we are also provided with a visual query, i.e., visual
concept words; here, LIMSI transcripts
and the visual concepts detected for each keyframe in the
video were used. For the anchor detection sub-task,
LIMSI and manual transcripts were used. Before applying the
algorithm, the data in the transcripts are first lemmatized and
only nouns, non-modal verbs and adjectives are kept.</p>
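      <p>As an illustration, this filtering step could look as follows, assuming a tagger/lemmatizer has already produced (lemma, POS) pairs; the tag set and the modal-verb list are assumptions for the sketch, not the exact tools used here.</p>
      <preformat>
```python
# Hypothetical preprocessing: keep lemmas of nouns, non-modal verbs and adjectives.
MODALS = {"can", "could", "may", "might", "shall", "should", "will", "would", "must"}
KEPT_POS = {"NOUN", "VERB", "ADJ"}

def filter_tokens(tagged):
    """`tagged` is a list of (lemma, pos) pairs from any POS tagger/lemmatizer."""
    return [lemma for lemma, pos in tagged
            if pos in KEPT_POS and not (pos == "VERB" and lemma in MODALS)]

tokens = filter_tokens([("election", "NOUN"), ("be", "VERB"),
                        ("can", "VERB"), ("new", "ADJ"), ("the", "DET")])
# "can" is dropped as a modal verb, "the" as a determiner.
```
      </preformat>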
      <p>
        The core of HTFF is Kleinberg's algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used to
identify word bursts, together with the intervals where they
occurred. A burst interval corresponds to a period where
the word occurs with increased frequency with respect to the
normal behavior. Kleinberg's algorithm outputs a hierarchy
of burst intervals for each word, taking one word at a time
(for more details see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). The HTFF algorithm generates a
hierarchy of salient topics using an agglomerative clustering
of burst intervals found with Kleinberg's algorithm. The
result is a set of nested topically focused fragments which
are hierarchically organized. Next, we describe how the best
segments are proposed for each sub-task.
      </p>
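      <p>Kleinberg's algorithm models bursts with a two-state automaton and outputs nested burst intervals per word. As a rough illustration of the underlying idea only (not the actual algorithm), the sketch below flags windows where a word's local occurrence rate clearly exceeds its global rate; the window size and threshold factor are arbitrary choices for the example.</p>
      <preformat>
```python
def burst_intervals(positions, n_tokens, window=50, factor=2.0):
    """Toy burst detector: flag fixed-size windows where a word's local rate
    exceeds `factor` times its global rate, then merge contiguous flagged
    windows into intervals. A crude stand-in for Kleinberg's state machine.
    `positions` are the token indices at which the word occurs."""
    global_rate = len(positions) / n_tokens
    flagged = []
    for start in range(0, n_tokens, window):
        hits = sum(start <= p < start + window for p in positions)
        if hits / window > factor * global_rate:
            flagged.append((start, min(start + window, n_tokens)))
    merged = []
    for s, e in flagged:  # merge adjacent flagged windows into one interval
        if merged and merged[-1][1] == s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# A word occurring 6 times in the first 26 tokens of a 500-token transcript
# yields a single burst interval covering the first window.
intervals = burst_intervals([0, 5, 10, 15, 20, 25], n_tokens=500)
```
      </preformat>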
    </sec>
    <sec id="sec-4">
      <title>2.2 Search sub-task</title>
      <p>A cosine similarity measure is computed between each
query and the content of the segments previously extracted.
This measure is computed for segments from all levels of
the hierarchy, and segments with higher similarity are ranked
higher. In this setting, short, focused and highly similar
segments are favored. This procedure is applied to the textual
and visual queries independently.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Anchor selection sub-task</title>
      <p>
        Given the list of salient segments for every video
from which anchors need to be extracted, we compute a
cohesion measure to rank these fragments. The measure is
a probabilistic one, where the lexical cohesion of a segment Si
is computed using a Laplace law as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], i.e.,
C(Si) = log ∏_{j=1}^{ni} (fi(wji) + 1) / (ni + k),
      </p>
      <p>where ni is the number of word occurrences in Si, fi(wji) is
the number of occurrences of the word wji in segment Si and
k is the number of words in the vocabulary V. The quantity C(Si) increases
when words are repeated and decreases when
they are different. Using HTFF for anchor detection does
not guarantee any particular number of anchor segments
for a video. Therefore, some videos might have more or fewer
anchors proposed than others. This is realistic, since the
number of anchors that can be found in a video depends on
the information it contains.</p>
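      <p>A minimal sketch of this cohesion score, assuming tokenized segments and a known vocabulary size k; the log of the product is computed as a sum of logs for numerical stability. The toy segments are invented for the example.</p>
      <preformat>
```python
from collections import Counter
from math import log

def cohesion(tokens, vocab_size):
    """Laplace-smoothed lexical cohesion of a segment:
    C(Si) = log prod_j (fi(wj) + 1) / (ni + k), computed as a sum of logs."""
    n = len(tokens)          # ni: number of word occurrences in the segment
    freqs = Counter(tokens)  # fi(w): occurrences of each word in the segment
    return sum(log((freqs[w] + 1) / (n + vocab_size)) for w in tokens)

# A repetitive segment scores higher than one whose words are all distinct.
k = 1000
repetitive = cohesion(["budget"] * 4, k)
diverse = cohesion(["budget", "rain", "vote", "goal"], k)
```
      </preformat>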
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS</title>
      <p>For the search sub-task, 30 test set queries were defined.
The top 10 results for each query were evaluated for each
method, using crowd-sourcing technologies. Our results for
the search sub-task are reported in Table 1. LIMSI denotes
the system using LIMSI automatic transcripts and the textual
query, while Visual denotes the system relying on visual
concepts and the visual query. The best results are obtained with
the LIMSI system. Analyzing the list of all the segments
proposed by participants, it can be observed that the
segments proposed with our approach are shorter in duration.
Figure 1 illustrates the duration of the segments that were
judged relevant or not with both our systems (LIMSI and
Visual) compared to those proposed by other participants.</p>
      <p>[Table 1. Search sub-task results; P@5: LIMSI 0.34, Visual 0.12.]</p>
      <p>
The segments we proposed are on average less than half the
length of the segments proposed by other participants. This
was detrimental to our approach: some of the short
segments proposed with our methods were judged not relevant,
while longer segments covering these short segments were
judged relevant. However, many of our short segments do
not overlap with longer segments proposed by others, so in
the end they remain judged as not relevant.</p>
      <p>For the anchor sub-task, a list of 33 videos was defined, for
which anchors had to be proposed. The top-25 ranks for each
video and each method were judged by crowd-sourcing, using
Amazon Mechanical Turk workers who gave their opinion on
these segments taken from the context of the videos.
Precision, recall and Mean Reciprocal Rank (MRR) measures
were used. The results obtained for both our systems,
LIMSI (using LIMSI automatic transcripts) and Manual
(using manual subtitles), are reported in Table 2. The best results were
obtained when relying on automatic transcripts.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>The results obtained on both sub-tasks show that while
short segments are a good idea for anchor detection, for
the search sub-task assessors seem to need more context to
judge a segment relevant. For future work on the search
sub-task we consider selecting larger segments from a higher level
in the hierarchy (i.e., coarse grain). Additionally, combining
visual and textual bursts could improve the results. For the
anchor detection task, different ways to rank the segments
could be considered, favoring segments which contain named
entities or visual bursts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pappas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Habibi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu-Belis</surname>
          </string-name>
          . IDIAP at MediaEval 2013:
          <article-title>Search and hyperlinking task</article-title>
          .
          <source>In Proceedings of the MediaEval Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          . SAVA at MediaEval 2015:
          <article-title>Search and anchoring in video archives</article-title>
          .
          <source>In Proceedings of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          . CUNI at
          <article-title>MediaEval 2014 search and hyperlinking task: Search task experiments</article-title>
          .
          <source>In Proceedings of the MediaEval Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Adda</surname>
          </string-name>
          .
          <article-title>The LIMSI broadcast news transcription system</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>37</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>89</fpage>
          –
          <lpage>108</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Bursty and hierarchical structure in streams</article-title>
          .
          <source>In 8th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>91</fpage>
          –
          <lpage>101</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Madsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kauchak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          .
          <article-title>Modeling word burstiness using the Dirichlet distribution</article-title>
          .
          <source>In 22nd International Conference on Machine Learning, ICML '05</source>
          , pages
          <fpage>545</fpage>
          –
          <lpage>552</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>DCU search runs at MediaEval 2014 search and hyperlinking</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Garthwaite</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>De Roeck</surname>
          </string-name>
          .
          <article-title>A Bayesian mixture model for term re-occurrence and burstiness</article-title>
          .
          <source>In 9th Conference on Computational Natural Language Learning</source>
          ,
          <source>CONLL '05</source>
          , pages
          <fpage>48</fpage>
          –
          <lpage>55</lpage>
          . Association for Computational Linguistics,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.-R.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sebillot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          .
          <article-title>Hierarchical topic structuring: from dense segmentation to topically focused fragments via burst analysis</article-title>
          .
          <source>In Recent Advances in NLP, Hissar, Bulgaria</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tommasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B. N.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatfield</surname>
          </string-name>
          , et al.
          <article-title>Beyond metadata: searching your archive based on its audio-visual content</article-title>
          .
          <source>In International Broadcasting Convention</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Utiyama</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          .
          <article-title>A statistical model for domain-independent text segmentation</article-title>
          .
          <source>In Association for Computational Linguistics</source>
          , Toulouse, France,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>