IRISA at MediaEval 2015: Search and Anchoring in Video Archives Task

Anca-Roxana Şimon*, Guillaume Gravier**, Pascale Sébillot***
IRISA & INRIA Rennes
Univ. Rennes 1*, CNRS**, INSA***
35042 Rennes Cedex, France
firstname.lastname@irisa.fr

ABSTRACT
This paper presents our approach and results in the Search and Anchoring in Video Archives task at MediaEval 2015. The Search part aims at returning a ranked list of video segments that are relevant to a textual user query. The Anchoring part focuses on the automatic selection of video segments, from a list of videos, that can be used as anchors to encourage further exploration within the archive. Our approach consists in structuring each video into a hierarchy of topically focused fragments, in order to extract salient segments at different levels of detail, each with a precise jump-in point. These segments are leveraged both to answer the queries and to create anchor segments, relying on content-based analysis and comparisons. The algorithm deriving the hierarchical structure relies on the burstiness phenomenon in word occurrences, which gives an advantage over the classical bag-of-words model.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. INTRODUCTION
This paper presents the participation of IRISA at the MediaEval 2015 Search and Anchoring in Video Archives task [2]. The first part of the task is a search scenario in a video collection. Starting from a two-field query, where one field refers to the spoken content and the other refers to the visual content, the goal is to retrieve parts of the archived videos that contain the requested information (audio or visual). The second part consists in automatically selecting video segments, called anchors, for which users could require additional information to explore the archive. The solutions should help users find relevant information in the archive and improve the browsing and navigation experience.

Our approach consists in structuring each video as a hierarchy of topically focused segments using the algorithm proposed in [9]. This structure helps to extract segments with precise jump-in points and at various levels of detail. Once extracted, the segments are used to select anchor segments and to answer the queries issued by users. For this we rely on content-based analysis and comparisons. The segment extraction step is usually done using fixed-length segmentation [3, 7] or linear topic segmentation strategies [1]. The advantage of using the hierarchy of topically focused segments over these traditional solutions is that it helps to identify the salient (i.e., important) information in the videos, skipping irrelevant information. Indeed, some fragments of the data bear important ideas while others are simple fillers, i.e., they do not bring additional important information. Moreover, thanks to the hierarchical representation, the segments we provide as results can be at different granularities, i.e., more specific or more general, offering different levels of detail. Anchors that cover a more general topic or different points of view on some topic can be selected, while for the search part the retrieved results can offer either a general perspective or a more focused one. The algorithm is built upon the burstiness phenomenon in word occurrences. In practice, words tend to appear in bursts (i.e., if a word appears once it is more likely to appear again) instead of independently [6]. Several studies of statistical laws in language have proposed burst detection models that analyze the distributional patterns of words [8, 6]. We believe that such an approach brings more focus to what is extracted from the videos, and not only to the content-based comparisons and analysis part.

2. SYSTEM OVERVIEW
The aim of our approach is first to find precise jump-in points to the salient segments in the videos, at various levels of detail. These segments are obtained by applying the algorithm proposed in [9] (denoted HTFF hereafter), which outputs a hierarchy of topically focused fragments for each video. HTFF relies on text-like data. Therefore, we exploit spoken data obtained from automatic transcripts and manual subtitles [4] and visual concepts detected for each video [10]. More details about the data can be found in [2]. After obtaining the topically focused fragments, we perform content analysis and comparisons to propose the top segments for the two sub-tasks.

Subsections 2.1–2.3 detail the following: 2.1, the generation of potential anchor and query-response segments; 2.2, the selection of the top 20 segments for each query; 2.3, the ranking of the anchor segments.

2.1 Hierarchy of topically focused segments
Each video in the test collection is partitioned into a hierarchy of topically focused fragments with the automatic segmentation algorithm HTFF, which is domain independent, needs no a priori information and has proven to offer a good representation of the information contained in videos. It can be applied to any text-like data. For the search sub-task we are also provided with the visual query, i.e., visual concept words. For the search sub-task, LIMSI transcripts and the visual concepts detected for each keyframe in the video were used, while for the anchor detection sub-task LIMSI and manual transcripts were used. Before applying the algorithm, data in the transcripts are first lemmatized and only nouns, non-modal verbs and adjectives are kept.

          P@5    P@10   P@20
LIMSI     0.34   0.31   0.19
Visual    0.12   0.11   0.06

Table 1: Precision values obtained for all proposed methods for the search sub-task.

The core of HTFF is Kleinberg's algorithm [5], used to identify word bursts together with the intervals where they occurred.
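To make the burst detection step concrete, the following is a minimal sketch of a two-state variant of Kleinberg's automaton [5], applied to the positions at which a single lemma occurs in a transcript. It is a simplification for illustration only, not the HTFF implementation: the full algorithm uses a multi-level automaton and returns a hierarchy of bursts, whereas this sketch only separates a base state from one burst state. The function name `two_state_bursts` and the default values of `s` (rate scaling) and `gamma` (cost of entering the burst state) are assumptions made for this example.

```python
import math
from typing import List, Tuple


def two_state_bursts(positions: List[int], s: float = 2.0,
                     gamma: float = 1.0) -> List[Tuple[int, int]]:
    """Return (start, end) position intervals where one word is bursty.

    Hypothetical helper: a two-state simplification of Kleinberg's model.
    `positions` are the word indices at which the lemma occurs.
    """
    positions = sorted(set(positions))
    if len(positions) < 3:
        return []  # too few occurrences to compare rates
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    n = len(gaps)
    base_rate = n / sum(gaps)             # state 0: average occurrence rate
    rates = (base_rate, s * base_rate)    # state 1: s times more frequent
    switch_cost = gamma * math.log(n)     # paid when moving from state 0 to 1

    def gap_cost(state: int, gap: int) -> float:
        # Negative log-likelihood of the gap under an exponential law.
        return rates[state] * gap - math.log(rates[state])

    # Viterbi search for the cheapest state sequence, starting in state 0.
    cost = [0.0, math.inf]
    back = []
    for gap in gaps:
        new_cost, pointers = [], []
        for state in (0, 1):
            candidates = [cost[prev] + (switch_cost if state > prev else 0.0)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] <= candidates[1] else 1
            new_cost.append(candidates[best_prev] + gap_cost(state, gap))
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Backtrack to recover the state assigned to each gap.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()

    # Merge consecutive burst-state gaps into position intervals.
    intervals, start = [], None
    for t, st in enumerate(states):
        if st == 1 and start is None:
            start = positions[t]
        elif st == 0 and start is not None:
            intervals.append((start, positions[t]))
            start = None
    if start is not None:
        intervals.append((start, positions[-1]))
    return intervals


# A word seen every ~100 tokens, except for a dense run around 300-304.
print(two_state_bursts([0, 100, 200, 300, 301, 302, 303, 304, 400, 500, 600]))
# [(300, 304)]
```

Raising `gamma` makes the sketch more reluctant to open a burst, while raising `s` requires occurrences to be denser before a burst is reported; evenly spaced occurrences produce no burst at all.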
A burst interval corresponds to a period where the word occurs with increased frequency with respect to its normal behavior. Kleinberg's algorithm outputs a hierarchy of burst intervals for each word, taking one word at a time (for more details see [5]). The HTFF algorithm generates a hierarchy of salient topics using an agglomerative clustering of the burst intervals found with Kleinberg's algorithm. The result is a set of nested, hierarchically organized, topically focused fragments. Next, we describe how the best segments are proposed for each sub-task.

2.2 Search sub-task
A cosine similarity measure is computed between each query and the content of the segments obtained in the previous step. This measure is computed for segments from all levels in the hierarchy, and segments are ranked by decreasing similarity. In this setting, short, focused and highly similar segments are favored. This procedure is applied to the textual and the visual query independently.

2.3 Anchor selection sub-task
Once we have the list of salient segments for every video from which anchors need to be extracted, we compute a cohesion measure to rank these fragments. The measure is a probabilistic one, where the lexical cohesion of a segment $S_i$ is computed using a Laplace law as in [11], i.e.,

$$C(S_i) = \log \prod_{j=1}^{n_i} \frac{f_i(w_j^i) + 1}{n_i + k},$$

where $n_i$ is the number of word occurrences in $S_i$, $f_i(w_j^i)$ is the number of occurrences of the word $w_j^i$ in segment $S_i$ and $k$ is the number of words in the vocabulary $V$. The quantity $C(S_i)$ increases when words are repeated and decreases consistently when they are different. Using HTFF for anchor detection does not guarantee that any particular number of anchor segments is found for a video. Therefore, some videos may have more or fewer anchors proposed than others. This is realistic, since the number of anchors that can be found in a video depends on the information it contains.

3. RESULTS
For the search sub-task, 30 test set queries were defined. The top 10 results for each query were evaluated for each method, using crowd-sourcing technologies. Our results for the search sub-task are reported in Table 1. LIMSI denotes the system using LIMSI automatic transcripts and the textual query, while Visual denotes the system relying on visual concepts and the visual query. The best results are obtained with the LIMSI system. Analyzing the list of all the segments proposed by participants, it can be observed that the segments proposed with our approach are shorter in duration. Figure 1 illustrates the duration of the segments that were judged relevant or not with both our systems (LIMSI and Visual) compared to those proposed by other participants. The segments we proposed are on average less than half the size of the segments proposed by other participants. This was detrimental to our approach. Some of the short segments proposed with our methods are judged not relevant, while long segments which cover these short segments are judged relevant. However, many of our short segments do not overlap with longer segments proposed by others, so in the end they remain judged as not relevant.

Figure 1: Boxplots showing the variation in duration of the segments proposed by other participants and by our systems (i.e., LIMSI and Visual).

For the anchor sub-task, a list of 33 videos was defined, for which anchors had to be proposed. The top-25 ranks for each video and each method were judged by crowd-sourcing using Amazon Mechanical Turk workers, who gave their opinion on these segments taken from the context of the videos. Precision, recall and Mean Reciprocal Rank (MRR) measures have been used. The results obtained for both our systems, LIMSI (using LIMSI automatic transcripts) and Manual (using subtitles), are reported in Table 2. The best results were obtained when relying on automatic transcripts.

          Precision   Recall   MRR
LIMSI     0.557       0.435    0.773
Manual    0.469       0.38     0.735

Table 2: Precision, recall and MRR values obtained for all proposed methods for the anchor detection sub-task.

4. CONCLUSION
The results obtained on both sub-tasks show that, while short segments are a good idea for anchor detection, assessors seem to need more context to judge a segment relevant in the search sub-task. For future work on the search sub-task we consider selecting larger segments from a higher level in the hierarchy (i.e., coarse grain). Additionally, combining visual and textual bursts could improve the results. For the anchor detection task, different ways to rank the segments could be considered, favoring segments which contain named entities or visual bursts.

5. REFERENCES
[1] C. Bhatt, N. Pappas, M. Habibi, and A. Popescu-Belis. IDIAP at MediaEval 2013: Search and hyperlinking task. In Proceedings of the MediaEval Workshop, 2013.
[2] M. Eskevich, R. Aly, R. Ordelman, D. N. Racca, S. Chen, and G. J. F. Jones. SAVA at MediaEval 2015: Search and anchoring in video archives. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[3] P. Galuščáková and P. Pecina. CUNI at MediaEval 2014 search and hyperlinking task: Search task experiments. In Proceedings of the MediaEval Workshop, 2014.
[4] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1-2):89–108, 2002.
[5] J. Kleinberg. Bursty and hierarchical structure in streams. In 8th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 91–101, 2002.
[6] R. E. Madsen, D. Kauchak, and C. Elkan. Modeling word burstiness using the Dirichlet distribution. In 22nd International Conference on Machine Learning, ICML '05, pages 545–552, New York, NY, USA, 2005. ACM.
[7] D. N. Racca, M. Eskevich, and G. J. F. Jones. DCU search runs at MediaEval 2014 search and hyperlinking. In Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, 2014.
[8] A. Sarkar, P. H. Garthwaite, and A. De Roeck. A Bayesian mixture model for term re-occurrence and burstiness. In 9th Conference on Computational Natural Language Learning, CONLL '05, pages 48–55. Association for Computational Linguistics, 2005.
[9] A.-R. Simon, P. Sébillot, and G. Gravier. Hierarchical topic structuring: from dense segmentation to topically focused fragments via burst analysis. In Recent Advances in NLP, Hissar, Bulgaria, 2015.
[10] T. Tommasi, R. B. N. Aly, K. McGuinness, K. Chatfield, et al. Beyond metadata: searching your archive based on its audio-visual content. In International Broadcasting Convention, 2014.
[11] M. Utiyama and H. Isahara. A statistical model for domain-independent text segmentation. In Association for Computational Linguistics, Toulouse, France, 2001.