=Paper=
{{Paper
|id=Vol-1263/paper13
|storemode=property
|title=LinkedTV at MediaEval 2014 Search and Hyperlinking Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_13.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LeBHCBAMPMSES14
}}
==LinkedTV at MediaEval 2014 Search and Hyperlinking Task==
H.A. Le (1), Q.M. Bui (1), B. Huet (1), B. Červenková (2), J. Bouchner (2), E. Apostolidis (3), F. Markatopoulou (3), A. Pournaras (3), V. Mezaris (3), D. Stein (4), S. Eickeler (4), and M. Stadtschnitzer (4)

(1) Eurecom, Sophia Antipolis, France. huet@eurecom.fr
(2) University of Economics, Prague, Czech Republic. barbora.cervenkova@vse.cz
(3) Information Technologies Institute, CERTH, Thessaloniki, Greece. bmezaris@iti.gr
(4) Fraunhofer IAIS, Sankt Augustin, Germany. daniel.stein@iais.fraunhofer.de

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT

The paper presents the LinkedTV approaches for the Search and Hyperlinking (S&H) task at MediaEval 2014. Our submissions aim at evaluating two key dimensions: the temporal granularity and the visual properties of the video segments. The temporal granularity of target video segments is defined by grouping text sentences, or consecutive automatically detected shots, considering the temporal coherence, the visual similarity and the lexical cohesion among them. Visual properties are combined with text search results using multimodal fusion for re-ranking. Two alternative methods are proposed to identify which visual concepts are relevant to each query: WordNet similarity or Google Image analysis. For Hyperlinking, relevant visual concepts are identified by analysing the video anchor.

1. INTRODUCTION

This paper describes the framework used by the LinkedTV team to tackle the problem of Search and Hyperlinking inside a video collection [3]. The applied techniques originate from the LinkedTV project (http://www.linkedtv.eu/), which aims at integrating TV and Web documents by enabling users to access additional information and media resources aggregated from diverse sources, thanks to automatic media annotation. Our media annotation process is the following. Shot segmentation is performed using a variation of [1], while the selected keyframes (one per shot) are analysed by visual concept detection [9] and Optical Character Recognition (OCR) [11] techniques. For each video, keywords are extracted from the subtitles, based on the algorithm presented in [12]. Finally, video shots are grouped into longer segments (scenes) based on two hierarchical clustering strategies. Media annotations are indexed at two levels (video level and scene level) using the Apache Solr platform (http://lucene.apache.org/solr/). At the video level, document descriptions are limited to text (title, subtitle, keywords, etc.), while the scene-level documents are characterized by both text (subtitle/transcript, keywords, OCR, etc.) and float fields, each float field corresponding to a unique visual concept response.
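To make the two-level indexing concrete, the sketch below pushes one scene-level document through Solr's JSON update API, with text fields plus one float field per visual concept. This is a minimal illustration, not the project's actual ingestion code: the Solr URL, core name, field names and the `*_f` dynamic float-field convention are all assumptions.

```python
import requests

SOLR_SCENE_CORE = "http://localhost:8983/solr/scenes"  # assumed Solr URL and core name


def index_scene(scene_id, video_id, start_sec, end_sec, subtitle, keywords, ocr, concept_scores):
    """Index one scene-level document: text fields plus one float field per visual concept.

    `concept_scores` maps a concept label (e.g. 'airplane') to a detector score in [0, 1].
    The '_f' suffix assumes a dynamic float field is defined in the Solr schema.
    """
    doc = {
        "id": scene_id,
        "videoId": video_id,
        "start_sec": start_sec,
        "end_sec": end_sec,
        "subtitle": subtitle,
        "keywords": keywords,  # list of keyword strings
        "ocr": ocr,
    }
    for concept, score in concept_scores.items():
        doc["concept_%s_f" % concept] = float(score)

    # Solr's /update handler accepts a JSON array of documents.
    resp = requests.post(
        SOLR_SCENE_CORE + "/update",
        json=[doc],
        params={"commit": "true"},
    )
    resp.raise_for_status()


if __name__ == "__main__":
    index_scene(
        "v001_scene_004", "v001", 120.4, 185.2,
        "... subtitle text of the scene ...",
        ["election", "parliament"], "BBC NEWS",
        {"person": 0.92, "studio": 0.81, "airplane": 0.03},
    )
```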
1.1 Temporal Granularity

Three temporal granularities are evaluated. The first, termed Text-Segment, consists in grouping together sentences (up to 40) from the text sources. We also propose to segment videos into scenes, which consist of semantically correlated adjacent shots. Two strategies are employed to create scene-level temporal segments. Visually similar adjacent shots are merged together to create Visual-scenes [10] (a simplified illustration of this grouping is sketched after Section 1.2), while Topic-scenes are built by jointly considering the results of the visual scene segmentation and text-based topical cohesion (exploiting text extracted from ASR transcripts or subtitles).

1.2 Visual Properties

In MediaEval S&H 2014, queries are composed of a few keywords only (visual cues are not provided). Hence, the identification of relevant visual concepts is more complex than last year. We propose two alternative solutions to this problem. On the one hand, WordNet similarity is employed to map visual concepts to query terms [8]. On the other hand, the query terms are used to perform a Google Image search. Visual concept detection (using 151 concepts from the TRECVID SIN task [6]) is performed on the first 100 returned images, and the concepts obtaining the highest average scores are selected.
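The Visual-scene segmentation itself follows [10]; purely as an illustration of the idea behind Section 1.1 (merging visually similar adjacent shots), a much-simplified greedy sketch is given below. The cosine measure, the descriptor format and the threshold are assumptions and not the parameters of [10].

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two keyframe descriptors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def group_shots_into_scenes(shot_descriptors, sim_threshold=0.8):
    """Greedily merge consecutive shots into scenes.

    `shot_descriptors` is a list of 1-D numpy arrays, one visual descriptor per shot
    keyframe, in temporal order. A new scene is started whenever the current shot is
    not similar enough to the last shot of the current scene. The threshold is an
    illustrative value, not the one used in the actual system.
    """
    scenes = [[0]]
    for i in range(1, len(shot_descriptors)):
        prev = shot_descriptors[scenes[-1][-1]]
        if cosine(prev, shot_descriptors[i]) >= sim_threshold:
            scenes[-1].append(i)   # extend the current scene
        else:
            scenes.append([i])     # start a new scene
    return scenes  # list of lists of shot indices


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = [rng.random(64) for _ in range(10)]
    print(group_shots_into_scenes(demo, sim_threshold=0.9))
```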
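For the WordNet-based strategy of Section 1.2, a minimal sketch with NLTK is given below. The illustrative concept subset, the use of path similarity and the selection threshold are assumptions; the submitted runs follow [8], which may use a different similarity measure and parameters.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# A handful of TRECVID SIN concept labels, used here only as an illustrative subset
# of the 151 concepts mentioned in the paper.
CONCEPTS = ["airplane", "beach", "car", "dog", "kitchen", "mountain"]


def max_similarity(term, concept):
    """Highest WordNet path similarity over all noun synset pairs of the two words."""
    best = 0.0
    for s1 in wn.synsets(term, pos=wn.NOUN):
        for s2 in wn.synsets(concept, pos=wn.NOUN):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best


def relevant_concepts(query_terms, threshold=0.3):
    """Map query terms to visual concepts whose similarity exceeds an illustrative threshold."""
    scored = {}
    for concept in CONCEPTS:
        score = max(max_similarity(t, concept) for t in query_terms)
        if score >= threshold:
            scored[concept] = score
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    print(relevant_concepts(["puppy", "seaside"]))
```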
2. SEARCH SUB-TASK

2.1 Text-based methods

In this approach, relevant text and video segments are retrieved with Solr using text (TXT) only. Two strategies are compared: one where the search is performed directly at the text-segment level (S), and one where the first 50 videos are retrieved at the video level and the relevant video segment is then located using the scene-level index (this two-stage strategy is sketched after Section 2.2). The scene-level index granularity is either the Visual-Scene (VS) or the Topic-Scene (TS). Scenes at both granularities are characterized by textual information only, either the subtitle (M) or one of the three ASR transcripts: (U) LIUM [7], (I) LIMSI [4], (S) NST/Sheffield [5].

2.2 Multimodal Fusion method

Motivated by [8], visual concept scores are fused with text-based results from Solr to perform re-ranking. The relevant visual concepts for each query, out of the 151 available, are identified using either the WordNet (WN) or the Google Image (GI) strategy. For these multimodal (MM) runs, only the visual-scene (VS) segmentation is evaluated.
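A minimal sketch of the two-stage strategy of Section 2.1 (video-level retrieval followed by scene-level localisation) against Solr's select handler is given below. Core names, field names and query-parsing parameters are assumptions for illustration; only the two-stage structure reflects the description above.

```python
import requests

SOLR = "http://localhost:8983/solr"  # assumed Solr base URL


def search_two_stage(query, n_videos=50, n_scenes=20):
    """Retrieve the top videos first, then locate relevant scenes inside those videos."""
    # Stage 1: video-level index (title, subtitle, keywords, ...).
    videos = requests.get(
        SOLR + "/videos/select",
        params={"q": query, "defType": "edismax",
                "qf": "title subtitle keywords",
                "rows": n_videos, "fl": "id", "wt": "json"},
    ).json()["response"]["docs"]
    video_ids = [d["id"] for d in videos]
    if not video_ids:
        return []

    # Stage 2: scene-level index, restricted to the retrieved videos.
    fq = "videoId:(%s)" % " OR ".join('"%s"' % v for v in video_ids)
    scenes = requests.get(
        SOLR + "/scenes/select",
        params={"q": query, "defType": "edismax",
                "qf": "subtitle keywords ocr", "fq": fq,
                "rows": n_scenes,
                "fl": "id,videoId,start_sec,end_sec,score", "wt": "json"},
    ).json()["response"]["docs"]
    return scenes


if __name__ == "__main__":
    for hit in search_two_stage("rock climbing"):
        print(hit)
```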
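For the multimodal fusion of Section 2.2, the sketch below re-ranks scene results by a weighted combination of the normalised Solr text score and the scene's detection scores for the query-relevant concepts. The linear weighting, the min-max normalisation and the field naming are assumptions; the submitted runs follow the fusion approach of [8].

```python
def fuse_and_rerank(scenes, relevant_concepts, alpha=0.7):
    """Re-rank scene results by a weighted sum of text and visual concept scores.

    `scenes` is a list of dicts holding the Solr text 'score' and per-concept float
    fields named 'concept_<label>_f' (the naming convention is an assumption).
    `relevant_concepts` are the labels selected for this query by the WordNet or
    Google Image strategy. `alpha` weights text against visual evidence and is an
    illustrative value, not the one used in the runs.
    """
    if not scenes:
        return []

    # Min-max normalise the text scores so they are comparable to detector outputs.
    t_scores = [s["score"] for s in scenes]
    t_min, t_max = min(t_scores), max(t_scores)
    span = (t_max - t_min) or 1.0

    reranked = []
    for s in scenes:
        text = (s["score"] - t_min) / span
        visual = 0.0
        if relevant_concepts:
            visual = sum(s.get("concept_%s_f" % c, 0.0) for c in relevant_concepts)
            visual /= len(relevant_concepts)
        reranked.append((alpha * text + (1.0 - alpha) * visual, s))

    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in reranked]
```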
3. HYPERLINKING SUB-TASK

Pivotal to the hyperlinking task is the ability to automatically craft an effective query from the video anchor under consideration, to search within the annotated set of media. We submitted two alternative approaches: one using the MoreLikeThis (MLT) Solr extension, and the other using Solr's query engine directly. MLT is used in combination with the sentence segments (S), using either text (MLT1) or text and annotations [2] (MLT2). When Solr is used directly, we consider text only (TXT) or text with the visual concept scores of anchors (MM) to formulate queries. Keywords appearing within the query anchor's subtitles compose the textual part of the query. Visual concepts whose scores within the query anchor exceed a 0.7 threshold are identified as relevant to the video anchor and added to the Solr query. Both visual-scene (VS) and topic-scene (TS) granularities are evaluated in this approach.

4. RESULTS

4.1 Search sub-task

Table 1 shows the performance of our search runs. Our best performing approach (TXT VS M), according to MAP, relies on the manual transcript only, segmented according to visual scenes. Looking at the precision scores at 5, 10 and 20, one can notice that the multimodal approaches using WordNet (MM VS WN M) and Google Images (MM VS GI M) boost the performance of the text-only approaches. There is a clear performance drop whenever ASR transcripts (I, U or S) are employed instead of subtitles (M).

Table 1: Results of the Search sub-task
Run           MAP     P@5     P@10    P@20
TXT TS I      0,4664  0,6533  0,6167  0,5317
TXT TS M      0,4871  0,6733  0,6333  0,545
TXT TS S      0,4435  0,66    0,6367  0,54
TXT TS U      0,4205  0,6467  0,6     0,5133
TXT S I       0,2784  0,6467  0,57    0,4133
TXT S M       0,3456  0,6333  0,5933  0,48
TXT S S       0,1672  0,3926  0,3815  0,3019
TXT S U       0,3144  0,66    0,6233  0,48
TXT VS I      0,4672  0,66    0,62    0,53
TXT VS M      0,5172  0,68    0,6733  0,5933
TXT VS S      0,465   0,6933  0,6367  0,5317
TXT VS U      0,4208  0,6267  0,6067  0,53
MM VS WN M    0,5096  0,7     0,6967  0,5833
MM VS GI M    0,509   0,6667  0,68    0,5933

4.2 Hyperlinking sub-task

Table 2 shows the performance of our hyperlinking runs. Again, the approach based on the subtitle only (TXT VS M) performed best (MAP=0,25), followed by the approach using MoreLikeThis (TXT S MLT1 M). The multimodal approaches did not produce the expected performance improvement. We believe this is due to the significant reduction in anchor duration compared with last year.

Table 2: Results of the Hyperlinking sub-task
Run             MAP     P@5     P@10    P@20
TXT S MLT2 I    0,0502  0,2333  0,1833  0,1117
TXT S MLT2 M    0,1201  0,3667  0,3267  0,2217
TXT S MLT2 S    0,0855  0,2067  0,2233  0,1717
TXT VS M        0,2524  0,504   0,448   0,328
TXT S MLT1 I    0,0798  0,3     0,2462  0,1635
TXT S MLT1 M    0,1511  0,4167  0,375   0,2687
TXT S MLT1 S    0,1118  0,3     0,2857  0,2143
TXT S MLT1 U    0,1068  0,2692  0,2577  0,2038
MM VS M         0,1201  0,3     0,2885  0,1923
MM TS M         0,1048  0,3538  0,2654  0,1692

5. CONCLUSION

The results of LinkedTV's approaches on the 2014 MediaEval S&H task show that it is difficult to improve over text-based approaches when no visual cues are provided. Overall, the performance of our S&H algorithms on this year's dataset has decreased compared to 2013, showing that the changes in the task definition have made the task harder to solve.

6. ACKNOWLEDGMENTS

This work was supported by the European Commission under contract FP7-287911 LinkedTV.

7. REFERENCES

[1] E. Apostolidis and V. Mezaris. Fast shot segmentation combining global and local visual descriptors. In 2014 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 6583–6587, Italy.
[2] M. Dojchinovski and T. Kliegr. Entityclassifier.eu: Real-Time Classification of Entities in Text with Wikipedia. In H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, editors, Machine Learning and Knowledge Discovery in Databases, volume 8190 of Lecture Notes in Computer Science, pages 654–658. Springer, 2013.
[3] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2014. In MediaEval 2014 Workshop, Spain.
[4] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1):89–108, 2002.
[5] T. Hain, A. El Hannani, S. N. Wrigley, and V. Wan. Automatic speech recognition for scientific purposes - webASR. In Interspeech, Australia, pages 504–507, 2008.
[6] P. Over et al. TRECVID 2012 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics. In Proceedings of TRECVID 2012. NIST, USA, 2012.
[7] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC 2014, Iceland.
[8] B. Safadi, M. Sahuguet, and B. Huet. When textual and visual information join forces for multimedia retrieval. In ACM ICMR 2014, Glasgow, Scotland.
[9] P. Sidiropoulos, V. Mezaris, and I. Kompatsiaris. Enhancing video concept detection with the use of tomographs. In IEEE ICIP 2013, Australia.
[10] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Trancoso. Temporal Video Segmentation to Scenes Using High-Level Audiovisual Features. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1163–1177, Aug. 2011.
[11] D. Stein, S. Eickeler, R. Bardeli, E. Apostolidis, V. Mezaris, and M. Müller. Think Before You Link – Meeting Content Constraints when Linking Television to the Web. In Proc. NEM Summit, Nantes, France, Oct. 2013.
[12] S. Tschopel and D. Schneider. A lightweight keyword and tag-cloud retrieval algorithm for automatic speech recognition transcripts. In Interspeech, Japan, 2010.