LinkedTV at MediaEval 2013 Search and Hyperlinking Task

M. Sahuguet1, B. Huet1, B. Červenková2, E. Apostolidis3, V. Mezaris3, D. Stein4, S. Eickeler4, J.L. Redondo Garcia1, R. Troncy1, and L. Pikora2

1 Eurecom, Sophia Antipolis, France. [sahuguet,huet,redondo,troncy]@eurecom.fr
2 University of Economics, Prague, Czech Republic. [barbora.cervenkova,lukas.pikora]@vse.cz
3 Information Technologies Institute, Thessaloniki, Greece. [apostolid,bmezaris]@iti.gr
4 Fraunhofer IAIS, Sankt Augustin, Germany. [daniel.stein,stefan.eickeler]@iais.fraunhofer.de

ABSTRACT
This paper presents the results of LinkedTV's first participation in the Search and Hyperlinking task at the MediaEval 2013 benchmark. We used textual information (transcripts, subtitles and metadata) and tested its combination with automatically detected visual concepts. We submitted various runs in order to compare these approaches and to measure the improvement obtained by adding visual information.

1. INTRODUCTION
This paper describes the framework used by the LinkedTV team to tackle the problem of Search and Hyperlinking within a video collection [2]. The applied techniques originate from the LinkedTV project (http://www.linkedtv.eu/), which aims at integrating the TV and Internet experiences by enabling the user to access additional information and media resources aggregated from diverse sources, thanks to automatic media annotation.

2. PRE-PROCESSING STEP
Concept detection was performed on the key-frames of each video, following the approach in [6], while the algorithm for Optical Character Recognition (OCR) described in [8] was used for text localization. Moreover, for each video we extracted keywords from the provided subtitles, based on the algorithm presented in [9]. Finally, we grouped the predefined video shots into bigger segments (scenes), based on the visual similarity and the temporal consistency among them, using the method introduced in [7].

3. OUR FRAMEWORK
3.1 Lucene indexing
We indexed all available data in a Lucene index at different granularities: video level, scene level, shot level, and segments created with a sliding window algorithm [3]. Documents were represented by both textual fields (for text search) and floating point fields (for the visual concepts).

3.2 From visual cues to detected concepts
Text search is straightforward with Lucene, using the default text search based on TF-IDF values. In order to incorporate visual information into the search, we mapped keywords extracted from the visual cues query (using Alchemy API, http://www.alchemyapi.com/) to visual concepts, using a semantic word distance based on WordNet synsets [5]. When visual concepts were detected in the query, we enriched the textual query with range queries on the values of the corresponding visual concepts.
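As a rough illustration of the indexing scheme and the query enrichment described above, the sketch below indexes one segment with a text field plus one floating point field per detected visual concept, and then builds a query that combines the free-text part with range constraints on the matched concepts. It assumes a Lucene 4.x API; the class name SegmentIndexer, the field names (id, granularity, text, concept_*) and the 0.5 score threshold are illustrative choices, not the exact configuration of our runs.

```java
import java.io.File;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SegmentIndexer {

    private static final Version LUCENE = Version.LUCENE_46;
    private final Analyzer analyzer = new StandardAnalyzer(LUCENE);

    public IndexWriter openWriter(String indexDir) throws Exception {
        return new IndexWriter(FSDirectory.open(new File(indexDir)),
                               new IndexWriterConfig(LUCENE, analyzer));
    }

    /** Index one segment (video, scene, shot or sliding window) as a Lucene document. */
    public void addSegment(IndexWriter writer, String id, String granularity,
                           String text, Map<String, Float> conceptScores) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new StringField("granularity", granularity, Field.Store.YES)); // video / scene / shot / window
        doc.add(new TextField("text", text, Field.Store.YES));                 // subtitles, transcript, metadata
        for (Map.Entry<String, Float> c : conceptScores.entrySet()) {
            // one float field per visual concept, holding its detection score
            doc.add(new FloatField("concept_" + c.getKey(), c.getValue(), Field.Store.YES));
        }
        writer.addDocument(doc);
    }

    /** Build the enriched query: free text plus range constraints on the matched visual concepts. */
    public Query buildQuery(String textQuery, Iterable<String> matchedConcepts) throws Exception {
        BooleanQuery query = new BooleanQuery();
        query.add(new QueryParser(LUCENE, "text", analyzer).parse(textQuery), BooleanClause.Occur.MUST);
        for (String concept : matchedConcepts) {
            // reward segments where the mapped concept was detected with sufficient confidence
            query.add(NumericRangeQuery.newFloatRange("concept_" + concept, 0.5f, 1.0f, true, true),
                      BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}
```

With this construction, the textual clause is mandatory while the concept range clauses act as optional boosts, so segments whose stored concept scores fall within the requested range are ranked higher without excluding purely textual matches.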
3.3 Search task
We combined the textual and visual parts of a query to perform the search. Two strategies were adopted: we either searched the segments indexed in the Lucene engine directly, or created segments on the fly by merging video segments based on their scores.

Performing a text query on the video-level index often returned the relevant video at the top of the list. Hence, some runs first restricted the pool of videos to be searched to a small number, and then performed additional queries for smaller segments inside this pool.

We submitted 9 runs in total:
• scenes-C: Scene search using textual and visual cues.
• scenes-noC: Same as the previous run, using textual cues only (no visual cues), for comparison purposes.
• part-sc-C: Partial scene search from a shot boundary, using textual and visual cues, in three steps: filtering the list of videos; querying for shots inside each video; ordering them by score. As a shot is too small a unit to be returned to a viewer, we extended the segment to the end of the scene that includes this shot.
• part-sc-noC: Same as the previous run, using textual cues only.
• cl10-C: Temporal clustering of shots within a video, using textual and visual cues, in the following manner: filtering the set of videos to search; computing scores for every shot in the video; clustering together shots that are less than 10 seconds apart (their scores were added to form the final score). A sketch of this grouping step is given after the run list.
• cl10-noC: Same as the previous run, using text search only.
• scenes-S, scenes-U, scenes-I: Scene search using only textual cues from the subtitles or a transcript, without metadata.
• SW-60-I, SW-60-S: Search over segments created by the sliding window algorithm, with a window size of 60, for the LIMSI/Vocapia transcripts [4] and for the subtitles, respectively.
• SW-40-U: Same as above for the LIUM transcripts, with a window size of 40.
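The shot grouping used in the cl10 runs can be illustrated as follows: retrieved shots of a video are sorted by start time, consecutive shots less than 10 seconds apart are merged into one segment, and the shot scores are added to score the segment. This is a minimal sketch only; the Shot and Segment containers, and the interpretation of the 10-second threshold as the gap between consecutive shots, are simplifications made for this example rather than our actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal sketch of the cl10 grouping: merge retrieved shots that are close in time. */
public class ShotClustering {

    public static class Shot {
        public final double start, end; // shot boundaries in seconds
        public final double score;      // retrieval score of the shot
        public Shot(double start, double end, double score) {
            this.start = start; this.end = end; this.score = score;
        }
    }

    public static class Segment {
        public double start, end;
        public double score;            // sum of the scores of the merged shots
        Segment(Shot s) { start = s.start; end = s.end; score = s.score; }
    }

    /** Cluster the shots of one video: shots less than maxGap seconds apart end up in the same segment. */
    public static List<Segment> cluster(List<Shot> shots, double maxGap) {
        List<Shot> sorted = new ArrayList<>(shots);
        sorted.sort(Comparator.comparingDouble(s -> s.start));

        List<Segment> segments = new ArrayList<>();
        Segment current = null;
        for (Shot shot : sorted) {
            if (current != null && shot.start - current.end < maxGap) {
                current.end = Math.max(current.end, shot.end);  // extend the current segment
                current.score += shot.score;                    // scores are added
            } else {
                current = new Segment(shot);                    // start a new segment
                segments.add(current);
            }
        }
        // the resulting segments are then ranked by their accumulated score
        segments.sort((a, b) -> Double.compare(b.score, a.score));
        return segments;
    }
}
```

The same grouping serves both cl10-C and cl10-noC; only the per-shot scores differ, depending on whether visual cues were included in the query.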
3.4 Hyperlinking task
A first approach consisted in reusing the search component with the scene approach and the shot clustering approach. A query was crafted from the anchor: the text query was built by extracting keywords from the subtitles aligned with the start and end times of the anchor, and visual concept scores were extracted from the keyframes of the shots contained in the anchor. If the anchor spanned more than one shot, we took for each concept the highest score over all shots.

A second approach made use of the MoreLikeThis Solr component (MLT) combined with Entityclassifier.eu annotations [1]. We created a temporary document from the query as the root for searching similar documents, and performed the search over segments from the LIMSI transcripts created using sliding windows and enriched with synonyms.

4. RESULTS
4.1 Search task
The results of the search task are listed in Table 1. We first notice that, given the same conditions, subtitles perform significantly better than any of the transcripts, which is an expected outcome. It is also interesting to note that using the visual concepts in the query slightly increases the results for all measures (e.g., cl10-C vs. cl10-noC).

Table 1: Results of the Search task
Run          MRR     mGAP    MASP
scenes-C     0.3095  0.1770  0.1951
scenes-noC   0.3091  0.1767  0.1947
scenes-S     0.3152  0.1635  0.2021
scenes-I     0.2613  0.1444  0.1582
scenes-U     0.2458  0.1344  0.1528
part-sc-C    0.2284  0.1241  0.1024
part-sc-noC  0.2281  0.1240  0.1021
cl10-C       0.2929  0.1525  0.1814
cl10-noC     0.2849  0.1479  0.1713
SW-60-S      0.2833  0.1925  0.2027
SW-60-I      0.1965  0.1206  0.1204
SW-40-U      0.2368  0.1342  0.1501

Overall, the best approaches are those using scenes and sliding windows. Scene-based approaches retrieve a higher number of correct relevant segments within a time window of 60 seconds (higher MRR), but they are not the most precise in terms of start and end times, compared to the sliding window approach (as suggested by the mGAP and MASP scores).

4.2 Hyperlinking task
The results are listed in Table 2. For both the LA and LC conditions, runs using scenes outperform the other runs on all metrics. The MoreLikeThis/Entityclassifier.eu approach comes second. As expected, using the context increases the precision when hyperlinking video segments. It is also notable that the precision at rank n decreases when n increases.

Table 2: Results of the Hyperlinking task
Run        MAP     P-5     P-10    P-20
LA cl10    0.0577  0.4467  0.3200  0.2067
LA MLT     0.1201  0.4200  0.4200  0.3217
LA scenes  0.1770  0.6867  0.5867  0.4167
LC cl10    0.0823  0.5733  0.4833  0.2767
LC MLT     0.1820  0.5667  0.5667  0.4300
LC scenes  0.2523  0.8133  0.7300  0.5283

5. CONCLUSION
This paper presented our framework and results for the MediaEval Search and Hyperlinking task. From our runs, it is clear that scene segmentation is the approach with the best performance. This approach should therefore be studied in more depth; a potential improvement would be to refine the segmentation using semantic or speaker information. We also see that the task benefits from the use of the visual information present in the video. Hence, these two axes should be the next steps to study for a future challenge.

6. ACKNOWLEDGMENTS
This work was supported by the European Commission under contracts FP7-287911 LinkedTV and FP7-318101 MediaMixer.

7. REFERENCES
[1] M. Dojchinovski and T. Kliegr. Entityclassifier.eu: Real-Time Classification of Entities in Text with Wikipedia. In H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, editors, Machine Learning and Knowledge Discovery in Databases, volume 8190 of Lecture Notes in Computer Science, pages 654–658. Springer Berlin Heidelberg, 2013.
[2] M. Eskevich, G. J. F. Jones, S. Chen, R. Aly, and R. Ordelman. The Search and Hyperlinking Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] M. Eskevich, G. Jones, C. Wartena, M. Larson, R. Aly, T. Verschoor, and R. Ordelman. Comparing retrieval effectiveness of alternative content segmentation methods for Internet video search. In Content-Based Multimedia Indexing (CBMI), 2012 10th International Workshop on, pages 1–6, 2012.
[4] L. Lamel and J.-L. Gauvain. Speech processing for audio indexing. In Advances in Natural Language Processing (LNCS 5221), pages 4–15. Springer, 2008.
[5] D. Lin. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 296–304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[6] P. Sidiropoulos, V. Mezaris, and I. Kompatsiaris. Enhancing video concept detection with the use of tomographs. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2013.
[7] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Trancoso. Temporal Video Segmentation to Scenes Using High-Level Audiovisual Features. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1163–1177, Aug. 2011.
[8] D. Stein, S. Eickeler, R. Bardeli, E. Apostolidis, V. Mezaris, and M. Müller. Think Before You Link – Meeting Content Constraints when Linking Television to the Web. In Proc. NEM Summit, Nantes, France, Oct. 2013. To appear.
[9] S. Tschöpel and D. Schneider. A lightweight keyword and tag-cloud retrieval algorithm for automatic speech recognition transcripts. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2010.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.