The Search and Hyperlinking Task at MediaEval 2013

Maria Eskevich1, Robin Aly2, Roeland Ordelman2, Shu Chen3, Gareth J.F. Jones1
1 CNGL Centre for Global Intelligent Content, Dublin City University, Ireland
2 University of Twente, The Netherlands
3 INSIGHT Centre for Data Analytics, Dublin City University, Ireland
{meskevich, gjones}@computing.dcu.ie, shu.chen4@mail.dcu.ie, {r.aly, ordelman}@ewi.utwente.nl

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT
The Search and Hyperlinking Task formed part of the MediaEval 2013 evaluation campaign. The Task consisted of two sub-tasks: (1) answering known-item queries from a collection of roughly 1200 hours of broadcast TV material, and (2) linking anchors within the known-item to other parts of the video collection. We provide an overview of the task and the data sets used.

1. INTRODUCTION
The increasing amount of digital multimedia content available is inspiring new scenarios of user interaction. The Search and Hyperlinking Task at MediaEval 2013 envisioned the following scenario: a user is searching for a segment of video that they know to be contained in a video collection (henceforth the target "known-item"). If the user finds the segment, they may wish to find additional information about some aspect of this segment. Computer systems should support users in this use scenario by providing links to satisfy the user's information needs. This use scenario is a refinement of a similar task at MediaEval 2012; see [4] for an overview of employed techniques. This paper describes the experimental data set provided to task participants for MediaEval 2013 and details of the two subtasks and their evaluation.

2. EXPERIMENTAL DATASET
The dataset for both subtasks was a collection of 1,260 hours of video provided by the BBC. The average length of a video was roughly 30 minutes and most videos were in the English language. The collection was used both for training and testing of systems. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of two automatic speech recognition (ASR) systems and visual analysis. We describe these information sources in the following subsections.

2.1 Speech recognition transcripts
The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, number of channels = 1); a sketch of this extraction step is given at the end of this subsection. Based on this data, two sets of ASR transcripts were created:
(i) All audio files were transcribed by LIMSI-CNRS/Vocapia (http://www.vocapia.com/) using the VoxSigma vrbs trans system (version eng-usa 4.0) [7]. The models used by the system have been updated with partial support from the Quaero program [6].
(ii) The LIUM system (http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst) [10] is based on the CMU Sphinx project, and was developed to participate in the evaluation campaign of the International Workshop on Spoken Language Translation 2011. LIUM generated an English transcript for each audio file successfully processed. These results consist of: (i) one-best hypotheses in NIST CTM format, (ii) word lattices in SLF (HTK) format, following a 4-gram topology, and (iii) confusion networks, in an ATT FSM-like format.
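The extraction step can be illustrated with a short script. The exact ffmpeg invocation and output container used for the task are not specified above, so the following is only a minimal sketch assuming a mono 16 kHz WAV output; the function name extract_audio and the file names are illustrative.

```python
# Minimal sketch of the audio extraction step (illustrative, not the exact
# command used for the task). Assumes ffmpeg is installed and that a mono
# 16 kHz WAV file is an acceptable output format.
import subprocess
from pathlib import Path

def extract_audio(video_path: Path, audio_path: Path) -> None:
    """Extract a mono, 16,000 Hz audio track from a video file with ffmpeg."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", str(video_path),  # input video from the collection
            "-vn",                  # discard the video stream
            "-ac", "1",             # number of channels = 1
            "-ar", "16000",         # sample rate = 16,000 Hz
            "-y",                   # overwrite any existing output
            str(audio_path),
        ],
        check=True,
    )

if __name__ == "__main__":
    extract_audio(Path("episode.mp4"), Path("episode.wav"))
```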
2.2 Video cues
In addition to spoken content, visual descriptions of video content can potentially help for searching and hyperlinking. We provided the participants with shot boundaries, one extracted keyframe per shot, as well as the outputs of concept detectors and face detectors (both described below) for these keyframes.
For each video, shot boundaries were determined and a single keyframe per shot was extracted by a system kindly provided by Technicolor [8]. The extracted frame was the most stable I-frame within its shot. In total, the system extracted approximately 1,200,000 shots/keyframes. Concept detection scores for a list of concepts were provided. These concepts were selected by extracting keywords from metadata and spoken content. We used the on-the-fly video detector Visor, which was kindly provided by the Computer Vision Group of the University of Oxford [2]. To make the confidence scores comparable over multiple detectors, we used them as variables in a logistic regression framework, which ensures the scores lie in the range [0, 1]. We set the logistic regression parameters to the expected value of the parameters from over 374 detectors on the internet archive collection used in TRECVid 2011.
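The calibration described above can be sketched as follows. The slope and intercept values below are hypothetical placeholders; the task used the expected values of the parameters estimated over the 374 TRECVid 2011 detectors, which are not reproduced here.

```python
# Minimal sketch of mapping raw concept detector confidences through a
# logistic function so that all detectors produce scores in [0, 1].
# The parameter values are hypothetical; the actual values were derived
# from previously trained detectors (see text).
import math

A_EXPECTED = 2.0   # hypothetical expected slope
B_EXPECTED = -1.0  # hypothetical expected intercept

def calibrate(raw_score: float, a: float = A_EXPECTED, b: float = B_EXPECTED) -> float:
    """Map a raw detector confidence to a comparable score in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))

# Scores from detectors with different raw scales become directly comparable.
print(calibrate(0.3))   # ~0.40
print(calibrate(1.7))   # ~0.92
```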
The appearance of faces in videos can be helpful information for search and linking. INRIA [3] kindly provided possible bounding boxes in keyframes with a confidence score that the bounding box contains a face. Additionally, the tool also provided, for each bounding box, the n most similar faces (bounding boxes) in the dataset.

3. USER STUDY
For the definition of realistic queries and anchors, we conducted a study with 30 users between the ages of 18 and 30. By browsing the collection, the users selected items, i.e. segments of a video with a start and an end time, that were interesting to them. The users were then instructed to consider these items as known-items which they have to refind. We asked the users to formulate text and visual queries that they would use in a search engine to carry out their refinding. The study resulted in 50 known-items and corresponding multimodal queries. Subsequently, we asked the users to mark so-called anchors, i.e. segments within the known-item related to other items from within the collection, for which they would like to see links. A second session of the study was conducted after the Task participants submitted their results. A set of users partially overlapping with the first group (17 participants) were presented with the selected anchors and with the hyperlinks proposed by the participants. The users had to assess the suitability of the proposed hyperlinks. Returning users assessed the anchors that they defined themselves. The reader can find a more elaborate description of this user study in [1].

4. SEARCH SUBTASK
We are interested in a cross-comparison of one method being applied to all three types of transcripts. Thus, we required the participants to submit up to 5 different approaches or their combinations, each being tested on all three transcripts.
We used the following three metrics to evaluate the submissions of the workshop participants: mean reciprocal rank (MRR), mean generalized average precision (mGAP) and mean average segment precision (MASP). MRR assesses the ranking of the relevant units. mGAP [9] rewards techniques that not only find the relevant items earlier in the ranked output list, but are also closer to the ideal point to begin playback (the "jump-in" point) of the relevant content. MASP [5] takes into account the ranking of the results and the length of both relevant and irrelevant segments that need to be listened to before reaching the relevant content.
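To make the behaviour of these metrics concrete, the sketch below computes MRR exactly and adds a simplified "jump-in aware" reciprocal rank that discounts a result as its playback start point drifts from the ideal one. The discount function and the 30-second tolerance are illustrative assumptions, not the official metrics; the precise definitions of mGAP and MASP are given in [9] and [5].

```python
# Minimal sketch of the search metrics' flavour. mean_reciprocal_rank is the
# standard MRR; jump_in_reciprocal_rank is an illustrative simplification of
# the mGAP idea (rewarding results whose start point is close to the ideal
# jump-in point), not the official task metric.
from typing import List

def mean_reciprocal_rank(ranks: List[int]) -> float:
    """ranks[i] is the 1-based rank of the relevant item for query i (0 if not found)."""
    return sum(1.0 / r for r in ranks if r > 0) / len(ranks)

def jump_in_reciprocal_rank(rank: int, start_offset_sec: float,
                            tolerance_sec: float = 30.0) -> float:
    """Reciprocal rank, linearly discounted as the returned jump-in point
    moves away from the ideal start of the relevant content."""
    if rank <= 0:
        return 0.0
    discount = max(0.0, 1.0 - abs(start_offset_sec) / tolerance_sec)
    return (1.0 / rank) * discount

print(mean_reciprocal_rank([1, 3, 0, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
print(jump_in_reciprocal_rank(2, 12.0))    # (1/2) * (1 - 12/30) = 0.3
```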
5. LINKING SUBTASK
For the Hyperlinking subtask, the workshop participants were provided with the so-called anchors created by the users in the user study at the BBC and had to generate link targets. To be more concrete, the participants had to return a list of potential video segment link targets ranked by the likelihood of being relevant to the anchor, or to the anchor in the context of the corresponding known-item segment (though always independently of the initial known-item query).
To evaluate the linking subtask we used crowdsourcing via Amazon's Mechanical Turk platform (www.mturk.com), whereas the second stage of the user study at the BBC allowed us to assess the reliability of the crowdsourcing results.
Due to time and resource constraints, we chose a random subset of 30 anchors out of the initial 98 for the formal task assessment. For these anchors and potential links, we used a pooling method to group the videos from the top 10 ranks of no more than 5 submitted runs of each of the participants. Submissions were selected to maximize the diversity of the linking methods used in the pools to be assessed. This resulted in 9195 anchor-target pairs, which represented 7637 distinct pairs for crowdsourcing assessment. Users in the BBC study evaluated only 1 run per participant, which resulted in 2081 pairs, of which 2078 were distinct. The manual assessment of these links resulted in the ground truth used to calculate precision at fixed rank cutoffs and MAP for all the participants' runs. Both the Mechanical Turk and BBC ground truths were released to the participants for further performance analysis.

6. ACKNOWLEDGEMENTS
This work was supported by Science Foundation Ireland under the Research Frontiers Programme 2008 (Grant 08/RFP/CMS1677) and as part of the Centre for Next Generation Localisation (CNGL) project at DCU (Grant 07/CE/I1142), and by funding from the European Commission's 7th Framework Programme (FP7) under AXES ICT-269980. The user studies were executed in collaboration with Jana Eggink and Andy O'Dwyer from BBC Research, to whom the authors are grateful.

7. REFERENCES
[1] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In WWW (Companion Volume), pages 457–460, 2013.
[2] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference (BMVC 2011), Dundee, United Kingdom, 2011.
[3] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised Metric Learning for Face Identification in TV Video. In Proceedings of ICCV 2011, Barcelona, Spain, 2011.
[4] M. Eskevich, G. J. Jones, R. Aly, R. J. Ordelman, S. Chen, D. Nadeem, C. Guinaudeau, G. Gravier, P. Sébillot, T. de Nies, P. Debevere, R. Van de Walle, P. Galuscakova, P. Pecina, and M. Larson. Multimedia information seeking through search and hyperlinking. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval, ICMR '13, pages 287–294, 2013.
[5] M. Eskevich, W. Magdy, and G. J. F. Jones. New metrics for meaningful evaluation of informally structured speech retrieval. In Proceedings of ECIR 2012, pages 170–181, Barcelona, Spain, 2012.
[6] J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia Technologies. IWSLT 2010, 2010.
[7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.
[8] A. Massoudi, F. Lefebvre, C. Demarty, L. Oisel, and B. Chupeau. A video fingerprint based on visual digest and local fingerprints. In International Conference on Image Processing (ICIP 2006), pages 2297–2300, 2006.
[9] P. Pecina, P. Hoffmannova, G. J. F. Jones, Y. Zhang, and D. W. Oard. Overview of the CLEF 2007 cross-language speech retrieval track. In Proceedings of CLEF 2007, pages 674–686. Springer, 2007.
[10] A. Rousseau, F. Bougares, P. Deléglise, H. Schwenk, and Y. Estève. LIUM's systems for the IWSLT 2011 Speech Translation Tasks. In Proceedings of IWSLT 2011, San Francisco, USA, 2011.