SAVA at MediaEval 2015: Search and Anchoring in Video Archives

Maria Eskevich (1), Robin Aly (2), Roeland Ordelman (2), David N. Racca (3), Shu Chen (3), Gareth J.F. Jones (3)
(1) EURECOM, Sophia Antipolis, France; (2) University of Twente, The Netherlands; (3) ADAPT Centre, School of Computing, Dublin City University, Ireland
maria.eskevich@gmail.com; {r.aly, ordelman}@ewi.utwente.nl; {dracca, gjones}@computing.dcu.ie; shu.chen4@mail.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
The Search and Anchoring in Video Archives (SAVA) task at MediaEval 2015 consists of two sub-tasks: (i) search for multimedia content within a video archive using multimodal queries referring to information contained in the audio and visual streams, and (ii) automatic selection of video segments within a list of videos that can be used as anchors for further hyperlinking within the archive. The task used a collection of roughly 2700 hours of BBC broadcast TV material for the former sub-task, and about 70 files taken from this collection for the latter sub-task. The search sub-task is based on an ad-hoc retrieval scenario, and is evaluated using a pooling procedure across participants' submissions with crowdsourced relevance assessment using Amazon Mechanical Turk (MTurk). The evaluation used metrics that are variations of MAP adjusted for this task. For the anchor selection sub-task, overlapping regions of interest across participants' submissions were assessed by MTurk workers, and mean reciprocal rank (MRR), precision, and recall were calculated for evaluation.

1. INTRODUCTION
Current developments in the technologies for recording and storing multimedia content are leading to very rapid growth in the resulting multimedia archives. Moreover, the digitisation of content created in previous decades is being added to this contemporary material. This stored information can potentially be used by a wide variety of users, including multimedia professionals, e.g. archivists and journalists, and the general public. We envisage the main aim of the SAVA task as assisting these different users in their interaction with the available collections by facilitating efficient access to relevant content. The solutions to the challenges of the SAVA task should help users: 1) to retrieve interesting parts of archived multimedia documents when issuing audio-visual queries to a search system; and 2) to improve the browsing aspect of this activity by providing users with content that has pre-defined or on-the-fly changing anchor points that can lead them to further discoveries on topics of interest within the collection. Thus the SAVA task consists of two sub-tasks:

• Search for multimedia content: This promotes the development of search methods that use multiple modalities (e.g., speech, visual content, speaker emotions, etc.) to answer search queries by returning relevant video segments of unrestricted size. Similar to the earlier edition of this sub-task in the MediaEval 2013 Search & Hyperlinking task [4], participants were provided with a two-field query, where one field refers to the spoken content and the other to the visual content of relevant segments. Participants could use either or both fields to find video segments within the collection.

• Automatic anchor selection: This explores methods to automatically identify anchors for a given set of videos, where anchors are media fragments (with their boundaries defined by their start and end time) for which users could require additional information. What constitutes an anchor depends on the video, e.g., in a news programme it could be a mention of persons, and in a documentary it could be the view of particular buildings. Participants were provided with a number of videos of different types and were requested to automatically identify anchors within these videos.

2. EXPERIMENTAL DATASET
The dataset for both sub-tasks is a collection of 4021 hours of videos provided by the BBC, split into a development set of 1335 hours and a test set of 2686 hours. The average length of a video was roughly 45 minutes, and most videos were in the English language. The collection consists of broadcast content spanning 01.04.2008 – 11.05.2008 and 12.05.2008 – 31.07.2008 for the development and test sets respectively. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of several content analysis methods, which we describe in the following subsections.

Although both sub-tasks are based on the same collection, each uses a different set of videos. For both development and testing within the ‘Search for multimedia content’ sub-task, participants used the test set of the video collection, while the videos for ‘Automatic anchor selection’ were taken from both the development and test sets in order to have a uniform representation of the files containing previously defined, manually created anchors that were used for sub-task assessment.

2.1 Audio Content
The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, no. of channels = 1).
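As a minimal sketch of this extraction step (assuming the ffmpeg command-line tool is installed and on the PATH; the function name and file names are ours, for illustration only):

```python
# Minimal sketch of the audio extraction step described above, assuming
# the ffmpeg binary is available on the PATH; file names are hypothetical.
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract a 16 kHz, single-channel WAV track from a video file."""
    subprocess.run(
        ["ffmpeg",
         "-i", video_path,  # input video
         "-vn",             # discard the video stream
         "-ar", "16000",    # sample rate = 16,000 Hz
         "-ac", "1",        # one audio channel (mono)
         wav_path],
        check=True,         # raise if ffmpeg reports an error
    )

extract_audio("programme.mp4", "programme.wav")
```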
Based on this data, transcripts were created using the following ASR approaches and provided to participants:

• LIMSI-CNRS/Vocapia (http://www.vocapia.com/), using the VoxSigma vrbs trans system (version eng-usa 4.0) [7].

• The LIUM system (http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst) [11], which is based on the CMU Sphinx project. The LIUM system provided three output formats: (1) one-best transcripts in NIST CTM format, (2) word lattices in SLF (HTK) format, following a 4-gram topology, and (3) confusion networks in a format similar to ATT FSM.

• The NST/Sheffield system (http://www.natural-speech-technology.org), which is trained on multi-genre sets of BBC data that do not overlap with the collection used for the task, and uses deep neural networks [8]. The ASR transcript contains speaker diarization, similar to the LIMSI-CNRS/Vocapia transcripts.

Additionally, prosodic features were extracted using the openSMILE tool, version 2.0 rc1 (http://opensmile.sourceforge.net/) [6]. The following prosodic features were calculated over sliding windows of 10 milliseconds: root mean squared (RMS) energy, loudness, probability of voicing, fundamental frequency (F0), harmonics-to-noise ratio (HNR), voice quality, and pitch direction (classes falling, flat, rising, plus a direction score).
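To illustrate what such frame-level features look like computationally, the following is a minimal sketch of one of them, RMS energy over 10 ms frames of the 16 kHz mono audio. This is not the openSMILE implementation (whose windowing and smoothing differ), and the function name is ours:

```python
# Illustrative sketch (not the openSMILE implementation) of one
# frame-level prosodic feature: RMS energy over 10 ms frames.
import numpy as np

def rms_energy(samples: np.ndarray, sample_rate: int = 16000,
               frame_ms: int = 10) -> np.ndarray:
    """Root mean squared energy per 10 ms frame of a mono signal."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 16 kHz
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
```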
2.2 Visual Content
The computer vision groups at the University of Leuven (KUL) and the University of Oxford (OXU) provided the output of concept detectors for 1,537 concepts from ImageNet (http://image-net.org/popularity_percentile_readme.html) using different training approaches. The approach by KUL uses examples from ImageNet as positive examples [12], while OXU uses an on-the-fly concept detection approach, which downloads training examples through Google image search [3].

3. TASK INPUT DEFINITION
As we assume that the user activities behind both sub-task frameworks can be carried out by professionals as well as a general audience, we involved representatives of both user categories in the ground truth creation:

• Search for multimedia content: 9 development set and 30 test set queries were defined by professionals with the following profile: 1) they worked in the field, e.g. as journalists or archivists; 2) they were native English speakers; and 3) they were generally familiar with BBC content. For each query in the development set, these users defined two relevant video segments in order to ensure the existence of potentially relevant content for an ad hoc search.

• Automatic anchor selection: We used the video files containing the anchors manually defined in the 2013-2014 Search & Hyperlinking tasks [4, 5]: 42 and 33 files for the development and testing of the approaches respectively. The users represented the general public: they had to be 18-30 years old and had to use search engines and services such as YouTube on a daily basis. The anchors provided in this ground truth are by no means exhaustive; they only exemplify potential anchors that can be defined within a given video.

A more elaborate description of the user study design and the anchor definition procedure can be found in [2] and [9] respectively.

4. REQUIRED RUNS
As our evaluation makes use of cross-comparison between runs, we did not limit the number of submissions for either of the sub-tasks. However, we stated that, due to finite resources, only a limited number of runs would be assessed through crowdsourcing.

5. RELEVANCE ASSESSMENT AND EVALUATION METRICS
To evaluate the submissions to the search sub-task, the runs were first normalised: videos whose audio-visual content was corrupted due to bugs in the employed ffmpeg software were dismissed; segments shorter than 10 seconds were expanded to this length; segments longer than 2 minutes were cut after this length (keeping the original segment start); and segments overlapping with previously returned segments were adjusted to remove the overlap. Second, we used the pooling method with selected runs. Third, the top 10 ranks of all submitted runs were evaluated using crowdsourcing. We report precision-oriented metrics, such as precision at various cutoffs and mean average precision (MAP), using different approaches to take segment overlap into account, as described in [1, 10].

For the anchoring sub-task, we used the top-25 ranks of all submissions and merged overlapping segments. The resulting segments were judged by MTurk workers, who gave their opinion on these segments in the context of the videos. For MRR and recall/precision, a result segment in a run is judged relevant if it overlaps with a relevant combined segment.
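The following is a simplified sketch of these segment-level operations, assuming segments are represented as (start, end) pairs in seconds. The function names are ours, and the organisers' exact normalisation and judging rules may differ in detail:

```python
# Simplified sketch of the segment-level evaluation steps; segments are
# (start, end) pairs in seconds, and runs are lists in ranked order.
from typing import List, Tuple

Segment = Tuple[float, float]

MIN_LEN, MAX_LEN = 10.0, 120.0  # 10 seconds and 2 minutes

def normalise_run(run: List[Segment]) -> List[Segment]:
    """Expand short segments, truncate long ones from their original
    start, and trim overlap with segments returned earlier in the run."""
    kept: List[Segment] = []
    for start, end in run:
        if end - start < MIN_LEN:
            end = start + MIN_LEN
        if end - start > MAX_LEN:
            end = start + MAX_LEN
        for s, e in kept:
            if start < e and s < end:   # overlaps an earlier segment
                start = max(start, e)   # keep only the later, new part
        if start < end:
            kept.append((start, end))
    return kept

def merge_overlapping(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping anchor segments pooled across submissions."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def reciprocal_rank(run: List[Segment], relevant: List[Segment]) -> float:
    """1/rank of the first result overlapping a relevant combined
    segment; 0.0 if no result in the run overlaps any such segment."""
    for rank, (start, end) in enumerate(run, start=1):
        if any(start < e and s < end for s, e in relevant):
            return 1.0 / rank
    return 0.0
```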
6. SUMMARY AND CONCLUSIONS
This paper describes the setup of the search and anchoring sub-tasks at MediaEval 2015. While the definition of the search sub-task builds on the experience of several previous years, the anchoring sub-task was new in 2015. Here, we have described the data provided to the task participants and the methods used to generate the input data and to evaluate the submitted results.

7. ACKNOWLEDGMENTS
This work was supported by the European Commission's 7th Framework Programme (FP7) under FP7-ICT 269980 (AXES) and FP7-ICT 287911 (LinkedTV); by Bpifrance within the NexGen-TV Project, under grant number F1504054U; by the Dutch national programme COMMIT/; and by Science Foundation Ireland (Grant No. 12/CE/I2267) as part of the Centre for Next Generation Localisation (CNGL) project at DCU. The user studies were executed in collaboration with Jana Eggink and Andy O'Dwyer from BBC Research, to whom the authors are grateful.

8. REFERENCES
[1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical Report 1312.1913, ArXiv e-prints, 2013.
[2] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection - what and how to measure? In Proceedings of the 22nd International Conference on World Wide Web Companion, IW3C2 2013, Rio de Janeiro, Brazil, pages 457–460, May 2013.
[3] K. Chatfield and A. Zisserman. VISOR: Towards on-the-fly large-scale object category retrieval. In Computer Vision - ACCV 2012, pages 432–446. Springer, 2013.
[4] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[5] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, 2014.
[6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of ACM Multimedia 2013, pages 835–838, Barcelona, Spain, 2013.
[7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.
[8] P. Lanchantin, P. Bell, M. J. F. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. S. Seigel, P. Swietojanski, and P. C. Woodland. Automatic transcription of multi-genre media archives. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM@INTERSPEECH), volume 1012 of CEUR Workshop Proceedings, pages 26–31. CEUR-WS.org, 2013.
[9] R. J. F. Ordelman, M. Eskevich, R. Aly, B. Huet, and G. J. F. Jones. Defining and evaluating video hyperlinking for navigating multimedia archives. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy - Companion Volume, pages 727–732, 2015.
[10] D. N. Racca and G. J. F. Jones. Evaluating Search and Hyperlinking: an example of the design, test, refine cycle for metric development. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[11] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 2014.
[12] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed for cross-dataset analysis. CoRR, abs/1402.5923, 2014.