=Paper=
{{Paper
|id=Vol-1263/paper30
|storemode=property
|title=The Search and Hyperlinking Task at MediaEval 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_30.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/EskevichAROCJ14
}}
==The Search and Hyperlinking Task at MediaEval 2014==
Maria Eskevich (1), Robin Aly (2), David N. Racca (1), Roeland Ordelman (2), Shu Chen (1,3), Gareth J.F. Jones (1)

(1) CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Ireland
(2) University of Twente, The Netherlands
(3) INSIGHT Centre for Data Analytics, Dublin City University, Ireland

{meskevich, dracca, gjones}@computing.dcu.ie, shu.chen4@mail.dcu.ie, {r.aly, ordelman}@ewi.utwente.nl

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 17-18, 2014, Barcelona, Spain.

ABSTRACT

The Search and Hyperlinking Task at MediaEval 2014 is the third edition of this task. As in previous editions, it consisted of two sub-tasks: (i) answering search queries from a collection of roughly 2700 hours of BBC broadcast TV material, and (ii) linking anchor segments from within the videos to other target segments within the video collection. For MediaEval 2014, both sub-tasks were based on an ad-hoc retrieval scenario, and were evaluated using a pooling procedure across participants' submissions, with relevance assessment crowdsourced on Amazon Mechanical Turk.

1. INTRODUCTION

The full value of the rapidly growing archives of newly produced digital multimedia content, and of the digitisation of previously created analogue audio and video material, will only be realised with the development of technologies that allow users to explore them through search and retrieval of potentially interesting content.

The Search and Hyperlinking Task at MediaEval 2014 envisioned the following scenario: a user is searching for relevant segments within a video collection that address a certain topic of interest expressed in a query. If the user finds a segment which is relevant to their initial information need expressed through the query, they may wish to find additional information about some aspect of this segment.

The task framework asks participants to create systems that support the search and linking aspects of this scenario. The use scenario is the same as in the Search and Hyperlinking task 2013 [4], with the main difference being that the search sub-task has changed from known-item to ad-hoc retrieval. This paper describes the experimental data set provided to task participants for MediaEval 2014, the details of the two sub-tasks, and their evaluation.

2. EXPERIMENTAL DATASET

The dataset for both sub-tasks was a collection of 4021 hours of videos provided by the BBC, which we split into a development set of 1335 hours, coinciding with the test collection used in the 2013 edition of this task, and a test set of 2686 hours. The average length of a video was roughly 45 minutes, and most videos were in the English language. The collection consists of broadcast content whose dates span 01.04.2008 – 11.05.2008 for the development set and 12.05.2008 – 31.07.2008 for the test set. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of several content analysis methods, which we describe in the following subsections.

2.1 Audio Analysis

The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, number of channels = 1).
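For illustration, a minimal sketch of this extraction step, assuming a standard ffmpeg installation on the PATH; the file names are illustrative and not part of the task release:

```python
# Sketch of the audio extraction described above: 16 kHz, mono WAV.
# Assumes ffmpeg is installed; file names are hypothetical.
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract a 16,000 Hz single-channel audio track from a video."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,   # input video file
            "-vn",              # discard the video stream
            "-ac", "1",         # one audio channel (mono)
            "-ar", "16000",     # 16,000 Hz sample rate
            wav_path,
        ],
        check=True,            # raise if ffmpeg exits with an error
    )

extract_audio("broadcast.mp4", "broadcast.wav")
```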
Based on this audio data, transcripts were created using the following ASR approaches and provided to participants:

(i) LIMSI-CNRS/Vocapia (http://www.vocapia.com/), which uses the VoxSigma vrbs_trans system (version eng-usa 4.0) [7]. Compared to the transcripts created for the 2013 edition of this task, the system's models had been updated with partial support from the Quaero program [6].

(ii) The LIUM system (http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst) [10], which is based on the CMU Sphinx project. The LIUM system provided three output formats: (1) one-best transcripts in NIST CTM format, (2) word lattices in SLF (HTK) format, following a 4-gram topology, and (3) confusion networks in a format similar to ATT FSM.

(iii) The NST/Sheffield system (http://www.natural-speech-technology.org), which is trained on multi-genre sets of BBC data that do not overlap with the collection used for the task, and uses deep neural networks [8]. The ASR transcript contains speaker diarization, similar to the LIMSI-CNRS/Vocapia transcripts.
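To illustrate the one-best transcript format, the sketch below reads CTM lines following the usual NIST convention (recording id, channel, start time, duration, word, optional confidence); this is a simplified, hedged reading, not the exact LIUM tooling, and the file name is hypothetical:

```python
# Minimal reader for NIST CTM one-best transcripts, one token per line:
#   <recording-id> <channel> <start-seconds> <duration> <word> [<confidence>]
# Lines beginning with ";;" are comments by convention.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CtmToken:
    recording: str
    channel: str
    start: float            # token start time in seconds
    duration: float         # token duration in seconds
    word: str
    confidence: Optional[float]  # absent in some system outputs

def read_ctm(path: str) -> list[CtmToken]:
    tokens: list[CtmToken] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith(";;"):
                continue  # skip blank and comment lines
            parts = line.split()
            tokens.append(CtmToken(
                recording=parts[0],
                channel=parts[1],
                start=float(parts[2]),
                duration=float(parts[3]),
                word=parts[4],
                confidence=float(parts[5]) if len(parts) > 5 else None,
            ))
    return tokens

tokens = read_ctm("broadcast.ctm")  # hypothetical transcript file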
Additionally, prosodic features were extracted using the openSMILE tool, version 2.0 rc1 [5] (http://opensmile.sourceforge.net/). The following prosodic features were calculated over sliding windows of 10 milliseconds: root mean squared (RMS) energy, loudness, probability of voicing, fundamental frequency (F0), harmonics-to-noise ratio (HNR), voice quality, and pitch direction (classes falling, flat, rising, and a direction score). Prosodic information was provided for the first time in 2014 to encourage participants to explore its potential value for the Search and Hyperlinking sub-tasks.

2.2 Video Analysis

The computer vision groups at the University of Leuven (KUL) and the University of Oxford (OXU) provided the output of concept detectors for 1537 concepts from ImageNet (http://image-net.org/popularity_percentile_readme.html) using different training approaches. The approach by KUL uses examples from ImageNet as positive examples [11], while OXU uses an on-the-fly concept detection approach, which downloads training examples through Google image search [3].

3. USER STUDY

In order to create realistic queries and anchors for our test set, we conducted a study with 28 users aged between 18 and 30 from the general public around London, U.K. The study was similar to our previous study carried out for MediaEval 2013 [2], with the main difference being the focus on information needs with multiple relevant segments. The study focused on a home user scenario; to reflect the current wide usage of tablet computers, participants used a version of the AXES video search system [9] on iPads to search and browse within the video collection. The user study consisted of the following steps: (i) a participant defined an information need using natural language; (ii) they searched the test set with a shorter query, one they might use to search the YouTube video repository; (iii) after selecting several possibly relevant segments, they defined anchor points or regions within each segment and stated what kind of links they would expect for this anchor.

Users were then instructed to define queries that they expected to have more than one relevant video segment in the collection. These queries consisted of several terms, as would be used as input to a standard online search engine, e.g. "sightseeing london". The study resulted in 36 ad-hoc search queries for the test set. The development set for the task consisted of 50 known-item queries from the MediaEval 2013 Search and Hyperlinking task.

Subsequently, as in the 2013 studies, we asked the participants to mark so-called anchors, i.e. segments for which they would like to see links, within some of the segments relevant to the issued search queries. A more elaborate description of this user study design can be found in [2].

4. REQUIRED RUNS SUBMISSIONS AND EVALUATION PROCEDURE FOR THE SEARCH AND LINKING SUB-TASKS

For the 2014 task, as well as ad-hoc search, we were interested in a cross-comparison of methods applied across all four provided transcripts: one manual and three ASR. We therefore allowed participants to submit up to five different approaches, or combinations thereof, each being tested on all four transcripts, for both sub-tasks. Groups that based their methods on video features only could submit this type of run in addition.

To evaluate the submissions to the search and linking sub-tasks, a pooling method was used to select submitted segments and link targets for relevance assessment. The top-N ranks of all submitted runs were assessed using crowdsourcing technologies. We report precision-oriented metrics, such as precision at various rank cutoffs and mean average precision (MAP), using different approaches to take segment overlap into account, as described in [1].
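As a rough illustration of these metrics, the sketch below computes precision at a cutoff and average precision for a single query from binary per-segment judgements; the overlap-aware variants of [1] additionally require segment time spans and are not reproduced here:

```python
# Precision@k and average precision for one query's ranked segment list,
# with each returned segment judged simply relevant (True) or not (False).

def precision_at_k(relevant: list[bool], k: int) -> float:
    """Fraction of the top-k returned segments judged relevant."""
    return sum(relevant[:k]) / k if k > 0 else 0.0

def average_precision(relevant: list[bool]) -> float:
    """Mean of precision values at the rank of each relevant segment.

    Normalised here by the number of relevant segments retrieved; in
    practice the pooled assessments define the full relevant set.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Example judgements for one run; MAP is the mean of AP over all queries.
judgements = [True, False, True, True, False]
print(precision_at_k(judgements, 5))  # 0.6
print(average_precision(judgements))  # (1/1 + 2/3 + 3/4) / 3 ~= 0.806
```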
5. ACKNOWLEDGEMENTS

This work was supported by Science Foundation Ireland (Grants 08/RFP/CMS1677 and 12/CE/I2267) and the Research Frontiers Programme 2008 (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) project at DCU, and by funding from the European Commission's 7th Framework Programme (FP7) under AXES ICT-269980. The user studies were executed in collaboration with Jana Eggink and Andy O'Dwyer from BBC Research, to whom the authors are grateful.

6. REFERENCES

[1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical Report 1312.1913, ArXiv e-prints, 2013.
[2] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In WWW (Companion Volume), pages 457–460, 2013.
[3] K. Chatfield and A. Zisserman. Visor: Towards on-the-fly large-scale object category retrieval. In Computer Vision – ACCV 2012, pages 432–446. Springer, 2013.
[4] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[5] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of ACM Multimedia 2013, pages 835–838, Barcelona, Spain, 2013.
[6] J.-L. Gauvain. The Quaero program: Multilingual and multimedia technologies. In Proceedings of IWSLT 2010, Paris, France, 2010.
[7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.
[8] P. Lanchantin, P. Bell, M. J. F. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. S. Seigel, P. Swietojanski, and P. C. Woodland. Automatic transcription of multi-genre media archives. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM@INTERSPEECH), volume 1012 of CEUR Workshop Proceedings, pages 26–31. CEUR-WS.org, 2013.
[9] K. McGuinness, R. Aly, K. Chatfield, O. Parkhi, R. Arandjelovic, M. Douze, M. Kemman, M. Kleppe, P. van der Kreeft, K. Macquarrie, A. Ozerov, N. E. O'Connor, F. De Jong, A. Zisserman, C. Schmid, and P. Perez. The AXES research video search system. In Proceedings of IEEE ICASSP 2014, 2014.
[10] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 2014.
[11] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed for cross-dataset analysis. CoRR, abs/1402.5923, 2014.