The Search and Hyperlinking Task at MediaEval 2013

Maria Eskevich1, Robin Aly2, Roeland Ordelman2, Shu Chen3, Gareth J.F. Jones1
1 CNGL Centre for Global Intelligent Content, Dublin City University, Ireland
2 University of Twente, The Netherlands
3 INSIGHT Centre for Data Analytics, Dublin City University, Ireland
{meskevich, gjones}@computing.dcu.ie, shu.chen4@mail.dcu.ie, {r.aly, ordelman}@ewi.utwente.nl

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT
The Search and Hyperlinking Task formed part of the MediaEval 2013 evaluation campaign. The Task consisted of two sub-tasks: (1) answering known-item queries from a collection of roughly 1200 hours of broadcast TV material, and (2) linking anchors within the known-item to other parts of the video collection. We provide an overview of the task and the data sets used.

1. INTRODUCTION
The increasing amount of digital multimedia content available is inspiring new scenarios of user interaction. The Search and Hyperlinking Task at MediaEval 2013 envisioned the following scenario: a user is searching for a segment of video that they know to be contained in a video collection (henceforth the target "known-item"). If the user finds the segment, they may wish to find additional information about some aspect of this segment. Computer systems should support users in this use scenario by providing links to satisfy the user's information needs. This use scenario is a refinement of a similar task at MediaEval 2012; see [4] for an overview of employed techniques. This paper describes the experimental data set provided to task participants for MediaEval 2013 and details of the two subtasks and their evaluation.

2. EXPERIMENTAL DATASET
The dataset for both subtasks was a collection of 1,260 hours of video provided by the BBC. The average length of a video was roughly 30 minutes and most videos were in the English language. The collection was used both for training and testing of systems. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of two automatic speech recognition (ASR) systems and visual analysis. We describe these information sources in the following subsections.

2.1 Speech recognition transcripts
The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, number of channels = 1); a sketch of this extraction step is given at the end of this subsection. Based on this data, two sets of ASR transcripts were created:
(i) All audio files were transcribed by LIMSI-CNRS/Vocapia (http://www.vocapia.com/) using the VoxSigma vrbs trans system (version eng-usa 4.0) [7]. The models used by the system have been updated with partial support from the Quaero program [6].
(ii) The LIUM system (http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst) [10] is based on the CMU Sphinx project, and was developed to participate in the evaluation campaign of the International Workshop on Spoken Language Translation 2011. LIUM generated an English transcript for each audio file successfully processed. These results consist of: (i) one-best hypotheses in NIST CTM format, (ii) word lattices in SLF (HTK) format, following a 4-gram topology, and (iii) confusion networks, in an ATT FSM-like format.
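The extraction step can be illustrated with a short script. The exact ffmpeg invocation and output container used for the task are not specified above, so the following is only a minimal sketch assuming a mono 16 kHz WAV output; the function name extract_audio and the file names are illustrative.

```python
# Minimal sketch of the audio extraction step (illustrative, not the exact
# command used for the task). Assumes ffmpeg is installed and that a mono
# 16 kHz WAV file is an acceptable output format.
import subprocess
from pathlib import Path

def extract_audio(video_path: Path, audio_path: Path) -> None:
    """Extract a mono, 16,000 Hz audio track from a video file with ffmpeg."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", str(video_path),  # input video from the collection
            "-vn",                  # discard the video stream
            "-ac", "1",             # number of channels = 1
            "-ar", "16000",         # sample rate = 16,000 Hz
            "-y",                   # overwrite any existing output
            str(audio_path),
        ],
        check=True,
    )

if __name__ == "__main__":
    extract_audio(Path("episode.mp4"), Path("episode.wav"))
```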
2.2 Video cues
In addition to spoken content, visual descriptions of video content can potentially help for searching and hyperlinking. We provided the participants with shot boundaries, one extracted keyframe per shot, as well as the outputs of concept detectors and face detectors (both described below) for these keyframes.
For each video, shot boundaries were determined and a single keyframe per shot was extracted by a system kindly provided by Technicolor [8]. The extracted frame was the most stable I-frame within its shot. In total, the system extracted approximately 1,200,000 shots/keyframes. Concept detection scores for a list of concepts were provided. These concepts were selected by extracting keywords from metadata and spoken content. We used the on-the-fly video detector Visor, which was kindly provided by the Computer Vision Group of the University of Oxford [2]. To make the confidence scores comparable over multiple detectors, we used them as variables in a logistic regression framework, which ensures the scores lie in the range [0, 1]. We set the logistic regression parameters to the expected value of the parameters from over 374 detectors on the internet archive collection used in TRECVid 2011.
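The calibration described above can be sketched as follows. The slope and intercept values below are hypothetical placeholders; the task used the expected values of the parameters estimated over the 374 TRECVid 2011 detectors, which are not reproduced here.

```python
# Minimal sketch of mapping raw concept detector confidences through a
# logistic function so that all detectors produce scores in [0, 1].
# The parameter values are hypothetical; the actual values were derived
# from previously trained detectors (see text).
import math

A_EXPECTED = 2.0   # hypothetical expected slope
B_EXPECTED = -1.0  # hypothetical expected intercept

def calibrate(raw_score: float, a: float = A_EXPECTED, b: float = B_EXPECTED) -> float:
    """Map a raw detector confidence to a comparable score in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))

# Scores from detectors with different raw scales become directly comparable.
print(calibrate(0.3))   # ~0.40
print(calibrate(1.7))   # ~0.92
```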
The appearance of faces in videos can be helpful information for search and linking. INRIA [3] kindly provided possible bounding boxes in keyframes with a confidence score that the bounding box contains a face. Additionally, the tool also provided, for each bounding box, the n most similar faces (bounding boxes) in the dataset.

3. USER STUDY
For the definition of realistic queries and anchors, we conducted a study with 30 users between the ages of 18 and 30. By browsing the collection, the users selected items, i.e. segments of a video with a start and an end time, that were interesting to them. The users were then instructed to consider these items as known-items which they have to refind. We asked the users to formulate text and visual queries that they would use in a search engine to carry out their refinding. The study resulted in 50 known-items and corresponding multimodal queries. Subsequently, we asked the users to mark so-called anchors, i.e. segments within the known-item related to other items from within the collection, for which they would like to see links. A second session of the study was conducted after the Task participants submitted their results. A set of users partially overlapping with the first group (17 participants) were presented with the selected anchors and with the hyperlinks proposed by the participants. The users had to assess the suitability of the proposed hyperlinks. Returning users assessed the anchors that they defined themselves. The reader can find a more elaborate description of this user study in [1].

4. SEARCH SUBTASK
We are interested in a cross-comparison of one method being applied to all three types of transcripts. Thus, we required the participants to submit up to 5 different approaches or their combinations, each being tested on all three transcripts.
We used the following three metrics to evaluate the submissions of the workshop participants: mean reciprocal rank (MRR), mean generalized average precision (mGAP) and mean average segment precision (MASP). MRR assesses the ranking of the relevant units. mGAP [9] rewards techniques that not only find the relevant items earlier in the ranked output list, but are also closer to the ideal point to begin playback (the "jump-in" point) of the relevant content. MASP [5] takes into account the ranking of the results and the length of both relevant and irrelevant segments that need to be listened to before reaching the relevant content.
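To make the behaviour of these metrics concrete, the sketch below computes MRR exactly and adds a simplified "jump-in aware" reciprocal rank that discounts a result as its playback start point drifts from the ideal one. The discount function and the 30-second tolerance are illustrative assumptions, not the official metrics; the precise definitions of mGAP and MASP are given in [9] and [5].

```python
# Minimal sketch of the search metrics' flavour. mean_reciprocal_rank is the
# standard MRR; jump_in_reciprocal_rank is an illustrative simplification of
# the mGAP idea (rewarding results whose start point is close to the ideal
# jump-in point), not the official task metric.
from typing import List

def mean_reciprocal_rank(ranks: List[int]) -> float:
    """ranks[i] is the 1-based rank of the relevant item for query i (0 if not found)."""
    return sum(1.0 / r for r in ranks if r > 0) / len(ranks)

def jump_in_reciprocal_rank(rank: int, start_offset_sec: float,
                            tolerance_sec: float = 30.0) -> float:
    """Reciprocal rank, linearly discounted as the returned jump-in point
    moves away from the ideal start of the relevant content."""
    if rank <= 0:
        return 0.0
    discount = max(0.0, 1.0 - abs(start_offset_sec) / tolerance_sec)
    return (1.0 / rank) * discount

print(mean_reciprocal_rank([1, 3, 0, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
print(jump_in_reciprocal_rank(2, 12.0))    # (1/2) * (1 - 12/30) = 0.3
```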
5. LINKING SUBTASK
For the Hyperlinking subtask, the workshop participants were provided with the so-called anchors created by the users in the user study at the BBC and had to generate link targets. To be more concrete, the participants had to return a list of potential video segment link targets ranked by the likelihood of being relevant to the anchor, or to the anchor in the context of the corresponding known-item segment (though always independently of the initial known-item query).
To evaluate the linking subtask we used crowdsourcing via Amazon's Mechanical Turk platform (www.mturk.com), whereas the second stage of the user study at the BBC allowed us to assess the reliability of the crowdsourcing results.
Due to time and resource constraints, we chose a random subset of 30 anchors out of the initial 98 for the formal task assessment. For these anchors and potential links, we used a pooling method to group the videos from the top 10 ranks of no more than 5 submitted runs of each of the participants. Submissions were selected to maximize the diversity of the linking methods used in the pools to be assessed. This resulted in 9195 anchor-target pairs, which represented 7637 distinct pairs for crowdsourcing assessment. Users in the BBC study evaluated only 1 run per participant, which resulted in 2081 pairs, of which 2078 were distinct. The manual assessment of these links resulted in the ground truth used to calculate precision at fixed rank cutoffs and MAP for all the participants' runs. Both the Mechanical Turk and BBC ground truths were released to the participants for further performance analysis.

6. ACKNOWLEDGEMENTS
This work was supported by Science Foundation Ireland under the Research Frontiers Programme 2008 (Grant 08/RFP/CMS1677) and as part of the Centre for Next Generation Localisation (CNGL) project at DCU (Grant 07/CE/I1142), and by funding from the European Commission's 7th Framework Programme (FP7) under AXES ICT-269980. The user studies were executed in collaboration with Jana Eggink and Andy O'Dwyer from BBC Research, to whom the authors are grateful.

7. REFERENCES
[1] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In WWW (Companion Volume), pages 457–460, 2013.
[2] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference (BMVC 2011), Dundee, United Kingdom, 2011.
[3] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised Metric Learning for Face Identification in TV Video. In Proceedings of ICCV 2011, Barcelona, Spain, 2011.
[4] M. Eskevich, G. J. Jones, R. Aly, R. J. Ordelman, S. Chen, D. Nadeem, C. Guinaudeau, G. Gravier, P. Sébillot, T. de Nies, P. Debevere, R. Van de Walle, P. Galuscakova, P. Pecina, and M. Larson. Multimedia information seeking through search and hyperlinking. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval, ICMR '13, pages 287–294, 2013.
[5] M. Eskevich, W. Magdy, and G. J. F. Jones. New metrics for meaningful evaluation of informally structured speech retrieval. In Proceedings of ECIR 2012, pages 170–181, Barcelona, Spain, 2012.
[6] J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia Technologies. IWSLT 2010, 2010.
[7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.
[8] A. Massoudi, F. Lefebvre, C. Demarty, L. Oisel, and B. Chupeau. A video fingerprint based on visual digest and local fingerprints. In International Conference on Image Processing (ICIP 2006), pages 2297–2300, 2006.
[9] P. Pecina, P. Hoffmannova, G. J. F. Jones, Y. Zhang, and D. W. Oard. Overview of the CLEF 2007 cross-language speech retrieval track. In Proceedings of CLEF 2007, pages 674–686. Springer, 2007.
[10] A. Rousseau, F. Bougares, P. Deléglise, H. Schwenk, and Y. Estève. LIUM's systems for the IWSLT 2011 Speech Translation Tasks. In Proceedings of IWSLT 2011, San Francisco, USA, 2011.