      The Search and Hyperlinking Task at MediaEval 2014

Maria Eskevich^1, Robin Aly^2, David N. Racca^1, Roeland Ordelman^2, Shu Chen^1,3, Gareth J.F. Jones^1
^1 CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Ireland
^2 University of Twente, The Netherlands
^3 INSIGHT Centre for Data Analytics, Dublin City University, Ireland
{meskevich, dracca, gjones}@computing.dcu.ie
shu.chen4@mail.dcu.ie
{r.aly, ordelman}@ewi.utwente.nl

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 17-18, 2014, Barcelona, Spain

ABSTRACT
The Search and Hyperlinking Task at MediaEval 2014 is the third edition of this task. As in previous editions, it consisted of two sub-tasks: (i) answering search queries from a collection of roughly 2700 hours of BBC broadcast TV material, and (ii) linking anchor segments from within the videos to other target segments within the video collection. For MediaEval 2014, both sub-tasks were based on an ad-hoc retrieval scenario, and were evaluated using a pooling procedure across participants' submissions, with crowdsourced relevance assessment using Amazon Mechanical Turk.
1. INTRODUCTION
The full value of the rapidly growing archives of newly produced digital multimedia content, and of the digitisation of previously created analogue audio and video material, will only be realised with the development of technologies that allow users to explore this content through search and retrieval of potentially interesting material.
The Search and Hyperlinking Task at MediaEval 2014 envisioned the following scenario: a user is searching for relevant segments within a video collection that address a certain topic of interest expressed in a query. If the user finds a segment which is relevant to their initial information need expressed through the query, they may wish to find additional information about some aspect of this segment.
The task framework asks participants to create systems that support the search and linking aspects of this scenario. The use scenario is the same as in the Search and Hyperlinking task at MediaEval 2013 [4], with the main difference that the search sub-task has changed from known-item to ad-hoc retrieval. This paper describes the experimental data set provided to task participants for MediaEval 2014, the details of the two sub-tasks, and their evaluation.
2. EXPERIMENTAL DATASET
The dataset for both sub-tasks was a collection of 4021 hours of videos provided by the BBC, which we split into a development set of 1335 hours, coinciding with the test collection used in the 2013 edition of this task, and a test set of 2686 hours. The average length of a video was roughly 45 minutes, and most videos were in the English language. The broadcast dates of the content span 01.04.2008 – 11.05.2008 for the development set and 12.05.2008 – 31.07.2008 for the test set. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of several content analysis methods, which we describe in the following subsections.

2.1 Audio Analysis
The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, number of channels = 1).
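As an illustration of these extraction settings, the following minimal Python sketch invokes ffmpeg with the stated parameters; the file names are hypothetical, and the exact command used by the organisers is not specified in this paper:

    import subprocess

    # Extract a mono audio track resampled to 16,000 Hz, matching the
    # settings stated above: -ac 1 sets one channel, -ar 16000 sets the
    # sample rate. "episode.mp4" and "episode.wav" are placeholders.
    subprocess.run(
        ["ffmpeg", "-i", "episode.mp4", "-ac", "1", "-ar", "16000", "episode.wav"],
        check=True,
    )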
Based on this data, transcripts were created using the following ASR approaches and provided to participants:
(i) The LIMSI-CNRS/Vocapia system^1, which uses the VoxSigma vrbs_trans system (version eng-usa 4.0) [7]. Compared to the transcripts created for the 2013 edition of this task, the system's models had been updated with partial support from the Quaero program [6].
(ii) The LIUM system^2 [10] is based on the CMU Sphinx project. It provided three output formats: (1) one-best transcripts in NIST CTM format, (2) word lattices in SLF (HTK) format, following a 4-gram topology, and (3) confusion networks in a format similar to ATT FSM.
(iii) The NST/Sheffield system^3 is trained on multi-genre sets of BBC data that do not overlap with the collection used for the task, and uses deep neural networks [8]. Its ASR transcripts include speaker diarization, similar to the LIMSI-CNRS/Vocapia transcripts.
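Of the LIUM output formats above, the one-best CTM transcripts are the simplest to consume. The sketch below is a minimal reader, assuming the standard NIST CTM layout of one word per line; it is illustrative Python, not code distributed with the task:

    # Minimal reader for NIST CTM one-best transcripts. Each non-comment
    # line has the columns: source channel start duration word [confidence];
    # lines starting with ";;" are comments.
    def read_ctm(path):
        entries = []
        with open(path) as f:
            for line in f:
                if not line.strip() or line.startswith(";;"):
                    continue
                fields = line.split()
                start, duration = float(fields[2]), float(fields[3])
                confidence = float(fields[5]) if len(fields) > 5 else None
                entries.append((fields[0], start, duration, fields[4], confidence))
        return entries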
Additionally, prosodic features were extracted using the openSMILE tool, version 2.0 rc1 [5]^4. The following prosodic features were calculated over sliding windows of 10 milliseconds: root mean squared (RMS) energy, loudness, probability of voicing, fundamental frequency (F0), harmonics-to-noise ratio (HNR), voice quality, and pitch direction (classes falling, flat, rising, and a direction score). Prosodic information was provided for the first time in 2014 to encourage participants to explore its potential value for the Search and Hyperlinking sub-tasks.
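To make the windowing concrete, the sketch below computes one of these features, RMS energy, over 10 ms frames of a mono 16 kHz signal. It is an illustrative re-implementation using non-overlapping frames for simplicity, not the openSMILE code itself:

    import numpy as np

    def rms_energy(samples, sample_rate=16000, frame_ms=10):
        # 10 ms at 16 kHz = 160 samples per frame.
        frame_len = int(sample_rate * frame_ms / 1000)
        n_frames = len(samples) // frame_len
        frames = np.asarray(samples[:n_frames * frame_len], dtype=np.float64)
        frames = frames.reshape(n_frames, frame_len)
        # Root mean squared energy of each frame.
        return np.sqrt((frames ** 2).mean(axis=1))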
^1 http://www.vocapia.com/
^2 http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst
^3 http://www.natural-speech-technology.org
^4 http://opensmile.sourceforge.net/
2.2 Video Analysis
The computer vision groups at the University of Leuven (KUL) and the University of Oxford (OXU) provided the output of concept detectors for 1537 concepts from ImageNet^5 using different training approaches. The approach by KUL uses examples from ImageNet as positive examples [11], while OXU uses an on-the-fly concept detection approach, which downloads training examples through Google image search [3].
^5 http://image-net.org/popularity_percentile_readme.html
                                                                  tion with Jana Eggink and Andy O’Dwyer from BBC Re-
3.     USER STUDY                                                 search, to whom the authors are grateful.
   In order to create realistic queries and anchors for our       6.   REFERENCES
test set, we conducted a study with 28 users between aged
                                                                   [1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones.
between 18 and 30 from the general public around London,
                                                                       Adapting binary information retrieval evaluation
U.K. The study was similar to our previous study carried out
                                                                       metrics for segment-based retrieval tasks. Technical
for MediaEval 2013 [2], with the main difference being the
                                                                       Report 1312.1913, ArXiv e-prints, 2013.
focus on information needs with multiple relevant segments.
The study focused on a home user scenario, and for this, to        [2] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and
reflect the current wide usage of computer tablets, partici-           S. Chen. Linking inside a video collection: what and
pants used a version of the AXES video search system [9] on            how to measure? In WWW (Companion Volume),
iPads to search and browse within the video collection. The            pages 457–460, 2013.
user study consisted of the following steps: i) a participant      [3] K. Chatfield and A. Zisserman. Visor: Towards
defined an information need using natural language, ii) they           on-the-fly large-scale object category retrieval. In
searched the test set with a shorter query, one they might             Computer Vision–ACCV 2012, pages 432–446.
use to search of the Youtube video repository, 3) after select-        Springer, 2013.
ing several possible relevant segments, they defined anchor        [4] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and
points or regions within each segment and stated what kind             G. J. F. Jones. The Search and Hyperlinking Task at
of links they would expect for this anchor.                            MediaEval 2013. In Proceedings of the MediaEval 2013
   Users were then instructed to define queries that they ex-          Workshop, Barcelona, Spain, 2013.
pected to have more than one relevant video segment in the         [5] F. Eyben, F. Weninger, F. Gross, and B. Schuller.
collection. These queries consisted of several terms, and              Recent developments in opensmile, the munich
were used as input to a standard online search engine, e.g.            open-source multimedia feature extractor. In
“sightseeing london”. The study resulted in 36 ad-hoc search           Proceedings of ACM Multimedia 2013, pages 835–838,
queries for the test set. The development set for the task             Barcelona, Spain.
consisted of 50 known-item queries from the MediaEval 2013         [6] J.-L. Gauvain. The Quaero Program: Multilingual and
Search and Hyperlinking task.                                          Multimedia Technologies. In Proceedings of IWSLT
   Subsequently, as in the 2013 studies, we asked the partici-         2010, Paris, France, 2010.
pants to mark so-called anchors, or segments they would like       [7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI
to see links to, within some the segments that are relevant            Broadcast News transcription system. Speech
to the issued search queries. The reader can find a more               Communication, 37(1-2):89–108, 2002.
elaborate description of this user study design in [2].            [8] P. Lanchantin, P. Bell, M. J. F. Gales, T. Hain,
                                                                       X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. S.
4.     REQUIRED RUNS SUBMISSIONS AND                                   Seigel, P. Swietojanski, and P. C. Woodland.
       EVALUATION PROCEDURE FOR THE                                    Automatic transcription of multi-genre media
                                                                       archives. In Proceedings of the First Workshop on
       SEARCH AND LINKING SUB-TASKS                                    Speech, Language and Audio in Multimedia
   For the 2014 task, as well as ad hoc search, we were inter-         (SLAM@INTERSPEECH), volume 1012 of CEUR
ested in cross-comparison of methods being applied across all          Workshop Proceedings, pages 26–31. CEUR-WS.org,
four provided transcripts: one manual and 3 ASR. Thus, we              2013.
allowed participants to submit up to 5 different approaches        [9] K. McGuinness, R. Aly, K. Chatfield, O. Parkhi,
or their combinations, each being tested on all four tran-             R. Arandjelovic, M. Douze, M. Kemman, M. Kleppe,
scripts, for both sub-tasks. In case any of the groups based           P. van der Kreeft, K. Macquarrie, A. Ozerov, N. E.
their methods on video features only, they could submit this           O’Connor, F. De Jong, A. Zisserman, C. Schmid, and
type of run in addition as well.                                       P. Perez. The AXES research video search system. In
   To evaluate the submissions of the search and linking sub-          Proceedings of the IEEE ICASSP 2014.
tasks a pooling method was used to select submitted seg-          [10] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing
ments and link targets for relevance assessment. The top-N             the ted-lium corpus with selected data for language
ranks of all submitted runs were evaluated using crowdsourc-           modeling and more ted talks. In The 9th edition of the
ing technologies. We report precision oriented metrics, such           Language Resources and Evaluation Conference
as precision at various cutoffs and mean average precision             (LREC 2014), Reykjavik, Iceland, May 2014.
(MAP), using different approaches to take into account seg-       [11] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed
ment overlap, as described in [1].                                     for cross-dataset analysis. CoRR, abs/1402.5923, 2014.
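As a sketch of how segment overlap can enter such metrics (the precise variants are defined in [1]), the following Python marks a retrieved segment as relevant when it overlaps in time with a judged relevant segment of the same video, and computes precision at a cutoff; the names and the simple overlap rule shown here are illustrative assumptions, not the task's official scoring code:

    # Segments are (video_id, start_seconds, end_seconds) tuples.
    def overlaps(seg, judged):
        # Same video and a non-empty temporal intersection.
        return seg[0] == judged[0] and seg[1] < judged[2] and judged[1] < seg[2]

    def precision_at_n(ranked, relevant, n=10):
        # Fraction of the top-n retrieved segments that overlap any
        # judged relevant segment.
        hits = sum(1 for seg in ranked[:n] if any(overlaps(seg, j) for j in relevant))
        return hits / float(n)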
5. ACKNOWLEDGEMENTS
This work was supported by Science Foundation Ireland (Grants 08/RFP/CMS1677 and 12/CE/I2267) and the Research Frontiers Programme 2008 (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) project at DCU, and by funding from the European Commission's 7th Framework Programme (FP7) under AXES ICT-269980. The user studies were executed in collaboration with Jana Eggink and Andy O'Dwyer from BBC Research, to whom the authors are grateful.

6. REFERENCES
[1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical Report 1312.1913, ArXiv e-prints, 2013.
[2] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In WWW (Companion Volume), pages 457–460, 2013.
[3] K. Chatfield and A. Zisserman. VISOR: Towards on-the-fly large-scale object category retrieval. In Computer Vision – ACCV 2012, pages 432–446. Springer, 2013.
[4] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[5] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of ACM Multimedia 2013, pages 835–838, Barcelona, Spain, 2013.
[6] J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia Technologies. In Proceedings of IWSLT 2010, Paris, France, 2010.
[7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.
[8] P. Lanchantin, P. Bell, M. J. F. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. S. Seigel, P. Swietojanski, and P. C. Woodland. Automatic transcription of multi-genre media archives. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM@INTERSPEECH), volume 1012 of CEUR Workshop Proceedings, pages 26–31. CEUR-WS.org, 2013.
[9] K. McGuinness, R. Aly, K. Chatfield, O. Parkhi, R. Arandjelovic, M. Douze, M. Kemman, M. Kleppe, P. van der Kreeft, K. Macquarrie, A. Ozerov, N. E. O'Connor, F. De Jong, A. Zisserman, C. Schmid, and P. Perez. The AXES research video search system. In Proceedings of IEEE ICASSP 2014.
[10] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 2014.
[11] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed for cross-dataset analysis. CoRR, abs/1402.5923, 2014.