SAVA at MediaEval 2015: Search and Anchoring in Video Archives

Maria Eskevich (1), Robin Aly (2), Roeland Ordelman (2), David N. Racca (3), Shu Chen (3), Gareth J.F. Jones (3)
(1) EURECOM, Sophia Antipolis, France; (2) University of Twente, The Netherlands; (3) ADAPT Centre, School of Computing, Dublin City University, Ireland
maria.eskevich@gmail.com; {r.aly, ordelman}@ewi.utwente.nl; {dracca, gjones}@computing.dcu.ie; shu.chen4@mail.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
The Search and Anchoring in Video Archives (SAVA) task at MediaEval 2015 consists of two sub-tasks: (i) search for multimedia content within a video archive using multimodal queries referring to information contained in the audio and visual streams, and (ii) automatic selection of video segments within a list of videos that can be used as anchors for further hyperlinking within the archive. The task used a collection of roughly 2700 hours of BBC broadcast TV material for the former sub-task, and about 70 files taken from this collection for the latter sub-task. The search sub-task is based on an ad-hoc retrieval scenario, and is evaluated using a pooling procedure across participants' submissions with crowdsourced relevance assessment using Amazon Mechanical Turk (MTurk). The evaluation used metrics that are variations of MAP adjusted for this task. For the anchor selection sub-task, overlapping regions of interest across participants' submissions were assessed by MTurk workers, and mean reciprocal rank (MRR), precision, and recall were calculated for evaluation.

1. INTRODUCTION
Current developments in the technologies for recording and storing multimedia content are leading to very rapid growth in the resulting multimedia archives. Moreover, the digitisation of content created in previous decades is being added to this contemporary material. This stored information can potentially be used by a wide variety of users, including multimedia professionals, e.g. archivists and journalists, and the general public. We envisage the main aim of the SAVA task as assisting these different users in their interaction with the available collections by facilitating efficient access to relevant content. The solutions to the challenges of the SAVA task should help users: 1) to retrieve interesting parts of archived multimedia documents when issuing audio-visual queries to a search system; and 2) to improve the browsing aspect of this activity by providing users with content that has pre-defined or on-the-fly changing anchor points that can lead them to further discoveries on topics of interest within the collection. Thus the SAVA task consists of two sub-tasks:

• Search for multimedia content: This promotes the development of search methods that use multiple modalities (e.g., speech, visual content, speaker emotions, etc.) to answer search queries by returning relevant video segments of unrestricted size. Similar to the earlier edition of this sub-task in the MediaEval 2013 Search & Hyperlinking task [4], participants were provided with a two-field query, where one field refers to the spoken content and the other to the visual content of relevant segments. Participants could use either or both fields to find video segments within the collection.

• Automatic anchor selection: This explores methods to automatically identify anchors for a given set of videos, where anchors are media fragments (with their boundaries defined by their start and end time) for which users could require additional information. What constitutes an anchor depends on the video, e.g., in a news programme it could be a mention of persons, and in a documentary it could be the view of particular buildings. Participants were provided with a number of videos of different types and were requested to automatically identify anchors within these videos.

2. EXPERIMENTAL DATASET
The dataset for both sub-tasks is a collection of 4021 hours of videos provided by the BBC, split into a development set of 1335 hours and a test set of 2686 hours. The average length of a video was roughly 45 minutes, and most videos were in the English language. The collection consists of broadcast content spanning 01.04.2008 – 11.05.2008 and 12.05.2008 – 31.07.2008 for the development and test sets respectively. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of several content analysis methods, which we describe in the following subsections.

Although both sub-tasks are based on the same collection, each uses a different set of videos. For both development and testing within the ‘Search for multimedia content’ sub-task, participants used the test set of the video collection, while the videos for ‘Automatic anchor selection’ were taken from both the development and test sets in order to have a uniform representation of the files containing previously defined, manually created anchors that were used for sub-task assessment.

2.1 Audio Content
The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, no. of channels = 1).
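As a minimal sketch of this extraction step (assuming the ffmpeg command-line tool is installed and on the PATH; the function name and file names are ours, for illustration only):

```python
# Minimal sketch of the audio extraction step described above, assuming
# the ffmpeg binary is available on the PATH; file names are hypothetical.
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract a 16 kHz, single-channel WAV track from a video file."""
    subprocess.run(
        ["ffmpeg",
         "-i", video_path,  # input video
         "-vn",             # discard the video stream
         "-ar", "16000",    # sample rate = 16,000 Hz
         "-ac", "1",        # one audio channel (mono)
         wav_path],
        check=True,         # raise if ffmpeg reports an error
    )

extract_audio("programme.mp4", "programme.wav")
```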
Based on this data, transcripts were created using the following ASR approaches and provided to participants:

• LIMSI-CNRS/Vocapia (http://www.vocapia.com/), using the VoxSigma vrbs trans system (version eng-usa 4.0) [7].

• The LIUM system (http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst) [11], which is based on the CMU Sphinx project. The LIUM system provided three output formats: (1) one-best transcripts in NIST CTM format, (2) word lattices in SLF (HTK) format, following a 4-gram topology, and (3) confusion networks in a format similar to ATT FSM.

• The NST/Sheffield system (http://www.natural-speech-technology.org), which is trained on multi-genre sets of BBC data that do not overlap with the collection used for the task, and uses deep neural networks [8]. The ASR transcript contains speaker diarization, similar to the LIMSI-CNRS/Vocapia transcripts.

Additionally, prosodic features were extracted using the openSMILE tool, version 2.0 rc1 (http://opensmile.sourceforge.net/) [6]. The following prosodic features were calculated over sliding windows of 10 milliseconds: root mean squared (RMS) energy, loudness, probability of voicing, fundamental frequency (F0), harmonics-to-noise ratio (HNR), voice quality, and pitch direction (classes falling, flat, rising, plus a direction score).
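To illustrate what such frame-level features look like computationally, the following is a minimal sketch of one of them, RMS energy over 10 ms frames of the 16 kHz mono audio. This is not the openSMILE implementation (whose windowing and smoothing differ), and the function name is ours:

```python
# Illustrative sketch (not the openSMILE implementation) of one
# frame-level prosodic feature: RMS energy over 10 ms frames.
import numpy as np

def rms_energy(samples: np.ndarray, sample_rate: int = 16000,
               frame_ms: int = 10) -> np.ndarray:
    """Root mean squared energy per 10 ms frame of a mono signal."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 16 kHz
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
```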
2.2 Visual Content
The computer vision groups at the University of Leuven (KUL) and the University of Oxford (OXU) provided the output of concept detectors for 1,537 concepts from ImageNet (http://image-net.org/popularity_percentile_readme.html) using different training approaches. The approach by KUL uses examples from ImageNet as positive examples [12], while OXU uses an on-the-fly concept detection approach, which downloads training examples through Google image search [3].

3. TASK INPUT DEFINITION
As we assume that the user activities behind both sub-task frameworks can be carried out by professionals as well as a general audience, we involved representatives of both user categories in the ground truth creation:

• Search for multimedia content: 9 development set and 30 test set queries were defined by professionals with the following profile: 1) they worked in the field, e.g. as journalists or archivists; 2) they were native English speakers; and 3) they were generally familiar with BBC content. For each query in the development set, these users defined two relevant video segments in order to ensure the existence of potentially relevant content for an ad hoc search.

• Automatic anchor selection: We used the video files containing the anchors manually defined in the 2013-2014 Search & Hyperlinking tasks [4, 5]: 42 and 33 files for the development and testing of the approaches respectively. The users represented the general public: they had to be 18-30 years old and had to use search engines and services such as YouTube on a daily basis. The anchors provided in this ground truth are by no means exhaustive; they only exemplify potential anchors that can be defined within a given video.

A more elaborate description of the user study design and the anchor definition procedure can be found in [2] and [9] respectively.

4. REQUIRED RUNS
As our evaluation makes use of cross-comparison between runs, we did not limit the number of submissions for either of the sub-tasks. However, we stated that, due to finite resources, only a limited number of runs would be assessed through crowdsourcing.

5. RELEVANCE ASSESSMENT AND EVALUATION METRICS
To evaluate the submissions to the search sub-task, the runs were first normalised: videos whose audio-visual content was corrupted due to bugs in the employed ffmpeg software were dismissed; segments shorter than 10 seconds were expanded to this length; segments longer than 2 minutes were cut after this length (keeping the original segment start); and segments overlapping with previously returned segments were adjusted to remove the overlap. Second, we used the pooling method with selected runs. Third, the top 10 ranks of all submitted runs were evaluated using crowdsourcing. We report precision-oriented metrics, such as precision at various cutoffs and mean average precision (MAP), using different approaches to take segment overlap into account, as described in [1, 10].

For the anchoring sub-task, we used the top-25 ranks of all submissions and merged overlapping segments. The resulting segments were judged by MTurk workers, who gave their opinion on these segments in the context of the videos. For MRR and recall/precision, a result segment in a run is judged relevant if it overlaps with a relevant combined segment.
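The following is a simplified sketch of these segment-level operations, assuming segments are represented as (start, end) pairs in seconds. The function names are ours, and the organisers' exact normalisation and judging rules may differ in detail:

```python
# Simplified sketch of the segment-level evaluation steps; segments are
# (start, end) pairs in seconds, and runs are lists in ranked order.
from typing import List, Tuple

Segment = Tuple[float, float]

MIN_LEN, MAX_LEN = 10.0, 120.0  # 10 seconds and 2 minutes

def normalise_run(run: List[Segment]) -> List[Segment]:
    """Expand short segments, truncate long ones from their original
    start, and trim overlap with segments returned earlier in the run."""
    kept: List[Segment] = []
    for start, end in run:
        if end - start < MIN_LEN:
            end = start + MIN_LEN
        if end - start > MAX_LEN:
            end = start + MAX_LEN
        for s, e in kept:
            if start < e and s < end:   # overlaps an earlier segment
                start = max(start, e)   # keep only the later, new part
        if start < end:
            kept.append((start, end))
    return kept

def merge_overlapping(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping anchor segments pooled across submissions."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def reciprocal_rank(run: List[Segment], relevant: List[Segment]) -> float:
    """1/rank of the first result overlapping a relevant combined
    segment; 0.0 if no result in the run overlaps any such segment."""
    for rank, (start, end) in enumerate(run, start=1):
        if any(start < e and s < end for s, e in relevant):
            return 1.0 / rank
    return 0.0
```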
6. SUMMARY AND CONCLUSIONS
This paper describes the setup of the search and anchoring sub-tasks at MediaEval 2015. While the definition of the search sub-task builds on the experience of several previous years, the anchoring sub-task was new in 2015. Here, we have described the data provided to the task participants and the methods used to generate the input data and to evaluate the submitted results.

7. ACKNOWLEDGMENTS
This work was supported by the European Commission's 7th Framework Programme (FP7) under FP7-ICT 269980 (AXES) and FP7-ICT 287911 (LinkedTV); by Bpifrance within the NexGen-TV Project, under grant number F1504054U; by the Dutch national programme COMMIT/; and by Science Foundation Ireland (Grant No. 12/CE/I2267) as part of the Centre for Next Generation Localisation (CNGL) project at DCU. The user studies were executed in collaboration with Jana Eggink and Andy O'Dwyer from BBC Research, to whom the authors are grateful.

8. REFERENCES
[1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical Report 1312.1913, ArXiv e-prints, 2013.
[2] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection - what and how to measure? In Proceedings of the 22nd International Conference on World Wide Web Companion, IW3C2 2013, Rio de Janeiro, Brazil, pages 457–460, May 2013.
[3] K. Chatfield and A. Zisserman. VISOR: Towards on-the-fly large-scale object category retrieval. In Computer Vision - ACCV 2012, pages 432–446. Springer, 2013.
[4] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[5] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, 2014.
[6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of ACM Multimedia 2013, pages 835–838, Barcelona, Spain, 2013.
[7] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.
[8] P. Lanchantin, P. Bell, M. J. F. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. S. Seigel, P. Swietojanski, and P. C. Woodland. Automatic transcription of multi-genre media archives. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM@INTERSPEECH), volume 1012 of CEUR Workshop Proceedings, pages 26–31. CEUR-WS.org, 2013.
[9] R. J. F. Ordelman, M. Eskevich, R. Aly, B. Huet, and G. J. F. Jones. Defining and evaluating video hyperlinking for navigating multimedia archives. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy - Companion Volume, pages 727–732, 2015.
[10] D. N. Racca and G. J. F. Jones. Evaluating Search and Hyperlinking: an example of the design, test, refine cycle for metric development. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[11] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 2014.
[12] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed for cross-dataset analysis. CoRR, abs/1402.5923, 2014.