<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SAVA at MediaEval 2015: Search and Anchoring in Video Archives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Eskevich</string-name>
          <email>maria.eskevich@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robin Aly</string-name>
          <email>r.aly@ewi.utwente.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roeland Ordelman</string-name>
          <email>ordelman@ewi.utwente.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David N. Racca</string-name>
          <email>dracca@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shu Chen</string-name>
          <email>shu.chen4@mail.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J.F. Jones</string-name>
          <email>gjones@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, School of Computing, Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Twente</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>The Search and Anchoring in Video Archives (SAVA) task at MediaEval 2015 consists of two sub-tasks: (i) search for multimedia content within a video archive using multimodal queries referring to information contained in the audio and visual streams, and (ii) automatic selection of video segments within a list of videos that can be used as anchors for further hyperlinking within the archive. The task used a collection of roughly 2700 hours of BBC broadcast TV material for the former sub-task, and about 70 files taken from this collection for the latter sub-task. The search sub-task is based on an ad-hoc retrieval scenario, and is evaluated using a pooling procedure across participants' submissions with crowdsourced relevance assessment using Amazon Mechanical Turk (MTurk). The evaluation used metrics that are variations of MAP adjusted for this task. For the anchor selection sub-task, overlapping regions of interest across participants' submissions were assessed by MTurk workers, and mean reciprocal rank (MRR), precision, and recall were calculated for evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Current developments in the technologies for recording
and storing multimedia content are leading to very rapid
growth in the resulting multimedia archives. Moreover,
content created in previous decades is being digitised and
added to this contemporary material. This stored
information can potentially be used by a wide variety of users,
including multimedia professionals, e.g. archivists and
journalists, and the general public. We envisage the main aim of
the SAVA task as assisting these different users in their
interaction with the available collections by facilitating efficient
access to relevant content. The solutions to the challenges
of the SAVA task should help users: 1) to retrieve
interesting parts of archived multimedia documents when
issuing audio-visual queries to a search system; 2) to
improve the browsing aspect of this activity by providing users
with content that has pre-defined or changing on-the-fly
anchor points that can lead them to further discoveries on
topics of interest within the collection. Thus the SAVA task
consists of two sub-tasks: Search for multimedia content and
Automatic anchor selection.
Search for multimedia content: This promotes the
development of search methods that use multiple
modalities (e.g., speech, visual content, speaker emotions,
etc.) to answer search queries by returning relevant
video segments of unrestricted size. Similar to the
earlier MediaEval 2013 Search &amp; Hyperlinking edition
of this sub-task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], participants were provided with a
two-fielded query, where one field refers to the spoken
content and the other refers to the visual content of
relevant segments. Participants could use either or both
fields to find video segments within the collection (a
minimal query sketch follows at the end of this section).
Automatic anchor selection: This explores
methods to automatically identify anchors for a given set
of videos, where anchors are media fragments (with
their boundaries defined by their start and end time)
for which users could require additional information.
What constitutes an anchor depends on the video, e.g.,
in a news programme it could be a mention of persons,
and in a documentary it could be the view of particular
buildings. Participants were provided with a number
of videos of different types and were requested to
automatically identify anchors within these videos.
      </p>
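      <p>For illustration, a two-fielded query could be represented as
follows. This is a minimal sketch with an invented query; the
identifier, field names, and text are illustrative only, not the
official distribution format of the task:</p>
      <preformat><![CDATA[
# Hypothetical representation of a two-fielded SAVA search query.
# The identifier, field names, and text are invented for illustration.
query = {
    "id": "sava15_example",
    "speech": "interview about restoring a historic steam locomotive",
    "visual": "close-up shots of engine parts in a workshop",
}

# A system may search with either field alone or combine both,
# e.g. matching "speech" against ASR transcripts and "visual"
# against concept detector labels.
]]></preformat>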
    </sec>
    <sec id="sec-2">
      <title>2. EXPERIMENTAL DATASET</title>
      <p>The dataset for both sub-tasks is a collection of 4021 hours
of videos provided by the BBC, which is split into a
development set of 1335 hours and a test set of 2686 hours.
The average length of a video was roughly 45 minutes, and
most videos were in the English language. The collection
comprises broadcast content from the date spans
01.04.2008–11.05.2008 and 12.05.2008–31.07.2008 for the
development and test sets respectively. The BBC kindly provided
human-generated textual metadata and manual transcripts for each
video. Participants were also provided with the output of
several content analysis methods, which we describe in the
following subsections.</p>
      <p>Although both sub-tasks are based on the same
collection, they use different sets of videos within each sub-task
framework. For both development and testing of systems
within the `Search for multimedia content' sub-task, the
participants used the test set of the video collection. The
videos for `Automatic anchor selection' were taken from
both the development and test sets of the video collection in
order to have a uniform representation of the files containing
previously defined, manually created anchors that were used
for sub-task assessment.</p>
    </sec>
    <sec id="sec-2-1">
      <title>2.1 Audio Content</title>
      <p>The audio was extracted from the video stream using the
FFmpeg software toolbox (sample rate = 16,000 Hz, number of
channels = 1); a minimal extraction sketch is given at the end
of this subsection. Based on these data, transcripts were
created using the following ASR approaches and provided
to participants:</p>
      <p>
        The LIMSI-CNRS/Vocapia transcripts (http://www.vocapia.com/),
produced using the VoxSigma vrbs_trans system
(version eng-usa 4.0) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The LIUM system2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], is based on the CMU Sphinx
project. The LIUM system provided three output
formats: (1) one-best transcripts in NIST CTM format,
(2) word lattices in SLF (HTK) format, following a
4gram topology, and (3) confusion networks in a format
similar to ATT FSM.
      </p>
      <p>
        The NST/Sheffield system
(http://www.natural-speech-technology.org) is trained on multi-genre
sets of BBC data that do not overlap with the
collection used for the task, and uses deep neural
networks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The ASR transcript contains speaker
diarization, similar to the LIMSI-CNRS/Vocapia
transcripts.
      </p>
      <p>
        Additionally, prosodic features were extracted using the
OpenSMILE tool, version 2.0 rc1
(http://opensmile.sourceforge.net/) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The following prosodic features were calculated over
sliding windows of 10 milliseconds: root mean squared (RMS)
energy, loudness, probability of voicing, fundamental
frequency (F0), harmonics-to-noise ratio (HNR), voice quality,
and pitch direction (classes falling, flat, and rising, plus a
direction score).
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.2 Visual Content</title>
      <p>
        The computer vision groups at the University of Leuven (KUL)
and the University of Oxford (OXU) provided the output of
concept detectors for 1,537 concepts from ImageNet
(http://image-net.org/popularity_percentile_readme.html), using
different training approaches. The approach by KUL uses
examples from ImageNet as positive examples [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], while OXU
uses an on-the-fly concept detection approach, which
downloads training examples through Google image search [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A sketch of how such detector output might be consumed
is given below.
      </p>
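      <p>The helper below is a hedged sketch of how a participant might
load such detector output. It assumes a hypothetical tab-separated
format of one detection per line (video identifier, shot start and
end in seconds, concept label, confidence score); the actual
distribution format may differ:</p>
      <preformat><![CDATA[
import csv
from collections import defaultdict

def load_concept_scores(tsv_path, threshold=0.5):
    """Read per-shot concept detections from a hypothetical TSV file
    and keep confident ones, grouped by video identifier.

    Assumed columns: video_id, start_sec, end_sec, concept, score."""
    detections = defaultdict(list)
    with open(tsv_path, newline="") as f:
        for video_id, start, end, concept, score in csv.reader(f, delimiter="\t"):
            if float(score) >= threshold:
                detections[video_id].append(
                    (float(start), float(end), concept, float(score)))
    return detections
]]></preformat>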
    </sec>
    <sec id="sec-4">
      <title>3. TASK INPUT DEFINITION</title>
      <p>As we assumed that both types of user activities behind
the sub-tasks' frameworks can be carried out by both
professionals and the general public, we involved representatives of
both user categories in the ground truth creation:
Search for multimedia content: 9 development set
and 30 test set queries were defined by professionals
with the following profile: 1) they worked in the field,
e.g. as journalists, archivists, etc.; 2) they were
native English speakers; and 3) they were generally
familiar with BBC content. For each query in the
development set these users defined two relevant video
segments in order to ensure the existence of potentially
relevant content for an ad hoc search.</p>
      <p>
        Automatic anchor selection: We used the video
files containing the manually defined anchors in the
2013–2014 Search &amp; Hyperlinking tasks [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]: 42 and 33
files respectively for the development and testing of the
approaches. The users represented the general public:
they had to be 18–30 years old and had to use search
engines and services such as YouTube on a daily basis.
The anchors provided in this ground truth are by no
means exhaustive; they only exemplify potential
anchors that can be defined within a given video.
      </p>
      <p>
        A more elaborate description of the user study design and
of the anchor definition procedure can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
respectively.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4. REQUIRED RUNS</title>
      <p>As our evaluation makes use of cross-comparison between
runs, we did not limit participants in the number of
submissions for either of the sub-tasks. However, we stated that,
due to finite resources, only a limited number of runs would
be assessed through crowdsourcing.</p>
    </sec>
    <sec id="sec-6">
      <title>RELEVANCE ASSESSMENT AND EVAL</title>
    </sec>
    <sec id="sec-7">
      <title>UATION METRICS</title>
      <p>
        To evaluate the submissions of the search sub-task, First,
the runs were normalised: videos with corrupted
audiovisual content due to bugs in the employed software mpeg
were dismissed, segments shorter than 10 seconds were
expanded to this length, segments longer than 2 minutes were
cut after this length (using the original's segment start),
and segments overlapping with previously returned segments
were adjusted to remove the overlap. Second, we used the
pooling method with selected runs. Third, the top 10 ranks
of all submitted runs were evaluated using crowdsourcing
technologies. We report precision oriented metrics, such
as precision at various cuto s and mean average precision
(MAP), using di erent approaches to take into account
segment overlap, as described in [
        <xref ref-type="bibr" rid="ref1 ref10">1, 10</xref>
        ].
      </p>
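      <p>A minimal sketch of the normalisation step is shown below.
The exact overlap adjustment applied by the organisers is not
reproduced here; this hypothetical helper illustrates the stated
rules for the rank-ordered (start, end) segments, in seconds, that
a run returns for a single video:</p>
      <preformat><![CDATA[
MIN_LEN, MAX_LEN = 10.0, 120.0  # seconds

def normalise_run(segments):
    """Apply the length and overlap rules to the rank-ordered
    (start, end) segments, in seconds, returned for one video.
    A hypothetical illustration, not the organisers' exact code."""
    kept = []  # higher-ranked segments, already normalised
    for start, end in segments:
        if end - start < MIN_LEN:
            end = start + MIN_LEN   # expand short segments to 10 s
        if end - start > MAX_LEN:
            end = start + MAX_LEN   # truncate to 2 min from the start
        for k_start, k_end in kept:  # remove overlap with earlier ranks
            if start < k_end and end > k_start:
                if start >= k_start:
                    start = k_end            # start after the overlap
                else:
                    end = min(end, k_start)  # end before the overlap
        if end > start:
            kept.append((start, end))
    return kept
]]></preformat>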
      <p>For the anchoring sub-task, we used the top-25 ranks of
all submissions and merged overlapping segments. The
resulting segments were judged by MTurk workers, who gave
their opinion on these segments taken from the context of
the videos. For MRR, recall, and precision, a result segment
in a run is judged relevant if it overlaps with a relevant
combined segment; a sketch of this scoring is given below.</p>
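      <p>The overlap-based scoring can be sketched as follows. This is
a hypothetical illustration of the stated relevance rule, where a
run is the rank-ordered list of (start, end) segments submitted for
one video and the relevant segments come from the merged,
assessed pool:</p>
      <preformat><![CDATA[
def overlaps(a, b):
    """True if two (start, end) segments share any time span."""
    return a[0] < b[1] and b[0] < a[1]

def reciprocal_rank(run, relevant):
    """1/rank of the first returned segment that overlaps a relevant
    combined segment, or 0.0 if none does."""
    for rank, seg in enumerate(run, start=1):
        if any(overlaps(seg, rel) for rel in relevant):
            return 1.0 / rank
    return 0.0

def precision_recall(run, relevant):
    """Precision over returned segments and recall over relevant
    combined segments, both under the same overlap criterion."""
    hits = sum(1 for seg in run if any(overlaps(seg, rel) for rel in relevant))
    covered = sum(1 for rel in relevant if any(overlaps(rel, seg) for seg in run))
    precision = hits / len(run) if run else 0.0
    recall = covered / len(relevant) if relevant else 0.0
    return precision, recall
]]></preformat>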
    </sec>
    <sec id="sec-8">
      <title>6. SUMMARY AND CONCLUSIONS</title>
      <p>This paper describes the setup of the search and anchoring
sub-tasks at MediaEval 2015. While the definition of the
search sub-task builds on the experience of several years, the
anchoring sub-task was new in 2015. Here, we have described
the data provided to the task participants and the methods used
to generate the input data and to evaluate submitted results.</p>
    </sec>
    <sec id="sec-9">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This work was supported by the European Commission's
7th Framework Programme (FP7) under FP7-ICT 269980
(AXES) and FP7-ICT 287911 (LinkedTV); Bpifrance within
the NexGen-TV Project, under grant number F1504054U;
the Dutch national program COMMIT/; Science
Foundation Ireland (Grant No 12/CE/I2267) as part of the Centre
for Next Generation Localisation (CNGL) project at DCU.
The user studies were executed in collaboration with Jana
Eggink and Andy O'Dwyer from BBC Research, to whom
the authors are grateful.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks</article-title>
          .
          <source>Technical Report arXiv:1312.1913</source>
          , ArXiv e-prints,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Linking inside a video collection - what and how to measure?</article-title>
          <source>In Proceedings of the 22nd International Conference on World Wide Web Companion</source>
          , IW3C2 2013, Rio de Janeiro, Brazil, pages
          <fpage>457</fpage>
          –
          <lpage>460</lpage>
          , May
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatfield</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>VISOR: Towards on-the-fly large-scale object category retrieval</article-title>
          .
          <source>In Computer Vision – ACCV 2012</source>
          , pages
          <fpage>432</fpage>
          –
          <lpage>446</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking Task at MediaEval 2013</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          , Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking task at MediaEval 2014</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Catalunya, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In Proceedings of ACM Multimedia 2013</source>
          , pages
          <fpage>835</fpage>
          –
          <lpage>838</lpage>
          , Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Adda</surname>
          </string-name>
          .
          <article-title>The LIMSI Broadcast News transcription system</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>37</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>89</fpage>
          –
          <lpage>108</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quinnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Seigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Swietojanski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Woodland</surname>
          </string-name>
          .
          <article-title>Automatic transcription of multi-genre media archives</article-title>
          .
          <source>In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM@INTERSPEECH)</source>
          , volume
          <volume>1012</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , pages
          <fpage>26</fpage>
          –
          <lpage>31</lpage>
          . CEUR-WS.org,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. J. F.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Defining and evaluating video hyperlinking for navigating multimedia archives</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web (WWW 2015), Companion Volume</source>
          , Florence, Italy, pages
          <fpage>727</fpage>
          –
          <lpage>732</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Evaluating Search and Hyperlinking: an example of the design, test, refine cycle for metric development</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deléglise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Estève</surname>
          </string-name>
          .
          <article-title>Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks</article-title>
          .
          <source>In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC 2014)</source>
          , Reykjavik, Iceland, May
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tommasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          .
          <article-title>A testbed for cross-dataset analysis</article-title>
          .
          <source>CoRR, abs/1402.5923</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>