<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Working Notes for the Placing Task at MediaEval 2011</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pascal Kelm</string-name>
          <email>kelm@nue.tu-berlin.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Rae</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Serdyukov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanessa Murdock</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yahoo! Research</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Yandex</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Universität Berlin</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Yahoo! Research</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>This paper provides a description of the MediaEval 2011 Placing Task. The task requires participants to automatically assign latitude and longitude coordinates to each of the provided test videos. This kind of geographical location tag, or geotag, helps users localise videos, allowing their media to be anchored to real-world locations. Currently, however, most videos online are not labelled with this kind of data. This task encourages participants to find innovative ways of doing this labelling automatically. The data comes from Flickr, an example of a photo sharing website that allows users both to encode their photos and videos with geotags and to use them when searching and browsing. This paper describes the task, the data sets provided and how the individual participants' results are evaluated.</p>
      </abstract>
      <kwd-group>
        <kwd>Geotags</kwd>
        <kwd>Location</kwd>
        <kwd>Video Annotation</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>This task invites participants to propose new and creative
approaches to tackling the problem of automatic
annotation of video with geotags. These tags are usually added
in one of two ways: by the photo device (e.g. camera or
camera-equipped mobile phone) or manually by the user.
An increasing number of devices are becoming available that
can automatically encode geotags, using the Global Positioning
System, mobile cell towers or look-up of the coordinates of
local Wi-Fi networks. Users are also becoming more aware
of the value of adding such data manually, as shown by the
increase in photo management software and websites that
allow users to annotate, browse and search according to
location (e.g. Flickr, Apple's iPhoto and Aperture, Google
Picasa WebAlbums).</p>
      <p>This work was supported by the European Commission
under contract FP7-248984 GLOCAL.</p>
      <p>However, newly uploaded digital media, and videos in
particular, that carry any form of geographical coordinates are still
relatively rare compared to the total quantity uploaded. There
is also a significant amount of data that has already been
uploaded that does not currently have geotags.</p>
      <p>
        This task challenges participants to develop techniques to
automatically annotate videos using their visual content and
some selected, associated textual metadata. In particular,
we wish to see those taking part extend and improve upon
the work of previous tasks at MediaEval and elsewhere in
the community [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref6 ref7">6, 2, 1, 3, 7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA</title>
      <p>
        The data set is an extension of the MediaEval 2010
Placing Task data set [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and contains a set of geotagged Flickr
videos as well as the metadata for geotagged Flickr images.
A set of basic visual features extracted for all images and
for the frames of the videos is provided to participants. All
selected videos and images are shared by their owners under
the Creative Commons license.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Development data</title>
      <p>The development data pools the development
and test data from the MediaEval 2010 Placing Task.
We include as much metadata as is publicly accessible, to
give participants a variety of information sources
for use when predicting locations. This includes the title,
tags (labelled Keywords in the provided metadata files),
description and comments. We also include information about
the user who uploaded each video: their
contacts, the images they have labelled as favourites and the list of all
videos they have uploaded in the past.</p>
      <p>It should be emphasised that the task requires the
participants to predict the latitude and longitude for each video.
The prediction of the names of locations or other geographic
context information is outside the scope of this task.</p>
      <p>The development set comes with the ground truth values for
each video. This information is contained in the metadata
in the field &lt;Location&gt;.</p>
      <p>Frames are extracted at 4-second intervals from the videos
and saved as individual JPEG-format images, using the freely
available ffmpeg tool.</p>
      <p>2.1.2 Flickr images</p>
      <p>For development purposes, we distribute metadata for 3,185,258
Flickr photos uniformly sampled from all parts of the world,
using geographic bounding boxes of various sizes via the
Flickr API (http://www.flickr.com/services/api/). Whilst
the images themselves are not distributed in this task, they
are publicly accessible on Flickr (if they have not been
removed since the data set was gathered) and the provided
metadata contains links to the source images.</p>
      <p>The existing metadata of these images is extracted.
Most, but not all, photos have textual tags. All photos have
geotags of at least region level accuracy. The accuracy
attribute encodes the zoom level the uploader used when
placing the photo on a map. There are 16 zoom levels and
hence 16 accuracy levels (e.g., 3 - country level, 6 - region
level, 12 - city level, 16 - street level).</p>
      <p>While these images and their metadata are potentially
helpful for development purposes, the evaluation test set
includes only videos.</p>
      <p>
        We also generated visual feature descriptors for the extracted
video keyframes and training images, using the open source
library LIRE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] available online, with the default
parameter settings and the default image size of 500 pixels on the
longest side. This feature set comprises the following:
      </p>
      <list list-type="bullet">
        <list-item><p>Colour and Edge Directivity Descriptor</p></list-item>
        <list-item><p>Gabor Texture</p></list-item>
        <list-item><p>Fuzzy Colour and Texture Histogram</p></list-item>
        <list-item><p>Colour Histogram</p></list-item>
        <list-item><p>Scalable Colour</p></list-item>
        <list-item><p>Auto Colour Correlogram</p></list-item>
        <list-item><p>Tamura Texture</p></list-item>
        <list-item><p>Edge Histogram</p></list-item>
        <list-item><p>Colour Layout</p></list-item>
      </list>
      <p>
        The Scalable Colour, Edge Histogram and Colour Layout
features are implemented as specified in the MPEG-7 schema [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
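      <p>The keyframe sampling used for the videos in this section (one frame every 4 seconds, saved as JPEG) can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes the extraction tool is ffmpeg and uses its fps video filter, and the function names, the output file pattern and the convention that sampling starts at time zero are hypothetical choices, not the settings actually used to build the data set.</p>
      <preformat><![CDATA[
```python
import math


def nominal_sample_times(duration_s, interval_s=4.0):
    """Nominal sampling instants 0, interval, 2*interval, ... below duration_s.

    Assumption: sampling starts at t = 0; the exact frames an extraction
    tool picks may differ slightly.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    count = max(0, math.ceil(duration_s / interval_s))
    return [i * interval_s for i in range(count)]


def ffmpeg_frame_command(video_path, out_dir, interval_s=4):
    """Build (but do not run) an ffmpeg call that writes one JPEG per interval."""
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps=1/{interval_s}",  # fps filter: one output frame every interval_s seconds
        f"{out_dir}/frame_%05d.jpg",   # hypothetical output naming pattern
    ]
```
]]></preformat>
      <p>The returned argument list could be passed to subprocess.run on a machine where ffmpeg is installed; building the list separately keeps the sketch testable without the tool.</p>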
    </sec>
    <sec id="sec-4">
      <title>3. GROUND TRUTH AND EVALUATION</title>
      <p>The geo-coordinates associated with the Flickr videos will be
used as the ground truth. Since these do not always serve to
precisely pinpoint the location of a video, the evaluation will
be carried out at each of a series of widening circles, with radii of 1 km,
10 km, 100 km, 1000 km and 10000 km. If a reported location is
found within a circle of a given radius around the ground truth, it is counted as correctly
localised at that radius. The accuracy over each circle will be reported.
The orthodromic and Euclidean distances between the ground
truth coordinates and those reported by participants will
also be calculated.</p>
      <p>We are also interested in the issue of videos that have been
uploaded by an uploader who was unseen in the
development (i.e., training) data. In order to examine this issue, we
calculate a second set of scores over the part of the test data
containing only unseen uploaders.</p>
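      <p>As an illustration of the scoring described in this section, the following Python sketch computes the orthodromic (great-circle) distance with the haversine formula, the fraction of videos localised within each radius, and the subset of test videos from unseen uploaders. It is a minimal sketch rather than the official evaluation script: the function names, the dictionary-based input format, the mean Earth radius of 6371 km and the treatment of videos without a prediction as misses are all assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius; the task text fixes no value
RADII_KM = (1, 10, 100, 1000, 10000)  # evaluation radii named in the text


def orthodromic_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_by_radius(truth, predicted, radii_km=RADII_KM):
    """Fraction of videos whose prediction lies within each radius of the truth.

    truth and predicted map a video id to a (latitude, longitude) pair.
    Assumption: a video without a prediction counts as a miss at every radius.
    """
    if not truth:
        raise ValueError("ground truth is empty")
    hits = {r: 0 for r in radii_km}
    for video_id, (lat, lon) in truth.items():
        if video_id not in predicted:
            continue
        p_lat, p_lon = predicted[video_id]
        distance = orthodromic_km(lat, lon, p_lat, p_lon)
        for r in radii_km:
            if distance <= r:
                hits[r] += 1
    return {r: hits[r] / len(truth) for r in radii_km}


def unseen_subset(truth, uploader_of, development_uploaders):
    """Keep only test videos whose uploader is absent from the development data."""
    return {v: c for v, c in truth.items()
            if uploader_of[v] not in development_uploaders}
```
]]></preformat>
      <p>Calling accuracy_by_radius on the full ground truth and again on the output of unseen_subset would give the two sets of scores mentioned above.</p>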
    </sec>
    <sec id="sec-5">
      <title>4. TASK DETAILS</title>
      <p>Participants may submit between three and five runs. They
can make use of image metadata and audio and visual
features, as well as external resources, depending on the run.
A minimum of one run that uses only audio/visual features
is required. The other two required runs allow for the free
use of the provided data (but no other), with the
option of using a gazetteer or not. Participants may submit an
optional additional run that uses a gazetteer, as well as an
optional run that allows for the crawling of additional material
from outside of the provided data (the general run).</p>
      <p>Participants are not allowed to re-find the provided videos
online and use actual geotags (or other related data) for
preparing their runs. This is to ensure that participants help
contribute to a realistic and sensible benchmark in which all
test videos are "unseen". The participants are also asked
not to crawl Flickr for any additional videos or images and to use
only those provided in the data sets (with an exception made
for the optional general run).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Efros</surname>
          </string-name>
          .
          <article-title>Im2gps: estimating geographic information from a single image</article-title>
          .
          <source>In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          , June
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kelm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmiedeke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>Multi-modal, multi-resource methods for placing flickr videos on the map</article-title>
          .
          <source>In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR '11</source>
          , pages
          <fpage>52:1</fpage>
          –
          <lpage>52:8</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rudinac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wartena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Automatic tagging and geotagging in video collections and communities</article-title>
          .
          <source>In ACM International Conference on Multimedia Retrieval (ICMR 2011)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Chatzichristofis</surname>
          </string-name>
          .
          <article-title>Lire: lucene image retrieval: an extensible java cbir library</article-title>
          .
          <source>In Proceeding of the 16th ACM international conference on Multimedia, MM '08</source>
          , pages
          <fpage>1085</fpage>
          –
          <lpage>1088</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ohm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Yamada</surname>
          </string-name>
          .
          <article-title>Color and texture descriptors</article-title>
          .
          <source>IEEE Transactions on circuits and systems for video technology</source>
          ,
          <volume>11</volume>
          (
          <issue>6</issue>
          ):
          <fpage>703</fpage>
          –
          <lpage>715</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>van Zwol</surname>
          </string-name>
          .
          <article-title>Placing flickr photos on a map</article-title>
          .
          <source>In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09</source>
          , pages
          <fpage>484</fpage>
          –
          <lpage>491</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Finding locations of flickr resources using language models and similarity search</article-title>
          .
          <source>In Proceedings of the 1st ACM International Conference on Multimedia Retrieval</source>
          , page
          <fpage>48</fpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>