Working Notes for the Placing Task at MediaEval 2011*

Adam Rae (Yahoo! Research, adamrae@yahoo-inc.com)
Vanessa Murdock (Yahoo! Research, vmurdock@yahoo-inc.com)
Pavel Serdyukov (Yandex, pavser@yandex-team.ru)
Pascal Kelm (Technische Universität Berlin, kelm@nue.tu-berlin.de)

* This work was supported by the European Commission under contract FP7-248984 GLOCAL.

ABSTRACT
This paper provides a description of the MediaEval 2011 Placing Task. The task requires participants to automatically assign latitude and longitude coordinates to each of the provided test videos. This kind of geographical location tag, or geotag, helps users localise videos, allowing their media to be anchored to real-world locations. Currently, however, most videos online are not labelled with this kind of data. This task encourages participants to find innovative ways of doing this labelling automatically. The data comes from Flickr, an example of a photo-sharing website that allows users both to encode their photos and videos with geotags and to use those geotags when searching and browsing. This paper describes the task, the data sets provided and how the individual participants' results are evaluated.

Keywords
Geotags, Location, Video Annotation, Benchmark

1. INTRODUCTION
This task invites participants to propose new and creative approaches to tackling the problem of automatic annotation of video with geotags. These tags are usually added in one of two ways: by the photo device (e.g. a camera or camera-equipped mobile phone) or manually by the user. An increasing number of devices are becoming available that can automatically encode geotags, using the Global Positioning System, mobile cell towers or look-up of the coordinates of local Wi-Fi networks. Users are also becoming more aware of the value of adding such data manually, as shown by the growing number of photo management applications and websites that allow users to annotate, browse and search according to location (e.g. Flickr, Apple's iPhoto and Aperture, Google Picasa Web Albums). However, newly uploaded digital media, and videos in particular, that carry any form of geographical coordinates are still relatively rare compared to the total quantity uploaded. There is also a significant amount of previously uploaded data that does not currently have geotags.

This task challenges participants to develop techniques to automatically annotate videos using their visual content and some selected, associated textual metadata. In particular, we wish to see those taking part extend and improve upon the work of previous tasks at MediaEval and elsewhere in the community [6, 2, 1, 3, 7].

2. DATA
The data set is an extension of the MediaEval 2010 Placing Task data set [3] and contains a set of geotagged Flickr videos as well as the metadata for geotagged Flickr images. A set of basic visual features extracted for all images and for the frames of the videos is provided to participants. All selected videos and images are shared by their owners under the Creative Commons license.

2.1 Development data
The development data is the combination of the development and test data from the MediaEval 2010 Placing Task. The two sets are pooled to form the 2011 development set.

We include as much publicly accessible metadata as possible, giving participants a variety of information sources to draw on when predicting locations. This includes the title, tags (labelled Keywords in the provided metadata files), description and comments. We also include information about the user who uploaded the videos: their contacts, their favourite labelled images and the list of all videos they have uploaded in the past.

It should be emphasised that the task requires participants to predict the latitude and longitude for each video. The prediction of the names of locations or other geographic context information is outside the scope of this task. The development set comes with the ground truth values for each video; this information is contained in a dedicated field in the metadata.

2.1.1 Video keyframes
Frames are extracted at 4-second intervals from the videos and saved as individual JPEG-format images, using the freely available ffmpeg tool (http://www.ffmpeg.org/).
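As a simple illustration of this step, the following Python sketch drives ffmpeg to perform the same 4-second sampling. It is our own example, not the organisers' extraction pipeline: the file names are hypothetical, and a reasonably recent ffmpeg build (one providing the fps filter) is assumed to be on the PATH.

import subprocess

def extract_keyframes(video_path, output_pattern):
    # fps=1/4 asks ffmpeg for one frame per 4 seconds of video;
    # frames are written as JPEGs following output_pattern.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1/4", output_pattern],
        check=True,
    )

# Hypothetical usage: writes frame_0001.jpg, frame_0002.jpg, ...
extract_keyframes("video.flv", "frame_%04d.jpg")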
2.1.2 Flickr images
For development purposes, we distribute metadata for 3,185,258 Flickr photos uniformly sampled from all parts of the world, using geographic bounding boxes of various sizes, via the Flickr API (http://www.flickr.com/services/api/). Whilst the images themselves are not distributed in this task, they are publicly accessible on Flickr (if they have not been removed since the data set was gathered) and the provided metadata contains links to the source images.

From these images, their existing metadata is extracted. Most, but not all, photos have textual tags. All photos have geotags of at least region-level accuracy. The accuracy attribute encodes the zoom level the uploader used when placing the photo on a map. There are 16 zoom levels and hence 16 accuracy levels (e.g. 3 = country level, 6 = region level, 12 = city level, 16 = street level).

While these images and their metadata are potentially helpful for development purposes, the evaluation test set contains only videos.

We also generated visual feature descriptors for the extracted video keyframes and training images, using the open-source library LIRE [4], available online (http://www.semanticmetadata.net/lire/), with the default parameter settings and the default image size of 500 pixels on the longest side. This feature set comprises the following:

• Colour and Edge Directivity Descriptor
• Gabor Texture
• Fuzzy Colour and Texture Histogram
• Colour Histogram
• Scalable Colour
• Auto Colour Correlogram
• Tamura Texture
• Edge Histogram
• Colour Layout

The Scalable Colour, Edge Histogram and Colour Layout features are implemented as specified in the MPEG-7 schema [5].

3. GROUND TRUTH AND EVALUATION
The geo-coordinates associated with the Flickr videos will be used as the ground truth. Since these do not always precisely pinpoint the location of a video, the evaluation will be carried out at each of a series of widening circles: 1 km, 10 km, 100 km, 1,000 km and 10,000 km. If a reported location falls within a given circle radius, it is counted as correctly localised. The accuracy at each radius will be reported.

The orthodromic (great-circle) and Euclidean distances between the ground truth coordinates and those reported by participants will also be calculated.

We are also interested in the issue of videos uploaded by an uploader who was unseen in the development (i.e. training) data. In order to examine this issue, we calculate a second set of scores over the part of the test data containing only unseen uploaders.
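To make the scoring concrete, the sketch below (our illustration, not the official evaluation code) computes the orthodromic distance with the haversine formula, assuming a mean Earth radius of 6371 km, and reports the fraction of videos localised within each threshold radius.

import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch
RADII_KM = [1, 10, 100, 1000, 10000]

def orthodromic_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points,
    # computed with the haversine formula.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_at_radii(predictions, ground_truth):
    # Both arguments map a video id to a (lat, lon) pair; returns the
    # fraction of videos whose prediction falls within each radius.
    hits = {r: 0 for r in RADII_KM}
    for vid, (true_lat, true_lon) in ground_truth.items():
        pred_lat, pred_lon = predictions[vid]
        d = orthodromic_km(pred_lat, pred_lon, true_lat, true_lon)
        for r in RADII_KM:
            if d <= r:
                hits[r] += 1
    return {r: hits[r] / len(ground_truth) for r in RADII_KM}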
4. TASK DETAILS
Participants may submit between three and five runs. They can make use of image metadata and audio and visual features, as well as external resources, depending on the run. A minimum of one run that uses only audio/visual features is required. The other two required runs allow free use of the provided data (but no other), one with the option of using a gazetteer and one without. Participants may submit an optional additional run that uses a gazetteer, as well as an optional run that allows for the crawling of additional material from outside the provided data (the general run).

Participants are not allowed to re-find the provided videos online and use their actual geotags (or other related data) when preparing their runs. This is to ensure that participants contribute to a realistic and sensible benchmark in which all test videos are treated as "unseen". Participants are also asked not to crawl Flickr for any additional videos or images and to use only those provided in the data sets (with an exception made for the optional general run).

5. REFERENCES
[1] J. Hays and A. Efros. IM2GPS: Estimating geographic information from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, June 2008.
[2] P. Kelm, S. Schmiedeke, and T. Sikora. Multi-modal, multi-resource methods for placing Flickr videos on the map. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR '11), pages 52:1–52:8, New York, NY, USA, 2011. ACM.
[3] M. Larson, M. Soleymani, P. Serdyukov, S. Rudinac, C. Wartena, V. Murdock, G. Friedland, R. Ordelman, and G. Jones. Automatic tagging and geotagging in video collections and communities. In ACM International Conference on Multimedia Retrieval (ICMR 2011), 2011.
[4] M. Lux and S. A. Chatzichristofis. LIRE: Lucene Image Retrieval, an extensible Java CBIR library. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08), pages 1085–1088, New York, NY, USA, 2008. ACM.
[5] B. S. Manjunath, J. Ohm, V. V. Vasudevan, and A. Yamada. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703–715, 2001.
[6] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), pages 484–491, New York, NY, USA, 2009. ACM.
[7] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR '11), page 48. ACM, 2011.