The Placing Task at MediaEval 2015

Jaeyoung Choi (1,2), Claudia Hauff (2), Olivier Van Laere (3), and Bart Thomee (4)
(1) International Computer Science Institute, Berkeley, USA
(2) Delft University of Technology, the Netherlands
(3) Blueshift Labs, San Francisco, USA
(4) Yahoo Labs, USA
jaeyoung@icsi.berkeley.edu, c.hauff@tudelft.nl, oliviervanlaere@gmail.com, bthomee@yahoo-inc.com


Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14–15, 2015, Wurzen, Germany

ABSTRACT
The sixth edition of the Placing Task at MediaEval introduces two new sub-tasks: (1) locale-based placing, which emphasizes the need to move away from an evaluation purely based on latitude and longitude towards an entity-centered evaluation, and (2) mobility-based placing, which addresses predicting missing locations within a sequence of movements; the latter is a specific real-world use case that so far has received little attention within the research community. Two additional changes over the previous years are the introduction of open source organizer baselines for both sub-tasks shortly after the official data release, and the implementation of a live leaderboard, which allows the participants to gain insights into the effectiveness of their approaches compared to the official baselines and in relation to each other at an early stage, before the actual run submissions are due.

1. INTRODUCTION
The Placing Task challenges participants to develop techniques to automatically annotate photos and videos with their geolocation using their visual content and/or textual metadata. In particular, we wish to see those taking part extend and improve upon the contributions of participants from previous editions, as well as of the research community at large, e.g. [7, 10, 3, 1, 5, 8]. Although the Placing Task has indeed been shown to be a “research catalyst” [6] for geoprediction of social multimedia, with each edition of the task it becomes a greater challenge to alter the benchmark sufficiently to allow and motivate participants to make substantial changes to their frameworks and systems instead of small technical ones. This year’s introduction of organizer baselines, a leaderboard, and novel sub-tasks was driven by this consideration.

2. DATA
This year’s edition of the Placing Task was based on the YFCC100M (https://bit.ly/yfcc100md) [9], which to date is the largest social multimedia collection that is publicly and freely available. The full dataset consists of 100 million Flickr (https://www.flickr.com) Creative Commons (https://www.creativecommons.org) licensed photos and videos with associated metadata. Similar to last year’s edition [2], we sampled a subset of the YFCC100M for training and testing (see Table 1). The need for two separate datasets arose from the task requirements (described in Section 3). No user appeared both in the training set and in the test set. To minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos, and no two photos/videos taken by the same user less than 10 minutes apart were included. The rather uncontrolled nature of the data (sampled from longitudinal, large-scale, noisy and biased raw data) confronts participants with additional challenges. To lower the entrance barrier, we precomputed and provided participants with fifteen visual and three aural features commonly used in multimedia analysis for each of the media objects, including SIFT, Gist, color and texture histograms for visual analysis, and MFCC for audio analysis [2].

                                   Training               Testing
Sub-task                      #Photos   #Videos      #Photos   #Videos
Locale-based placing        4,672,382    22,767      931,573    18,316
Mobility-based placing        148,349         0       33,026         0

Table 1: Overview of training and test sets for both sub-tasks.

3. TASKS
Locale-based sub-task: In this sub-task, participants were given a hierarchy of places across the world, ranging across neighborhoods, cities, regions, countries and continents. For each photo and video, they were asked to pick the node (i.e. a place) from the hierarchy in which they most confidently believed it had been taken. While the ground truth locations of the photos and videos were associated with the most accurate nodes (i.e. the leaves) in the hierarchy, the participants could express a reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy. If their confidence was sufficiently high, participants could naturally estimate the geographic coordinate of the photo/video directly instead of choosing a node from the hierarchy.

As our place hierarchy we used version 2.0 of the open source GADM database (http://www.gadm.org), which contains the spatial boundaries of the world’s administrative areas. As the GADM only
contains data up to city level, we manually supplemented it with neighborhood data for several cities obtained from the geo-game ClickThatHood (http://www.click-that-hood.com/). In total, the hierarchy contains 221,458 leaf nodes that are spread across 253 countries. The hierarchy has a maximum depth of 7 and an average depth of 4.33, with each place being a variation of the general hierarchy:

Country→State→Province→County→City→Neighborhood

Due to the use of the hierarchy, only photos and videos taken within any of the GADM boundaries were part of this sub-task, and thus media captured in or above international waters were excluded.

Mobility-based sub-task: In this sub-task, participants were given a sequence of photos taken in a certain city by a specific user, of which not all photos were associated with a geographic coordinate (e.g. the user took some photos when GPS was temporarily unavailable). The participants were asked to predict the locations of those photos with missing coordinates. The nearly 150K training photos of this sub-task were divided into 23,116 sequences, while the approximately 33K test photos were separated into 5,119 sequences. In each test sequence about 30% of the coordinates were missing; these are the ones that needed to be predicted.

4. RUNS
Participants may submit up to five attempts (‘runs’) for each sub-task. They can make use of the provided metadata and precomputed features, as well as external resources (e.g. gazetteers, dictionaries, Web corpora), depending on the run type. We distinguish between the following five run types:

Run 1: Only provided textual metadata may be used.
Run 2: Only provided visual & aural features may be used.
Run 3: Only provided textual metadata and the visual & aural features may be used.
Run 4–5: Everything is allowed, except for crawling the exact items contained in the test set, or any items by a test user taken within 24 hours before the first and after the last timestamp of a photo sequence in the mobility test set.

5. EVALUATION
For the locale-based sub-task, the evaluation metric is based on a hierarchical distance between the ground truth node and the predicted node or coordinate in the place hierarchy. The mobility-based sub-task is evaluated according to the familiar geographic distance-based metric, where for each test item the distance is computed between the ground truth coordinate and the estimated coordinate. One important difference with past editions is that this year we measure geographic distances with Karney’s formula [4]. This formula assumes that the shape of the Earth is an oblate spheroid, which produces more accurate distances than methods such as the great-circle distance that assume the Earth to be a sphere.

6. BASELINES & LEADERBOARD
As task organizers, we provided two open source baselines to the participants, one for the locale sub-task (http://bit.ly/1gsrmvx) and one for the mobility sub-task (http://bit.ly/1K8vUy8). Additionally, we implemented a live leaderboard that allowed participants to submit runs and view their standing relative to others, as evaluated on a representative development set (i.e. part of, but not the complete, test set).

7. REFERENCES
[1] J. Choi, H. Lei, V. Ekambaram, P. Kelm, L. Gottlieb, T. Sikora, K. Ramchandran, and G. Friedland. Human vs machine: establishing a human baseline for multimodal location estimation. In Proceedings of the ACM International Conference on Multimedia, pages 867–876, 2013.
[2] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, et al. The Placing Task: a large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 27–31, 2014.
[3] C. Hauff and G. Houben. Placing images on the world map: a microblog-based enrichment approach. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 691–700, 2012.
[4] C. Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43–55, 2013.
[5] P. Kelm, S. Schmiedeke, J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran, and T. Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 7–12, 2013.
[6] M. Larson, P. Kelm, A. Rae, C. Hauff, B. Thomee, M. Trevisiol, J. Choi, O. van Laere, S. Schockaert, G. Jones, P. Serdyukov, V. Murdock, and G. Friedland. The benchmark as a research catalyst: charting the progress of geo-prediction for social multimedia. In Multimodal Location Estimation of Videos and Images. 2014.
[7] A. Rae and P. Kelm. Working Notes for the Placing Task at MediaEval 2012, 2012.
[8] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 484–491, 2009.
[9] B. Thomee, D. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2015. To appear.
[10] M. Trevisiol, H. Jégou, J. Delhumeau, and G. Gravier. Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 1–8, 2013.
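
Illustrative sketch for Section 5. The evaluation section contrasts spherical great-circle distances with distances on an oblate spheroid. The Python sketch below illustrates the size of that difference; it is not the organizers' evaluation code. Several choices here are assumptions rather than details from the task description: the WGS84 ellipsoid parameters, the mean Earth radius used for the sphere, and the function names `great_circle_m` and `ellipsoidal_m`. To stay dependency-free, the ellipsoidal distance uses Vincenty's inverse formula as a simpler stand-in for Karney's algorithm [4]; both model the Earth as an oblate spheroid, but Karney's method also handles nearly antipodal points, which this sketch does not.

```python
import math

# Assumed constants (WGS84 ellipsoid and a commonly used mean radius).
WGS84_A = 6378137.0                  # semi-major axis, metres
WGS84_F = 1 / 298.257223563          # flattening
WGS84_B = WGS84_A * (1 - WGS84_F)    # semi-minor axis, metres
MEAN_RADIUS = 6371008.8              # mean Earth radius, metres


def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres on a sphere of radius MEAN_RADIUS."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * MEAN_RADIUS * math.asin(math.sqrt(h))


def ellipsoidal_m(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Vincenty inverse distance in metres on the WGS84 ellipsoid."""
    a, b, f = WGS84_A, WGS84_B, WGS84_F
    # Reduced latitudes of both points.
    u1 = math.atan((1 - f) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)
    big_l = math.radians(lon2 - lon1)
    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cos_u2 * sin_lam,
                               cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        if sin_sigma == 0.0:
            if cos_sigma > 0:
                return 0.0  # coincident points
            raise ValueError("antipodal points are not supported here")
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # cos(2 * sigma_m); defined as 0 for lines along the equator.
        cos_2sm = (cos_sigma - 2 * sin_u1 * sin_u2 / cos2_alpha
                   if cos2_alpha != 0 else 0.0)
        c = f / 16 * cos2_alpha * (4 + f * (4 - 3 * cos2_alpha))
        lam_new = big_l + (1 - c) * f * sin_alpha * (
            sigma + c * sin_sigma * (
                cos_2sm + c * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        raise ValueError("Vincenty iteration did not converge")
    usq = cos2_alpha * (a * a - b * b) / (b * b)
    big_a = 1 + usq / 16384 * (4096 + usq * (-768 + usq * (320 - 175 * usq)))
    big_b = usq / 1024 * (256 + usq * (-128 + usq * (74 - 47 * usq)))
    delta_sigma = big_b * sin_sigma * (
        cos_2sm + big_b / 4 * (
            cos_sigma * (-1 + 2 * cos_2sm ** 2)
            - big_b / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2)
            * (-3 + 4 * cos_2sm ** 2)))
    return b * big_a * (sigma - delta_sigma)


if __name__ == "__main__":
    # One degree of longitude along the equator under both Earth models.
    print(great_circle_m(0, 0, 0, 1), ellipsoidal_m(0, 0, 0, 1))
```

Under these assumptions, the two models differ by roughly 0.1% for one degree of longitude along the equator. This indicates the scale of error that the spherical approximation introduces and that the switch to Karney's formula avoids.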