The Placing Task at MediaEval 2016

Jaeyoung Choi 1,2, Claudia Hauff 2, Olivier Van Laere 3, and Bart Thomee 4
1 International Computer Science Institute, Berkeley, CA, USA
2 Delft University of Technology, the Netherlands
3 Blueshift Labs, San Francisco, CA, USA
4 Google, San Bruno, CA, USA
jaeyoung@icsi.berkeley.edu, c.hauff@tudelft.nl, oliviervanlaere@gmail.com, bthomee@google.com

ABSTRACT
The seventh edition of the Placing Task at MediaEval focuses on two challenges: (1) estimation-based placing, which addresses estimating the geographic location where a photo or video was taken, and (2) verification-based placing, which addresses verifying whether a photo or video was indeed taken at a pre-specified geographic location. Like the previous edition, we made the organizer baselines for both sub-tasks available as open source code, and published a live leaderboard that allows the participants to gain insight into the effectiveness of their approaches compared to the official baselines and in relation to each other at an early stage, before the actual run submissions are due.

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands

1. INTRODUCTION
The Placing Task challenges participants to develop techniques to automatically determine where in the world photos and videos were captured, based on analyzing their visual content and/or textual metadata, optionally augmented with knowledge from external resources like gazetteers. In particular, we encourage those taking part to improve upon the contributions of participants from previous editions, as well as those of the research community at large, e.g. [8, 11, 4, 2, 6, 9]. Although the Placing Task has indeed been shown to be a "research catalyst" [7] for geo-prediction of social multimedia, with each edition of the task it becomes a greater challenge to alter the benchmark sufficiently to allow and motivate participants to make substantial changes to their frameworks and systems instead of small technical ones. The introduction of the verification sub-task this year was driven by this consideration, as it requires participants to integrate a notion of confidence into their location predictions to decide whether or not a photo or video was taken in a particular country, state, city or neighborhood.

2. DATA
This year's edition of the Placing Task was once again based on the YFCC100M [10], which to date is the largest publicly and freely available social multimedia collection, and which can be obtained through the Yahoo Webscope program (https://bit.ly/yfcc100md). The full dataset consists of 100 million Flickr (https://www.flickr.com) Creative Commons (https://www.creativecommons.org) licensed photos and videos with associated metadata. Similar to last year's edition [1], we sampled a subset of the YFCC100M for training and testing, see Table 1. No user appeared in both the training set and the test set, and to minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos; moreover, no photos/videos were included that were taken by the same user less than 10 minutes apart. We included both test sets used in the Placing Tasks of 2014 and 2015 in this year's test set, allowing us to assess how location estimation performance has improved over time.

                Training                 Testing
          #Photos    #Videos       #Photos    #Videos
         4,991,679    24,955      1,497,464    29,934

    Table 1: Overview of training and test set sizes for both sub-tasks.
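To make the sampling constraints above concrete, the following is a minimal per-user filtering sketch; the item fields (taken_at, is_video) and all names are assumptions for illustration, not the organizers' actual pipeline:

    from datetime import timedelta

    # Limits as stated in the task description; the item fields
    # taken_at (a datetime) and is_video (a bool) are hypothetical.
    MAX_PHOTOS, MAX_VIDEOS = 250, 50
    MIN_GAP = timedelta(minutes=10)

    def sample_user_media(items):
        """Keep at most 250 photos and 50 videos for one user, skipping
        any item taken less than 10 minutes after the last kept one."""
        kept, n_photos, n_videos, last = [], 0, 0, None
        for item in sorted(items, key=lambda x: x.taken_at):
            if last is not None and item.taken_at - last < MIN_GAP:
                continue  # violates the 10-minute spacing rule
            if item.is_video and n_videos < MAX_VIDEOS:
                kept.append(item); n_videos += 1; last = item.taken_at
            elif not item.is_video and n_photos < MAX_PHOTOS:
                kept.append(item); n_photos += 1; last = item.taken_at
        return kept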
The rather uncontrolled nature of the data (sampled from longitudinal, large-scale, noisy and biased raw data) confronts participants with additional challenges. To lower the entrance barrier, we precomputed and provided participants with fifteen visual and three aural features commonly used in multimedia analysis for each of the media objects, including SIFT, Gist, and color and texture histograms for visual analysis, and MFCC for audio analysis [3], which together with the original photo and video content are publicly and freely available through the Multimedia Commons Initiative (http://www.mmcommons.org). In addition, several expansion packs have been released by the creators of the YFCC100M dataset, such as detected visual concepts and Exif metadata, which could prove useful for the participants.

3. TASKS
Estimation-based sub-task: In this sub-task, participants were given a hierarchy of places across the world, ranging across neighborhoods, cities, regions, countries and continents. For each photo and video, they were asked to pick the node (i.e. a place) from the hierarchy in which they most confidently believed it had been taken. While the ground truth locations of the photos and videos were associated with their actual coordinates, and thus in essence with the most accurate nodes (i.e. the leaves) in the hierarchy, the participants could express a reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy. If their confidence was sufficiently high, participants could naturally estimate the geographic coordinate of the photo/video directly instead of choosing a node from the hierarchy. As our place hierarchy we used the Places expansion pack of the YFCC100M dataset, in which each geotagged photo and video is mapped to its corresponding place, following a variation of the general hierarchy:

Country→State→City→Neighborhood

Due to the use of the hierarchy, only photos and videos that were successfully reverse geocoded were included in this sub-task, and thus media captured in or above international waters were excluded.

Verification-based sub-task: In this sub-task, participants were given a photo or video and a place from the hierarchy, and were asked to verify whether or not the media item was really captured in the given place. In the test set, we randomly switched the locations of 50% of the photos and videos, where we required that those switched were at least taken in a different country. Then, for 25% of the media items we removed the neighborhood level and below, for 25% the city level and below, and for 25% the state level and below, enabling us to assess how the level of the hierarchy affects the verification quality of the participants' systems.
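The construction of the verification test set can be illustrated with a short sketch; the place representation (a country/state/city/neighborhood tuple) and all names here are assumptions for illustration, not the organizers' actual code:

    import random

    def build_verification_set(items, places_by_country):
        """Illustrative sketch: switch the claimed place of 50% of the
        items to a randomly chosen place in a different country (label
        False), keep the true place for the rest (label True), then
        truncate the claimed place to a coarser hierarchy level for
        three quarters of the items (25% each)."""
        random.shuffle(items)
        half = len(items) // 2
        for i, item in enumerate(items):
            if i < half:  # switched: claimed place lies in another country
                country = random.choice(
                    [c for c in places_by_country if c != item.place[0]])
                item.claimed_place = random.choice(places_by_country[country])
                item.label = False
            else:         # unswitched: claimed place is the true place
                item.claimed_place = item.place
                item.label = True
            # depth 4 keeps the full path; 3, 2 and 1 drop the
            # neighborhood, city, and state levels (and below)
            depth = (4, 3, 2, 1)[i % 4]
            item.claimed_place = item.claimed_place[:depth]
        return items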
4. RUNS
Participants may submit up to five attempts ('runs') for each sub-task. They can make use of the provided metadata and precomputed features, as well as external resources (e.g. gazetteers, dictionaries, Web corpora), depending on the run type. We distinguish between the following five run types:

Run 1: Only the provided textual metadata may be used.
Run 2: Only the provided visual & aural features may be used.
Run 3: Only the provided textual metadata and visual & aural features may be used.
Run 4–5: Everything is allowed, except for crawling the exact items contained in the test set.

5. EVALUATION
For the estimation-based sub-task, the evaluation metric is based on the geographic distance between the ground truth coordinate and the predicted coordinate or place from the hierarchy. Whenever a participant estimates a place from the hierarchy, we substitute it by its geographic centroid. We measure geographic distances with Karney's formula [5]; this formula is based on the assumption that the shape of the Earth is an oblate spheroid, which produces more accurate distances than methods such as the great-circle distance that assume the shape of the Earth to be a sphere. For the verification-based sub-task, we measure the classification accuracy.
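For reference, Karney's geodesic algorithm [5] is implemented in the geographiclib package, so the estimation metric can be reproduced along the following lines (a sketch, not the official scoring script):

    from geographiclib.geodesic import Geodesic  # pip install geographiclib

    def estimation_error_km(true_lat, true_lon, pred_lat, pred_lon):
        """Geodesic distance on the WGS84 oblate spheroid between the
        ground truth and the prediction; a place picked from the
        hierarchy is assumed to have been replaced by its centroid."""
        g = Geodesic.WGS84.Inverse(true_lat, true_lon, pred_lat, pred_lon)
        return g["s12"] / 1000.0  # s12 is the geodesic distance in meters

    # Example: Hilversum (NL) to San Bruno (CA), roughly 8,800 km
    print(round(estimation_error_km(52.22, 5.18, 37.63, -122.41)))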
6. BASELINES & LEADERBOARD
As task organizers, we provided two open source baselines (http://bit.ly/2dnggcg) to the participants, one for the estimation sub-task and one for the verification sub-task. Additionally, we implemented a live leaderboard that allowed participants to submit runs and view their standing relative to others, as evaluated on a representative development set (i.e. part of, but not the complete, test set).

7. REFERENCES
[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The Placing Task at MediaEval 2015. In Working Notes of the MediaEval Benchmarking Initiative for Multimedia Evaluation, 2015.
[2] J. Choi, H. Lei, V. Ekambaram, P. Kelm, L. Gottlieb, T. Sikora, K. Ramchandran, and G. Friedland. Human vs machine: establishing a human baseline for multimodal location estimation. In Proceedings of the ACM International Conference on Multimedia, pages 867–876, 2013.
[3] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, et al. The Placing Task: a large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 27–31, 2014.
[4] C. Hauff and G. Houben. Placing images on the world map: a microblog-based enrichment approach. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 691–700, 2012.
[5] C. Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43–55, 2013.
[6] P. Kelm, S. Schmiedeke, J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran, and T. Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 7–12, 2013.
[7] M. Larson, P. Kelm, A. Rae, C. Hauff, B. Thomee, M. Trevisiol, J. Choi, O. van Laere, S. Schockaert, G. Jones, P. Serdyukov, V. Murdock, and G. Friedland. The benchmark as a research catalyst: charting the progress of geo-prediction for social multimedia. In Multimodal Location Estimation of Videos and Images. 2014.
[8] A. Rae and P. Kelm. Working Notes for the Placing Task at MediaEval 2012, 2012.
[9] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 484–491, 2009.
[10] B. Thomee, D. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[11] M. Trevisiol, H. Jégou, J. Delhumeau, and G. Gravier. Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 1–8, 2013.