
Jaeyoung Choi¹,², jaeyoung@icsi.berkeley.edu
Claudia Hauff
Olivier Van Laere⁰, oliviervanlaere@gmail.com
Bart Thomee³, bthomee@yahoo-inc.com

⁰ Blueshift Labs, San Francisco, USA
¹ Delft University of Technology, the Netherlands
² International Computer Science Institute, Berkeley, USA
³ Yahoo Labs, USA

2015


The sixth edition of the Placing Task at MediaEval introduces two new sub-tasks: (1) locale-based placing, which emphasizes the need to move away from an evaluation purely based on latitude and longitude towards an entity-centered evaluation, and (2) mobility-based placing, which addresses predicting missing locations within a sequence of movements; the latter is a specific real-world use case that so far has received little attention within the research community. Two additional changes over the previous years are the introduction of open source organizer baselines for both sub-tasks shortly after the official data release, and the implementation of a live leaderboard, which allows the participants to gain insights into the effectiveness of their approaches compared to the official baselines and in relation to each other at an early stage, before the actual run submissions are due.

1. INTRODUCTION

The Placing Task challenges participants to develop techniques to automatically annotate photos and videos with their geolocation using their visual content and/or textual metadata. In particular, we wish to see participants extend and improve upon the contributions of participants from previous editions, as well as of the research community at large, e.g. [7, 10, 3, 1, 5, 8]. Although the Placing Task has indeed been shown to be a "research catalyst" [6] for geoprediction of social multimedia, with each edition of the task it becomes a greater challenge to alter the benchmark sufficiently to allow and motivate participants to make substantial changes to their frameworks and systems instead of small technical ones. This year's introduction of organizer baselines, a leaderboard, and novel sub-tasks was driven by this consideration.

2. DATA

This year's edition of the Placing Task was based on the YFCC100M¹ [9], which to date is the largest social multimedia collection that is publicly and freely available. The full dataset consists of 100 million Flickr² Creative Commons³ licensed photos and videos with associated metadata. Similar to last year's edition [2], we sampled a subset of the YFCC100M for training and testing, see Table 1. The need for two separate datasets arose from the task requirements (described in Section 3).

¹ https://bit.ly/yfcc100md
² https://www.flickr.com
³ https://www.creativecommons.org

Table 1:

                                  | Training            | Testing
                                  | #Photos   | #Videos | #Photos | #Videos
Locale-based placing sub-task     | 4,672,382 |         |         |
Mobility-based placing sub-task   |           |         |         |

No user appeared in both the training set and the test set. To minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos, and no two photos/videos taken by the same user less than 10 minutes apart were included. The rather uncontrolled nature of the data (sampled from longitudinal, large-scale, noisy and biased raw data) confronts participants with additional challenges. To lower the entrance barrier, we precomputed and provided participants with fifteen visual and three aural features commonly used in multimedia analysis for each of the media objects, including SIFT, Gist, and color and texture histograms for visual analysis, and MFCC for audio analysis [2].
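The per-user constraints described in this section can be illustrated with a minimal sketch. It is not the organizers' sampling code: the function name `sample_items`, the tuple layout, and the choice to measure the 10-minute gap against the user's previously kept item (regardless of media type) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Caps and spacing as stated in the text; everything else is hypothetical.
MAX_PER_USER = {"photo": 250, "video": 50}
MIN_GAP = timedelta(minutes=10)


def sample_items(items):
    """Filter (user_id, media_type, taken_at) tuples.

    Keeps at most 250 photos and 50 videos per user, and skips any item
    taken less than 10 minutes after that user's previously kept item.
    """
    kept = []
    counts = defaultdict(int)  # (user, media_type) -> number of items kept
    last_kept = {}             # user -> timestamp of the last kept item

    # Process each user's items in chronological order.
    for user, media_type, taken_at in sorted(items, key=lambda it: (it[0], it[2])):
        if counts[(user, media_type)] >= MAX_PER_USER[media_type]:
            continue
        previous = last_kept.get(user)
        if previous is not None and taken_at - previous < MIN_GAP:
            continue
        kept.append((user, media_type, taken_at))
        counts[(user, media_type)] += 1
        last_kept[user] = taken_at
    return kept
```

A greedy chronological pass is only one way to satisfy the constraints; a random draw among a user's eligible items would satisfy them equally well.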

3. TASKS

Locale-based sub-task: In this sub-task, participants were given a hierarchy of places across the world, ranging across neighborhoods, cities, regions, countries and continents. For each photo and video, they were asked to pick the node (i.e. the place) from the hierarchy in which they most confidently believed it had been taken. While the ground truth locations of the photos and videos were associated with the most accurate nodes (i.e. the leaves) in the hierarchy, the participants could express a reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy. If their confidence was sufficiently high, participants could instead directly estimate the geographic coordinate of the photo/video rather than choosing a node from the hierarchy.

As our place hierarchy we used version 2.0 of the open source GADM database⁴, which contains the spatial boundaries of the world's administrative areas. As the GADM only contains data up to city level, we manually supplemented it with neighborhood data for several cities obtained from the geo-game ClickThatHood⁵. In total, the hierarchy contains 221,458 leaf nodes that are spread across 253 countries. The hierarchy has a maximum depth of 7 and an average depth of 4.33, with each place being a variation of the general hierarchy:

Country → State → Province → County → City → Neighborhood

Due to the use of the hierarchy, only photos and videos taken within any of the GADM boundaries were part of this sub-task, and thus media captured in or above international waters were excluded.
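To make the idea of expressing reduced confidence by choosing a higher node concrete, the following is a minimal sketch of a place tree with a back-off rule. The `Place` class, the `back_off` function, the 0.7 threshold, the place names and the confidence values are all hypothetical; the sketch does not reflect the GADM data format or any participant's system.

```python
class Place:
    """A node in a toy place hierarchy (hypothetical structure)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the names from the root down to this node."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


def back_off(leaf, confidence, threshold=0.7):
    """Walk up from `leaf` until a node's confidence reaches `threshold`.

    `confidence` maps node names to scores from some (assumed) estimator.
    If no node qualifies, the root is returned.
    """
    node = leaf
    while node.parent is not None and confidence.get(node.name, 0.0) < threshold:
        node = node.parent
    return node


# Illustrative names only; not taken from the task data.
country = Place("United States")
state = Place("California", country)
city = Place("San Francisco", state)
hood = Place("Mission District", city)

scores = {"Mission District": 0.35, "San Francisco": 0.80,
          "California": 0.92, "United States": 0.99}
chosen = back_off(hood, scores)
```

In this toy example the estimator is not confident enough about the neighborhood, so the city node is submitted instead.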

Mobility-based sub-task: In this sub-task, participants were given a sequence of photos taken in a certain city by a specific user, of which not all photos were associated with a geographic coordinate (e.g. the user took some photos when GPS was temporarily unavailable). The participants were asked to predict the locations of those photos with missing coordinates. The nearly 150K training photos of this sub-task were divided into 23,116 sequences, while the approximately 33K test photos were separated into 5,119 sequences. In each test sequence, about 30% of the coordinates were missing; these are the ones that needed to be predicted.
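One simple way to approach such a sequence is to interpolate each missing location in time between the nearest located photos before and after it. The sketch below illustrates that idea only; it is not the organizer baseline, and the function name `fill_missing` and the input layout are assumptions. Interpolating latitude and longitude linearly ignores the Earth's curvature and longitude wrap-around, which is a tolerable simplification only at city scale.

```python
def fill_missing(sequence):
    """Estimate missing coordinates in a time-sorted photo sequence.

    `sequence` is a list of (timestamp_seconds, location) pairs, where
    location is a (lat, lon) tuple or None when the coordinate is missing.
    """
    known = [i for i, (_, loc) in enumerate(sequence) if loc is not None]
    filled = []
    for i, (t, loc) in enumerate(sequence):
        if loc is not None or not known:
            # Already located, or nothing to interpolate from.
            filled.append((t, loc))
            continue
        before = max((k for k in known if k < i), default=None)
        after = min((k for k in known if k > i), default=None)
        if before is None:
            estimate = sequence[after][1]    # copy the first known location
        elif after is None:
            estimate = sequence[before][1]   # copy the last known location
        else:
            t0, (lat0, lon0) = sequence[before]
            t1, (lat1, lon1) = sequence[after]
            # Weight by elapsed time between the two known photos.
            w = 0.5 if t1 == t0 else (t - t0) / (t1 - t0)
            estimate = (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))
        filled.append((t, estimate))
    return filled


photos = [(0, (52.0, 4.0)), (600, None), (1200, (52.2, 4.4)), (1800, None)]
result = fill_missing(photos)
```

Here the second photo is placed halfway between its neighbors, and the last photo, which has no located successor, inherits the last known location.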

4. RUNS

Participants may submit up to five attempts ('runs') for each sub-task. They can make use of the provided metadata and precomputed features, as well as external resources (e.g. gazetteers, dictionaries, Web corpora), depending on the run type. We distinguish between the following five run types:

Run 1: Only provided textual metadata may be used.

Run 2: Only provided visual & aural features may be used.

Run 3: Only provided textual metadata and the visual & aural features may be used.

Runs 4-5: Everything is allowed, except for crawling the exact items contained in the test set, or any items by a test user taken within 24 hours before the first and after the last timestamp of a photo sequence in the mobility test set.

5. EVALUATION

For the locale-based sub-task, the evaluation metric is based on a hierarchical distance between the ground truth node and the predicted node or coordinate in the place hierarchy. The mobility-based sub-task is evaluated according to the familiar geographic distance-based metric, where for each test item the distance is computed between the ground truth coordinate and the estimated coordinate. One important difference with past editions is that this year we measure geographic distances with Karney's formula [4]; this formula is based on the assumption that the shape of the Earth is an oblate spheroid, which produces more accurate distances than methods such as the great-circle distance that assume the shape of the Earth to be a sphere.

As task organizers, we provided two open source baselines to the participants, one for the locale⁶ sub-task and one for the mobility⁷ sub-task. Additionally, we implemented a live leaderboard that allowed participants to submit runs and view their standing relative to others, as evaluated on a representative development set (i.e. part of, but not the complete, test set).

⁶ http://bit.ly/1gsrmvx
⁷ http://bit.ly/1K8vUy8

7. REFERENCES

[1] Choi, Lei, Ekambaram, Kelm, Gottlieb, Sikora, Ramchandran, and Friedland. Human vs machine: establishing a human baseline for multimodal location estimation. In Proceedings of the ACM International Conference on Multimedia, pages 867-876, 2013.

[2] Choi, Thomee, Friedland, Cao, Ni, Borth, Elizalde, Gottlieb, Carrano, Pearce, et al. The Placing Task: a large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 27-31, 2014.

[3] Hauff and Houben. Placing images on the world map: a microblog-based enrichment approach. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 691-700, 2012.

[4] Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43-55, 2013.

[5] Kelm, Schmiedeke, Choi, Friedland, Ekambaram, Ramchandran, and Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 7-12, 2013.

[6] Larson, Kelm, Rae, Hauff, Thomee, Trevisiol, Choi, O. van Laere, S. Schockaert, G. Jones, P. Serdyukov, V. Murdock, and G. Friedland. The benchmark as a research catalyst: charting the progress of geo-prediction for social multimedia. In Multimodal Location Estimation of Videos and Images. 2014.

[7] Rae and Kelm. Working Notes for the Placing Task at MediaEval 2012, 2012.

[8] Serdyukov, Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 484-491, 2009.

[9] Thomee, Shamma, G. Friedland, Elizalde, Ni, Poland, Borth, and Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2015. To appear.

[10] Trevisiol, Jegou, Delhumeau, and Gravier. Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 1-8, 2013.