The Placing Task at MediaEval 2016

Jaeyoung Choi 1,2, Claudia Hauff 2, Olivier Van Laere 3, and Bart Thomee 4
1 International Computer Science Institute, Berkeley, CA, USA
2 Delft University of Technology, the Netherlands
3 Blueshift Labs, San Francisco, CA, USA
4 Google, San Bruno, CA, USA
jaeyoung@icsi.berkeley.edu, c.hauff@tudelft.nl, oliviervanlaere@gmail.com, bthomee@google.com

ABSTRACT
The seventh edition of the Placing Task at MediaEval focuses on two challenges: (1) estimation-based placing, which addresses estimating the geographic location where a photo or video was taken, and (2) verification-based placing, which addresses verifying whether a photo or video was indeed taken at a pre-specified geographic location. Like the previous edition, we made the organizer baselines for both sub-tasks available as open source code, and published a live leaderboard that allows the participants to gain insight into the effectiveness of their approaches compared to the official baselines and in relation to each other at an early stage, before the actual run submissions are due.

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands

1. INTRODUCTION
The Placing Task challenges participants to develop techniques to automatically determine where in the world photos and videos were captured, based on analyzing their visual content and/or textual metadata, optionally augmented with knowledge from external resources like gazetteers. In particular, we encourage those taking part to improve upon the contributions of participants from previous editions, as well as those of the research community at large, e.g. [8, 11, 4, 2, 6, 9]. Although the Placing Task has indeed been shown to be a "research catalyst" [7] for geo-prediction of social multimedia, with each edition of the task it becomes a greater challenge to alter the benchmark sufficiently to allow and motivate participants to make substantial changes to their frameworks and systems instead of small technical ones. The introduction of the verification sub-task this year was driven by this consideration, as it requires participants to integrate a notion of confidence into their location predictions to decide whether or not a photo or video was taken in a particular country, state, city or neighborhood.

2. DATA
This year's edition of the Placing Task was once again based on the YFCC100M [10], which to date is the largest publicly and freely available social multimedia collection, and which can be obtained through the Yahoo Webscope program (https://bit.ly/yfcc100md). The full dataset consists of 100 million Flickr (https://www.flickr.com) Creative Commons (https://www.creativecommons.org) licensed photos and videos with associated metadata. Similar to last year's edition [1], we sampled a subset of the YFCC100M for training and testing, see Table 1. No user appeared in both the training set and the test set, and to minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos; moreover, no photos/videos were included that were taken by the same user less than 10 minutes apart. We included both test sets used in the Placing Tasks of 2014 and 2015 in this year's test set, allowing us to assess how location estimation performance has improved over time.

                Training                 Testing
          #Photos    #Videos       #Photos    #Videos
         4,991,679    24,955      1,497,464    29,934

    Table 1: Overview of training and test set sizes for both sub-tasks.
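To make the sampling constraints above concrete, the following is a minimal per-user filtering sketch; the item fields (taken_at, is_video) and all names are assumptions for illustration, not the organizers' actual pipeline:

    from datetime import timedelta

    # Limits as stated in the task description; the item fields
    # taken_at (a datetime) and is_video (a bool) are hypothetical.
    MAX_PHOTOS, MAX_VIDEOS = 250, 50
    MIN_GAP = timedelta(minutes=10)

    def sample_user_media(items):
        """Keep at most 250 photos and 50 videos for one user, skipping
        any item taken less than 10 minutes after the last kept one."""
        kept, n_photos, n_videos, last = [], 0, 0, None
        for item in sorted(items, key=lambda x: x.taken_at):
            if last is not None and item.taken_at - last < MIN_GAP:
                continue  # violates the 10-minute spacing rule
            if item.is_video and n_videos < MAX_VIDEOS:
                kept.append(item); n_videos += 1; last = item.taken_at
            elif not item.is_video and n_photos < MAX_PHOTOS:
                kept.append(item); n_photos += 1; last = item.taken_at
        return kept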
The rather uncontrolled nature of the data (sampled from longitudinal, large-scale, noisy and biased raw data) confronts participants with additional challenges. To lower the entrance barrier, we precomputed and provided participants with fifteen visual and three aural features commonly used in multimedia analysis for each of the media objects, including SIFT, Gist, and color and texture histograms for visual analysis, and MFCC for audio analysis [3], which together with the original photo and video content are publicly and freely available through the Multimedia Commons Initiative (http://www.mmcommons.org). In addition, several expansion packs have been released by the creators of the YFCC100M dataset, such as detected visual concepts and Exif metadata, which could prove useful for the participants.

3. TASKS
Estimation-based sub-task: In this sub-task, participants were given a hierarchy of places across the world, ranging across neighborhoods, cities, regions, countries and continents. For each photo and video, they were asked to pick the node (i.e. a place) from the hierarchy in which they most confidently believed it had been taken. While the ground truth locations of the photos and videos were associated with their actual coordinates, and thus in essence with the most accurate nodes (i.e. the leaves) in the hierarchy, the participants could express a reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy. If their confidence was sufficiently high, participants could naturally estimate the geographic coordinate of the photo/video directly instead of choosing a node from the hierarchy. As our place hierarchy we used the Places expansion pack of the YFCC100M dataset, in which each geotagged photo and video is mapped to its corresponding place, following a variation of the general hierarchy:

Country→State→City→Neighborhood

Due to the use of the hierarchy, only photos and videos that were successfully reverse geocoded were included in this sub-task, and thus media captured in or above international waters were excluded.

Verification-based sub-task: In this sub-task, participants were given a photo or video and a place from the hierarchy, and were asked to verify whether or not the media item was really captured in the given place. In the test set, we randomly switched the locations of 50% of the photos and videos, where we required that those switched were at least taken in a different country. Then, for 25% of the media items we removed the neighborhood level and below, for 25% the city level and below, and for 25% the state level and below, enabling us to assess how the level of the hierarchy affects the verification quality of the participants' systems.
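The construction of the verification test set can be illustrated with a short sketch; the place representation (a country/state/city/neighborhood tuple) and all names here are assumptions for illustration, not the organizers' actual code:

    import random

    def build_verification_set(items, places_by_country):
        """Illustrative sketch: switch the claimed place of 50% of the
        items to a randomly chosen place in a different country (label
        False), keep the true place for the rest (label True), then
        truncate the claimed place to a coarser hierarchy level for
        three quarters of the items (25% each)."""
        random.shuffle(items)
        half = len(items) // 2
        for i, item in enumerate(items):
            if i < half:  # switched: claimed place lies in another country
                country = random.choice(
                    [c for c in places_by_country if c != item.place[0]])
                item.claimed_place = random.choice(places_by_country[country])
                item.label = False
            else:         # unswitched: claimed place is the true place
                item.claimed_place = item.place
                item.label = True
            # depth 4 keeps the full path; 3, 2 and 1 drop the
            # neighborhood, city, and state levels (and below)
            depth = (4, 3, 2, 1)[i % 4]
            item.claimed_place = item.claimed_place[:depth]
        return items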
4. RUNS
Participants may submit up to five attempts ('runs') for each sub-task. They can make use of the provided metadata and precomputed features, as well as external resources (e.g. gazetteers, dictionaries, Web corpora), depending on the run type. We distinguish between the following five run types:

Run 1: Only the provided textual metadata may be used.
Run 2: Only the provided visual & aural features may be used.
Run 3: Only the provided textual metadata and visual & aural features may be used.
Run 4–5: Everything is allowed, except for crawling the exact items contained in the test set.

5. EVALUATION
For the estimation-based sub-task, the evaluation metric is based on the geographic distance between the ground truth coordinate and the predicted coordinate or place from the hierarchy. Whenever a participant estimates a place from the hierarchy, we substitute it by its geographic centroid. We measure geographic distances with Karney's formula [5]; this formula is based on the assumption that the shape of the Earth is an oblate spheroid, which produces more accurate distances than methods such as the great-circle distance that assume the shape of the Earth to be a sphere. For the verification-based sub-task, we measure the classification accuracy.
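For reference, Karney's geodesic algorithm [5] is implemented in the geographiclib package, so the estimation metric can be reproduced along the following lines (a sketch, not the official scoring script):

    from geographiclib.geodesic import Geodesic  # pip install geographiclib

    def estimation_error_km(true_lat, true_lon, pred_lat, pred_lon):
        """Geodesic distance on the WGS84 oblate spheroid between the
        ground truth and the prediction; a place picked from the
        hierarchy is assumed to have been replaced by its centroid."""
        g = Geodesic.WGS84.Inverse(true_lat, true_lon, pred_lat, pred_lon)
        return g["s12"] / 1000.0  # s12 is the geodesic distance in meters

    # Example: Hilversum (NL) to San Bruno (CA), roughly 8,800 km
    print(round(estimation_error_km(52.22, 5.18, 37.63, -122.41)))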
6. BASELINES & LEADERBOARD
As task organizers, we provided two open source baselines (http://bit.ly/2dnggcg) to the participants, one for the estimation sub-task and one for the verification sub-task. Additionally, we implemented a live leaderboard that allowed participants to submit runs and view their standing relative to others, as evaluated on a representative development set (i.e. part of, but not the complete, test set).

7. REFERENCES
[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The Placing Task at MediaEval 2015. In Working Notes of the MediaEval Benchmarking Initiative for Multimedia Evaluation, 2015.
[2] J. Choi, H. Lei, V. Ekambaram, P. Kelm, L. Gottlieb, T. Sikora, K. Ramchandran, and G. Friedland. Human vs machine: establishing a human baseline for multimodal location estimation. In Proceedings of the ACM International Conference on Multimedia, pages 867–876, 2013.
[3] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, et al. The Placing Task: a large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 27–31, 2014.
[4] C. Hauff and G. Houben. Placing images on the world map: a microblog-based enrichment approach. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 691–700, 2012.
[5] C. Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43–55, 2013.
[6] P. Kelm, S. Schmiedeke, J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran, and T. Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 7–12, 2013.
[7] M. Larson, P. Kelm, A. Rae, C. Hauff, B. Thomee, M. Trevisiol, J. Choi, O. van Laere, S. Schockaert, G. Jones, P. Serdyukov, V. Murdock, and G. Friedland. The benchmark as a research catalyst: charting the progress of geo-prediction for social multimedia. In Multimodal Location Estimation of Videos and Images. 2014.
[8] A. Rae and P. Kelm. Working Notes for the Placing Task at MediaEval 2012, 2012.
[9] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 484–491, 2009.
[10] B. Thomee, D. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[11] M. Trevisiol, H. Jégou, J. Delhumeau, and G. Gravier. Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 1–8, 2013.