The Placing Task at MediaEval 2015

Jaeyoung Choi¹,², Claudia Hauff², Olivier Van Laere³, and Bart Thomee⁴
¹ International Computer Science Institute, Berkeley, USA
² Delft University of Technology, the Netherlands
³ Blueshift Labs, San Francisco, USA
⁴ Yahoo Labs, USA
jaeyoung@icsi.berkeley.edu, c.hauff@tudelft.nl, oliviervanlaere@gmail.com, bthomee@yahoo-inc.com

ABSTRACT
The sixth edition of the Placing Task at MediaEval introduces two new sub-tasks: (1) locale-based placing, which emphasizes the need to move away from an evaluation purely based on latitude and longitude towards an entity-centered evaluation, and (2) mobility-based placing, which addresses predicting missing locations within a sequence of movements; the latter is a specific real-world use case that so far has received little attention within the research community. Two additional changes over the previous years are the introduction of open source organizer baselines for both sub-tasks shortly after the official data release, and the implementation of a live leaderboard, which allows the participants to gain insights into the effectiveness of their approaches compared to the official baselines and in relation to each other at an early stage, before the actual run submissions are due.

1. INTRODUCTION
The Placing Task challenges participants to develop techniques to automatically annotate photos and videos with their geolocation using their visual content and/or textual metadata. In particular, we wish to see participants extend and improve upon the contributions from previous editions, as well as those of the research community at large, e.g. [7, 10, 3, 1, 5, 8]. Although the Placing Task has indeed been shown to be a "research catalyst" [6] for geoprediction of social multimedia, with each edition of the task it becomes a greater challenge to alter the benchmark sufficiently to allow and motivate participants to make substantial changes to their frameworks and systems instead of small technical ones. This year's introduction of organizer baselines, a leaderboard, and novel sub-tasks was driven by this consideration.

2. DATA
This year's edition of the Placing Task was based on the YFCC100M¹ [9], which to date is the largest social multimedia collection that is publicly and freely available. The full dataset consists of 100 million Flickr² Creative Commons³ licensed photos and videos with associated metadata. Similar to last year's edition [2], we sampled a subset of the YFCC100M for training and testing, see Table 1. The need for two separate datasets arose from the task requirements (described in Section 3). No user appeared both in the training set and in the test set, and to minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos, where no photos/videos were included that were taken by a user less than 10 minutes apart. The rather uncontrolled nature of the data (sampled from longitudinal, large-scale, noisy and biased raw data) confronts participants with additional challenges. To lower the entrance barrier, we precomputed and provided participants with fifteen visual and three aural features commonly used in multimedia analysis for each of the media objects, including SIFT, Gist, color and texture histograms for visual analysis, and MFCC for audio analysis [2].

Table 1: Overview of training and test sets for both sub-tasks.

                          Training                Testing
Sub-task                  #Photos     #Videos     #Photos     #Videos
Locale-based placing      4,672,382   22,767      931,573     18,316
Mobility-based placing    148,349     0           33,026      0

3. TASKS
Locale-based sub-task: In this sub-task, participants were given a hierarchy of places across the world, ranging across neighborhoods, cities, regions, countries and continents. For each photo and video, they were asked to pick the node (i.e. the place) from the hierarchy in which they most confidently believed it had been taken. While the ground truth locations of the photos and videos were associated with the most accurate nodes (i.e. the leaves) in the hierarchy, the participants could express a reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy. If their confidence was sufficiently high, participants could naturally directly estimate the geographic coordinate of the photo/video instead of choosing a node from the hierarchy.

As our place hierarchy we used version 2.0 of the open source GADM database⁴, which contains the spatial boundaries of the world's administrative areas. As the GADM only contains data up to city level, we manually supplemented it with neighborhood data for several cities obtained from the geo-game ClickThatHood⁵. In total, the hierarchy contains 221,458 leaf nodes that are spread across 253 countries. The hierarchy has a maximum depth of 7 and an average depth of 4.33, with each place being a variation of the general hierarchy:

Country→State→Province→County→City→Neighborhood

Due to the use of the hierarchy, only photos and videos taken within any of the GADM boundaries were part of this sub-task, and thus media captured in or above international waters were excluded.

Mobility-based sub-task: In this sub-task, participants were given a sequence of photos taken in a certain city by a specific user, of which not all photos were associated with a geographic coordinate (e.g. the user took some photos when GPS was temporarily unavailable). The participants were asked to predict the locations of those photos with missing coordinates. The nearly 150K training photos of this sub-task were divided into 23,116 sequences, while the approximately 33K test photos were separated into 5,119 sequences. In each test sequence about 30% of the coordinates were missing; these are the ones that needed to be predicted.

4. RUNS
Participants may submit up to five attempts ('runs') for each sub-task. They can make use of the provided metadata and precomputed features, as well as external resources (e.g. gazetteers, dictionaries, Web corpora), depending on the run type. We distinguish between the following five run types:

Run 1: Only provided textual metadata may be used.
Run 2: Only provided visual & aural features may be used.
Run 3: Only provided textual metadata and the provided visual & aural features may be used.
Run 4–5: Everything is allowed, except for crawling the exact items contained in the test set, or any items by a test user taken within 24 hours before the first and after the last timestamp of a photo sequence in the mobility test set.

5. EVALUATION
For the locale-based sub-task, the evaluation metric is based on a hierarchical distance between the ground truth node and the predicted node or coordinate in the place hierarchy. The mobility-based sub-task is evaluated according to the familiar geographic distance-based metric, where for each test item the distance is computed between the ground truth coordinate and the estimated coordinate. One important difference with past editions is that this year we measure geographic distances with Karney's formula [4]; this formula is based on the assumption that the shape of the Earth is an oblate spheroid, which produces more accurate distances than methods such as the great-circle distance that assume the shape of the Earth to be a sphere.

6. BASELINES & LEADERBOARD
As task organizers, we provided two open source baselines to the participants, one for the locale⁶ sub-task and one for the mobility⁷ sub-task. Additionally, we implemented a live leaderboard that allowed participants to submit runs and view their standing relative to others, as evaluated on a representative development set (i.e. part of, but not the complete, test set).

7. REFERENCES
[1] J. Choi, H. Lei, V. Ekambaram, P. Kelm, L. Gottlieb, T. Sikora, K. Ramchandran, and G. Friedland. Human vs machine: establishing a human baseline for multimodal location estimation. In Proceedings of the ACM International Conference on Multimedia, pages 867–876, 2013.
[2] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, et al. The Placing Task: a large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 27–31, 2014.
[3] C. Hauff and G. Houben. Placing images on the world map: a microblog-based enrichment approach. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 691–700, 2012.
[4] C. Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43–55, 2013.
[5] P. Kelm, S. Schmiedeke, J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran, and T. Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. In Proceedings of the ACM International Workshop on Geotagging and Its Applications in Multimedia, pages 7–12, 2013.
[6] M. Larson, P. Kelm, A. Rae, C. Hauff, B. Thomee, M. Trevisiol, J. Choi, O. van Laere, S. Schockaert, G. Jones, P. Serdyukov, V. Murdock, and G. Friedland. The benchmark as a research catalyst: charting the progress of geo-prediction for social multimedia. In Multimodal Location Estimation of Videos and Images. 2014.
[7] A. Rae and P. Kelm. Working Notes for the Placing Task at MediaEval 2012, 2012.
[8] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, pages 484–491, 2009.
[9] B. Thomee, D. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2015. To appear.
[10] M. Trevisiol, H. Jégou, J. Delhumeau, and G. Gravier. Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 1–8, 2013.

¹ https://bit.ly/yfcc100md
² https://www.flickr.com
³ https://www.creativecommons.org
⁴ http://www.gadm.org
⁵ http://www.click-that-hood.com/
⁶ http://bit.ly/1gsrmvx
⁷ http://bit.ly/1K8vUy8

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14–15, 2015, Wurzen, Germany
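To illustrate the distance-based evaluation of Section 5, the following minimal sketch contrasts a spherical great-circle distance with a distance computed on an oblate spheroid. It is not the organizers' evaluation code and it does not implement Karney's algorithm [4]; it substitutes the older Vincenty inverse formula, which rests on the same spheroid assumption but can fail to converge for nearly antipodal points. The function names, the choice of the WGS84 ellipsoid, and the mean Earth radius are all assumptions made for this example.

```python
import math

# Assumed constants: the paper does not name the ellipsoid, WGS84 is used here.
WGS84_A = 6378137.0            # semi-major axis in metres
WGS84_F = 1 / 298.257223563    # flattening
MEAN_RADIUS = 6371008.8        # mean Earth radius in metres (sphere model)


def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres, treating the Earth as a sphere."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * MEAN_RADIUS * math.asin(math.sqrt(min(1.0, h)))


def ellipsoid_distance_m(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Vincenty inverse distance in metres on the WGS84 spheroid.

    Stand-in for Karney's method; raises ValueError if it does not converge.
    """
    a, f = WGS84_A, WGS84_F
    b = (1 - f) * a
    # Reduced latitudes of both points.
    u1 = math.atan((1 - f) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)
    big_l = math.radians(lon2 - lon1)
    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.sqrt((cos_u2 * sin_lam) ** 2
                              + (cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam) ** 2)
        if sin_sigma == 0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # On the equator cos2_alpha is zero and the midpoint term vanishes.
        cos_2sm = cos_sigma - 2 * sin_u1 * sin_u2 / cos2_alpha if cos2_alpha != 0 else 0.0
        c = f / 16 * cos2_alpha * (4 + f * (4 - 3 * cos2_alpha))
        lam_new = big_l + (1 - c) * f * sin_alpha * (
            sigma + c * sin_sigma * (cos_2sm + c * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        raise ValueError("Vincenty iteration did not converge")
    usq = cos2_alpha * (a * a - b * b) / (b * b)
    big_a = 1 + usq / 16384 * (4096 + usq * (-768 + usq * (320 - 175 * usq)))
    big_b = usq / 1024 * (256 + usq * (-128 + usq * (74 - 47 * usq)))
    delta_sigma = big_b * sin_sigma * (cos_2sm + big_b / 4 * (
        cos_sigma * (-1 + 2 * cos_2sm ** 2)
        - big_b / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2) * (-3 + 4 * cos_2sm ** 2)))
    return b * big_a * (sigma - delta_sigma)
```

For one degree of longitude along the equator, the two models in this sketch differ by roughly 124 m, which gives a sense of the gap between the sphere and spheroid assumptions. A production evaluation would more likely rely on an existing implementation of Karney's algorithm, such as the one in GeographicLib, than on a hand-written formula.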