Imcube @ MediaEval 2015 Placing Task: A Hierarchical Approach for Geo-referencing Large-Scale Datasets

Pascal Kelm, Sebastian Schmiedeke, and Lutz Goldmann
Imcube Labs GmbH, Berlin, Germany
{kelm, schmiedeke, goldmann}@imcube.de

ABSTRACT
This paper describes Imcube's geo-referencing approach, experiments, and results for the MediaEval 2015 Placing Task benchmark. The task requires participants to develop techniques that automatically annotate Flickr photos and videos with their geolocation (latitude and longitude) in two individual sub-tasks. A hierarchical approach combining textual, visual, and optional routing information is employed. The results show that for 24% of the images (locale-based sub-task) and for 96% of the images (mobile-based sub-task) the error of the estimated location is below city level (10 km).

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. INTRODUCTION
The MediaEval Placing Task 2015 [1] requires participants to build systems that automatically estimate the location of Flickr photos and videos using any or all of metadata, visual/audio content, and user information.

This year the task introduces two new sub-tasks. The locale-based sub-task addresses the prediction of missing locations for individual images in an entity-centred way by choosing a location from a given ground-truth hierarchy. The mobile-based sub-task addresses the prediction of missing locations within a sequence of photos shot by a travelling photographer.

Similar to the Placing Task 2014, the training set (4,672,382 photos & 22,767 videos) and the test sets (931,573 photos & 18,316 videos) were sampled from the YFCC100M [6] data set. One important difference to past editions is that this year the distances between the predicted and the ground-truth geographic coordinates are evaluated using Karney's formula [2], which is based on the assumption that the shape of the Earth is an oblate spheroid.

In this paper, we present an approach that combines different textual and visual descriptors by applying a hierarchical scheme to merge information obtained from several ranked lists.

2. SYSTEM DESCRIPTION
This section describes the different methods developed to solve the challenges of the locale-based and mobile-based sub-tasks.

2.1 Locale-based Sub-task
The proposed approach is composed of five steps: (i) hierarchical clustering of the provided training set by latitude and longitude, (ii) visual and textual feature extraction, (iii) generation of ranked lists, (iv) re-ranking, and (v) estimation of the location for each test item.

The provided hierarchy contains 221,458 leaf nodes (locations) that are spread across 253 countries. Below the second level (Country → State) we segment the states into 360 × 180 regions along the meridians and parallels. We additionally apply a finer grid whose segments have half the spatial dimensions, which increases the accuracy while keeping the computational cost low. Each geo-referenced training image is assigned to its corresponding grid cell at the lowest level [3]. For each layer of the hierarchy a ranking model is used to iteratively assign a test image to the most likely spatial segment.

Due to the large size of the dataset and the limited processing time, we did not apply a hierarchical language model approach with multiple modalities [3] but adopted a textual re-ranking model. The vocabulary of the spatial locations consists of stemmed¹ words from the tags, titles, and descriptions. The text similarity function used is BM25 [4] as implemented by Lucene². The best results for the textual similarity computations were achieved with a training set composed of both image and video metadata, regardless of the kind of test query.
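To make the grid segmentation described above concrete, the following Python sketch maps coordinates to grid cells and groups the training images accordingly. It is an illustration only: the paper does not specify an indexing convention, so the row/column scheme and the dictionary-based grouping are assumptions.

```python
from collections import defaultdict

def grid_cell(lat, lon, cell_deg=1.0):
    """Map a (lat, lon) pair to a (row, col) grid cell.

    cell_deg=1.0 yields the 360 x 180 grid along the meridians and
    parallels; cell_deg=0.5 yields the finer grid whose segments have
    half the spatial dimensions. The indexing convention is assumed.
    """
    n_cols = int(round(360.0 / cell_deg))
    n_rows = int(round(180.0 / cell_deg))
    col = min(int((lon + 180.0) / cell_deg), n_cols - 1)
    row = min(int((lat + 90.0) / cell_deg), n_rows - 1)
    return row, col

def assign_training_images(images, cell_deg=0.5):
    """Group geo-referenced training images by grid cell at the
    lowest hierarchy level (images: dicts with 'lat' and 'lon')."""
    cells = defaultdict(list)
    for img in images:
        cells[grid_cell(img['lat'], img['lon'], cell_deg)].append(img)
    return cells
```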
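For the textual ranking above the paper relies on Lucene's BM25 implementation. As a rough, library-free illustration of the scoring function itself, the following pure-Python sketch computes Okapi BM25 [4] over per-location term lists; the parameters k1 and b are common defaults, not values tuned or reported in the paper.

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Rank location documents for a query with Okapi BM25 [4].

    docs: dict mapping a location id to its list of stemmed terms
    drawn from tags, titles, and descriptions (layout assumed).
    """
    N = len(docs)
    avgdl = sum(len(terms) for terms in docs.values()) / N
    df = Counter()                        # document frequency per term
    for terms in docs.values():
        df.update(set(terms))
    scores = {}
    for loc, terms in docs.items():
        tf, dl, score = Counter(terms), len(terms), 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1.0 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores[loc] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])
```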
The visual similarity relies on a wide spectrum of visual features that describe the color and texture characteristics of the video key frames and photos. These image descriptions are pooled for each leaf node on the different hierarchy levels using the mean and the median value of each descriptor. For each descriptor, a kd-tree containing all appropriate segments is built in each leaf node. This procedure speeds up the subsequent search because only a portion of the data needs to be considered during the nearest-neighbour search.

Starting at the top of the hierarchy, the nodes of the current level are ranked according to their distance to the test image. The overall distance is obtained by fusing the textual and the visual distances using a weighted summation; the weights differ between the two fusion experiments described in the results section. The node with the lowest distance then becomes the most likely location at the given level of granularity. By iteratively traversing the hierarchy, the method determines the leaf node with the highest similarity to the test image and returns the corresponding geolocation.
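The kd-tree search described above can be sketched with SciPy, which the paper does not name; the library choice, toy data, and descriptor layout are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Pooled mean descriptors for the child segments of one hierarchy
# node, one row per segment (toy data standing in for the pooled
# color/texture features).
segment_ids = ["cell_17", "cell_18", "cell_42"]
descriptors = np.random.rand(3, 64)       # 3 segments, 64-dim features

tree = cKDTree(descriptors)               # built once per descriptor/node

def visually_closest(query_descriptor, k=2):
    """Return the k child segments whose pooled descriptors are
    closest (Euclidean) to the test image's descriptor."""
    dist, idx = tree.query(query_descriptor, k=k)
    return [(segment_ids[int(i)], float(d))
            for i, d in zip(np.atleast_1d(idx), np.atleast_1d(dist))]

print(visually_closest(np.random.rand(64)))
```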
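The top-down assignment just described amounts to a short loop. The sketch below assumes a minimal node interface and placeholder fusion weights of 0.5 each; the actual weights differ between the two fusion runs and are not reported in the paper.

```python
from collections import namedtuple

# Minimal node type for illustration: children is a list of Nodes
# (empty for leaves); lat/lon is the node's representative location.
Node = namedtuple("Node", ["children", "lat", "lon"])

def fuse(d_text, d_visual, w_text=0.5, w_visual=0.5):
    # Weighted summation of the modality distances. The real weights
    # differ between the two fusion runs; 0.5/0.5 is a placeholder.
    return w_text * d_text + w_visual * d_visual

def locate(root, text_dist, visual_dist):
    """Descend the hierarchy, at each level picking the child with
    the lowest fused distance, until a leaf yields the geolocation.

    text_dist / visual_dist: callables mapping a node to the test
    image's distance under the respective modality.
    """
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda n: fuse(text_dist(n), visual_dist(n)))
    return node.lat, node.lon
```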
2.2 Mobile-based Sub-task
For this task we pursue a similar approach to the one described in section 2.1, but without the hierarchical layer model and with additional routing information.

We use OpenStreetMap³ to find the shortest route between two photos that have associated geographic coordinates. Tracks shorter than 2 km are routed by pedestrian navigation; longer tracks are routed by car navigation. In the routing run, the location of each remaining test image is predicted by linear interpolation in travel time along these tracks. For test images that do not have two chronologically neighbouring geo-tagged photos, the neighbouring route segment is extrapolated, taking the time difference into account. The other runs additionally use textual and visual features to determine the most similar image along the track.

The visual similarity is determined as described in [5]. Densely sampled local features (pairwise averaged DCT coefficients) are represented as a histogram quantised by vector quantisation (a clusterless bag-of-visual-words approach) [5]. As the similarity metric between the training images and the image to be geo-tagged, the histogram intersection of their BoW representations is applied. The two visual runs differ in the assignment of coordinates: the visual run assigns the coordinates of the visually most similar image from the training data, whereas the weighted visual run calculates the coordinates as the centroid of all training images weighted by their visual similarity.
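The routing run's travel-time interpolation can be sketched as follows, assuming the route between two geo-tagged anchor photos has already been resolved into a polyline with cumulative travel times (e.g. by an OpenStreetMap routing service). The waypoint layout is illustrative, and the clamping at the track ends is a simplification of the extrapolation described above.

```python
def interpolate_on_route(route, t_photo):
    """Predict a photo's location by linear interpolation in travel
    time along a routed track.

    route: list of (t, lat, lon) waypoints with increasing cumulative
    travel time; the first and last entries are the two geo-tagged
    anchor photos. Clamping at the track ends is a simplification:
    the paper extrapolates the neighbouring segment for photos that
    lack two chronological neighbours.
    """
    t = max(route[0][0], min(route[-1][0], t_photo))
    for (ta, lat_a, lon_a), (tb, lat_b, lon_b) in zip(route, route[1:]):
        if ta <= t <= tb:
            a = 0.0 if tb == ta else (t - ta) / (tb - ta)
            return lat_a + a * (lat_b - lat_a), lon_a + a * (lon_b - lon_a)
    return route[-1][1], route[-1][2]

# A photo shot 450 s after the first anchor (toy track through Berlin).
route = [(0, 52.520, 13.400), (300, 52.530, 13.420), (600, 52.550, 13.450)]
lat, lon = interpolate_on_route(route, 450)
```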
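To make the two visual assignment strategies above concrete, here is a minimal sketch of histogram intersection and the two coordinate assignments. Taking a plain similarity-weighted mean of latitude/longitude is an assumption that ignores wrap-around at the antimeridian.

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Similarity of two bag-of-visual-words histograms.
    return float(np.minimum(h1, h2).sum())

def visual_run(query_h, train_h, train_coords):
    """Assign the coordinates of the visually most similar image."""
    sims = np.array([histogram_intersection(query_h, h) for h in train_h])
    return tuple(train_coords[int(np.argmax(sims))])

def weighted_visual_run(query_h, train_h, train_coords):
    """Assign the centroid of all training images weighted by their
    visual similarity. The plain weighted mean of lat/lon ignores
    wrap-around at the +/-180 degree meridian (a simplification)."""
    sims = np.array([histogram_intersection(query_h, h) for h in train_h])
    w = sims / sims.sum()
    return tuple(w @ np.asarray(train_coords, dtype=float))
```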
The textual run uses the same textual similarity as the locale-based sub-task, but the training images are restricted to those located within a corridor of 0.001 degrees along the estimated routes.

¹ http://tartarus.org/martin/PorterStemmer/index.html
² http://lucene.apache.org/core/
³ http://www.openstreetmap.org/

Table 1: Results of the locale-based sub-task (number and percentage of test items whose location error is below each margin).

               Textual             Visual              Fusion1             Fusion2
Distance       #Items  Percentage  #Items  Percentage  #Items  Percentage  #Items  Percentage
0.001 km          678      0.07 %       0      0.00 %     277      0.03 %    1669      0.18 %
0.01 km          1906      0.20 %       3      0.00 %    1030      0.11 %    4549      0.48 %
0.1 km          17437      1.84 %      10      0.00 %   11372      1.20 %   31980      3.37 %
1 km            81274      8.56 %     150      0.02 %   53172      5.60 %  117491     12.37 %
10 km          200103     21.07 %    1676      0.18 %  133321     14.04 %  224080     23.59 %
100 km         352851     37.15 %    5121      0.54 %  275985     29.05 %  353357     37.20 %
1000 km        658519     69.33 %   52002      5.47 %  634327     66.78 %  658519     69.33 %
10000 km       927620     97.66 %  708993     74.64 %  927121     97.60 %  927620     97.66 %

Table 2: Results of the mobile-based sub-task (number and percentage of test items whose location error is below each margin).

               Routing             Visual              Weighted Visual     Textual
Distance       #Items  Percentage  #Items  Percentage  #Items  Percentage  #Items  Percentage
0.001 km            2      0.02 %       4      0.04 %       3      0.03 %      20      0.21 %
0.01 km            81      0.84 %     138      1.43 %     128      1.32 %     244      2.52 %
0.1 km           1593     16.47 %    1949     20.14 %    1957     20.23 %    1952     20.18 %
1 km             6501     67.19 %    7014     72.50 %    7026     72.62 %    6959     71.93 %
10 km            9171     94.79 %    9274     95.86 %    9276     95.88 %    9280     95.92 %
100 km           9659     99.83 %    9671     99.96 %    9670     99.95 %    9670     99.95 %
1000 km          9674     99.99 %    9675    100.00 %    9675    100.00 %    9675    100.00 %
10000 km         9675    100.00 %    9675    100.00 %    9675    100.00 %    9675    100.00 %

3. RESULTS

3.1 Locale-based Sub-task
Table 1 shows the accuracies at selected error margins for the different textual and visual runs. Based on the experience from previous years we expected the textual run to perform better than the visual run because of the visual ambiguity at the coarser levels. The results clearly show that the visual-only approach has low accuracy at all error margins compared to the textual-only approach.

For combining the textual and visual information we tested two different fusion models. Our first fusion model (Fusion1) combines the estimations of the textual and the visual models with equal weights on every hierarchy level. The second fusion model (Fusion2) combines these estimations only on the finest three hierarchy levels; on the coarser levels, only the estimation of the textual model is used. This combination yields more accurate results, since visual features cannot resolve ambiguities at large scale (i.e., most cityscapes look similar). The results show that restricting the fusion of visual and textual information to the finer levels (Fusion2) improves the performance for error margins between 10 m and 100 km, whereas fusing on all levels (Fusion1) performs worse than the textual run alone.

3.2 Mobile-based Sub-task
Table 2 shows the results obtained with our four runs (routing, visual, weighted visual, and textual) described in section 2.2. Generally, the use of textual and visual features improves the localization performance compared to only interpolating along the routes. Since most of the routes are quite short, the improvements are mainly made at the smaller error margins. For the error margin below 1 m the textual approach outperforms all other approaches by up to a factor of ten. The different runs reach a similar effectiveness as the error margin increases. The weighted visual similarity approach (Weighted Visual) predicts slightly more accurate locations within the range of 100 m to 1 km. For error margins above 10 km all runs produce similar results. A closer look at the location errors of the individual images shows that the textual and the visual approaches perform very differently, which suggests that a suitable fusion approach may further improve the results.

4. CONCLUSION
The results of the locale-based sub-task show that the best performance is achieved with a multimodal fusion approach that uses textual information on the coarser hierarchy levels and a combination of visual and textual information on the finer ones. The results of the mobile-based sub-task show that using visual and textual information in addition to the routing information improves the location estimation. The low correlation between the localization errors of the different approaches suggests that more advanced fusion approaches will lead to better results. Another interesting direction for improving the accuracy of the visual approach in both sub-tasks is the use of local features to distinguish landmarks and points of interest.

5. REFERENCES
[1] J. Choi, C. Hauff, O. V. Laere, and B. Thomee. The Placing Task at MediaEval 2015. In MediaEval 2015 Workshop, 2015.
[2] C. Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43–55, 2013.
[3] P. Kelm, S. Schmiedeke, J. Choi, G. Friedland, V. N. Ekambaram, K. Ramchandran, and T. Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. In Proceedings of the 2nd ACM International Workshop on Geotagging and Its Applications in Multimedia (GeoMM '13), pages 7–12, New York, NY, USA, 2013. ACM.
[4] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3), pages 109–126, 1995.
[5] S. Schmiedeke, P. Kelm, and T. Sikora. DCT-based features for categorisation of social media in compressed domain. In Proceedings of the IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pages 295–300, 2013.
[6] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015.