Exploration of Feature Combination in Geo-visual Ranking for Visual Content-based Location Prediction

Xinchao Li1, Michael Riegler1,2, Martha Larson1, Alan Hanjalic1
1 Multimedia Information Retrieval Lab, Delft University of Technology
2 Klagenfurt University
{x.li-3,m.a.larson,a.hanjalic}@tudelft.nl, michael.riegler@edu.uni-klu.ac.at

ABSTRACT
In this paper, we present a visual-content-based approach that predicts where in the world a social image was taken. We employ a ranking method that assigns a query photo the geo-location of its most likely geo-visual neighbor in the social image collection. The experiments carried out on the MediaEval Placing Task 2013 data set support the conclusion that exploring a candidate photo's geo-visual neighbors and combining local and global image features can improve the prediction accuracy of a visual-content-based geo-location estimation system.

1. INTRODUCTION
The research question of the Placing Task is how to estimate the geo-location of an image, given its image attributes and all available metadata [1].

A variety of information sources have been exploited for predicting geo-location. User-contributed text annotations have been used as the basis of a large range of successful geo-coordinate prediction algorithms [3]. Such approaches exploit the natural link between text annotations and location (e.g., tags often include place names and other location-specific vocabulary) in order to predict at which location around the globe a photo was taken. The drawback of textual annotations (i.e., metadata) is that they need to be manually created by the user, a time-consuming task. As a result, a large percentage of images are not associated with any tags and cannot be geo-located with text-based approaches (13.4% of the test photos of the 2013 data set do not contain any tags). As an appealing alternative to text-based approaches, in this paper we present a visual-content-based approach for geo-coordinate prediction.

2. SYSTEM DESCRIPTION
Our approach consists of four steps, as depicted in the system overview, Fig. 1.

Figure 1: Geo-visual ranking system overview

In the first step, Local Feature-based Image Retrieval, we create a set of candidate photos for a given query q by retrieving from the database all visually similar images based on local features, up to a visual similarity threshold k. In the second step, Global Feature-based Image Selection, we rank all the candidate photos from step one by their visual similarity to the query based on global features, and select the top t ranked photos as the final candidate set E_vis@k&t. In the third step, Geo-Visual Ranking, as in [2], we perform geo-visual expansion of each candidate photo a to create a geo-visual expansion set E_a based on local features. The candidate photos are then ranked by P(q|a), which reflects the closeness of their similarity to the query photo q. Formally, P(q|a) is expressed as

    P(q|a) ∝ Σ_{e ∈ E_a} Sim_vis(e, q)    (1)

where E_a is the set of photos geographically near photo a that have high visual similarity to query q. In the final step, Location Propagation, the geo-location of the top-ranked photo is propagated to the query photo.
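For illustration, the scoring in Eq. (1) and the Location Propagation step can be sketched as follows. This is a simplified sketch, not the implementation used in our experiments; the similarity function sim_vis, the photo-coordinate map, and the 1km expansion radius are placeholder assumptions.

    # Minimal sketch (not the system's actual implementation) of the
    # geo-visual ranking score in Eq. (1) and of Location Propagation.
    # sim_vis(a, b) is assumed to return a precomputed local-feature
    # similarity between two photo ids; photos maps ids to (lat, lon).
    from math import radians, sin, cos, asin, sqrt

    def geo_distance_km(p1, p2):
        # Great-circle (haversine) distance in km between (lat, lon) pairs.
        lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
        h = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(h))

    def geo_visual_score(query, candidate, photos, sim_vis, radius_km=1.0):
        # Sum the query's visual similarity over the candidate's geo-visual
        # expansion set E_a (photos taken near candidate a).
        expansion = [p for p in photos
                     if geo_distance_km(photos[p], photos[candidate]) <= radius_km]
        return sum(sim_vis(e, query) for e in expansion)

    def propagate_location(query, candidates, photos, sim_vis):
        # Location Propagation: the query inherits the geo-location of the
        # top-ranked candidate photo.
        best = max(candidates,
                   key=lambda a: geo_visual_score(query, a, photos, sim_vis))
        return photos[best]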
3. EXPERIMENTAL FRAMEWORK

3.1 Dataset
The proposed system is evaluated on a set of 8,801,050 geo-tagged Flickr (http://www.flickr.com/) photos released by the MediaEval 2013 Placing Task [1]. Since the release includes only the metadata and not the images themselves, we re-crawled the images using the links in the metadata. Because some photos were removed after the dataset was collected, the final re-crawled collection contains 8,799,260 photos: 8,537,368 for training and 261,892 for test.

3.2 Calculating visual similarity
Our approach to geo-location prediction exploits visual similarity between photos. To calculate visual similarity based on local features, we choose SURF and the bag-of-visual-words scheme to build the search engine. To calculate visual similarity based on global features, we use the Joint Composite Descriptor (JCD), which encodes the color, edge directivity, and texture histogram of the image.

4. EXPERIMENTAL RESULTS

4.1 General performance evaluation
RUN 1: Baseline; the system applies only Local Feature-based Image Retrieval and uses the geo-location of the visually most similar photo (by local features) as the estimated location.
RUN 2: Baseline + Geo-Visual Ranking
RUN 3: Baseline + Global Feature-based Image Selection
RUN 4: Baseline + Global Feature-based Image Selection + Geo-Visual Ranking

The run results are presented in Table 1. At the 1km evaluation radius, compared with the baseline method, Run1, Run2 achieves about 18% improvement and Run3 achieves about 17.7% improvement. The best performing run, Run4, achieves about 37.4% improvement, which is more than the sum of the previous two.

Table 1: Run results (261,892 test photos): percentage of test photos located within {1, 10, 100, 500, 1000}km of the ground truth.

           <1km    <10km   <100km   <500km   <1000km
    Run1   2.0%    2.6%    3.5%     7.9%     14.6%
    Run2   2.4%    3.1%    4.2%     8.5%     14.8%
    Run3   2.4%    3.1%    4.0%     8.4%     15.3%
    Run4   2.8%    3.7%    4.7%     9.2%     15.9%

4.2 Experimental analysis
Because the query photos come from a social image collection, certain properties of the collection can affect the prediction accuracy of visual-content-based approaches. For example, queries depicting a popular landmark and queries depicting an individual user's car may yield different prediction performance. For the purpose of our investigation, we define the geo-visual redundancy of a given query photo as the number of photos that are taken within a 1km radius of the query photo and are also ranked in the top 10,000 of the query's rank list from the local feature-based image retrieval system.

Figure 2: Distribution of queries over different levels of geo-visual redundancy.

The distribution of the query photos over the geo-visual redundancy ranges is illustrated in Fig. 2. Over half of the queries do not have visually similar photos within their geo-neighborhood, which suggests that in these cases there is no other photo in the dataset that depicts the same scene or object at the query location. This observation demonstrates how challenging it is to predict the geo-location of a social image purely from its visual content.
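For concreteness, the geo-visual redundancy defined above could be computed along the following lines. This is an illustrative sketch with placeholder names, not the code used in our experiments; it reuses the geo_distance_km helper from the earlier sketch, and retrieve(query, k) is assumed to return the top-k photo ids from the local feature-based retrieval system.

    # Illustrative sketch of the geo-visual redundancy measure: count photos
    # that are both in the query's top-10,000 rank list and taken within 1km
    # of the query's ground-truth location.
    def geo_visual_redundancy(query, query_location, retrieve, photos,
                              top_k=10000, radius_km=1.0):
        ranked = retrieve(query, top_k)
        return sum(1 for p in ranked
                   if geo_distance_km(photos[p], query_location) <= radius_km)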
Figure 3: Prediction accuracy within 1km for queries with different levels of geo-visual redundancy.

Fig. 3 breaks down the geo-location prediction performance over different levels of geo-visual redundancy. Comparing Run3 with Run1, we see that the Global Feature-based Image Selection step can improve the local feature-based system for queries at all geo-visual redundancy levels. Comparing Run2 with Run1 and Run4 with Run3, we see that the Geo-Visual Ranking step can boost the performance for queries with high geo-visual redundancy. Comparing Run4 with Run2, we see that Global Feature-based Image Selection can also benefit the system with Geo-Visual Ranking, especially for queries with a medium geo-visual redundancy level.

5. CONCLUSION
We have presented a ranking approach addressing the challenging task of predicting geo-location using only the visual content of images. The main observation is that local and global image features can compensate for each other and, together with geo-visual ranking, improve the prediction accuracy of a visual-content-based geo-location estimation system. Future work will include an investigation of the optimal role of local and global features within the geo-visual ranking scheme for visual-content-based geo-location estimation.

6. REFERENCES
[1] C. Hauff, B. Thomee, and M. Trevisiol. Working Notes for the Placing Task at MediaEval 2013, 2013.
[2] X. Li, M. Larson, and A. Hanjalic. Geo-visual ranking for location prediction of social images. In Proc. ICMR '13, 2013.
[3] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proc. SIGIR '09, 2009.