Multimodal image geocoding: the 2013 RECOD’s approach

Lin Tzy Li1,2, Jurandy Almeida1,3, Otávio A. B. Penatti1, Rodrigo T. Calumby1,4, Daniel C. G. Pedronette1,5, Marcos A. Gonçalves6, and Ricardo da S. Torres1

1 RECOD Lab, Institute of Computing, University of Campinas (UNICAMP), Campinas, SP – Brazil, 13083-852
2 Telecommunications Res. & Dev. Center, CPqD Foundation, Campinas, SP – Brazil, 13086-902
3 Institute of Science and Technology, Federal University of Sao Paulo (UNIFESP), Sao Jose dos Campos, SP – Brazil, 12231-280
4 Dept. of Exact Sciences, University of Feira de Santana (UEFS), Feira de Santana, BA – Brazil, 44036-900
5 Dept. of Stat., Applied Math. and Computing, Universidade Estadual Paulista (UNESP), Rio Claro, SP – Brazil, 13506-900
6 Dept. of Computer Science, Federal University of Minas Gerais (UFMG), Belo Horizonte, MG – Brazil, 31270-010

{lintzyli, jurandy, penatti, rtripodi, rtorres}@ic.unicamp.br, daniel@rc.unesp.br, mgoncalv@dcc.ufmg.br

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

This work describes the approach used by the RECOD team in the MediaEval Placing Task of 2013, in which we were required to develop an automatic scheme to assign geographical locations to images. Our approach is multimodal, considering textual and visual descriptors, which are combined by a rank aggregation strategy. We estimate the location of test images based on the coordinates of top-ranked images in the list of combined results.

1. INTRODUCTION

Geocoding multimedia material has gained great attention in recent years, given the importance of providing richer services for users, such as placing information on maps. Image geocoding is the objective of the Placing Task in 2013, i.e., it requires participants to assign geographical locations to images. Details about the Placing Task, its dataset, and the evaluation protocol can be found in [1].

In this paper, we present our multimodal approach, which combines different textual and visual descriptors uniformly. We combine them using a rank aggregation strategy, previously introduced in [4].

2. PROPOSED APPROACH

We handled the task of automatically assigning a geographical location to images using nearest neighbor searches on aggregated ranked lists, which combine textual and visual features. The strengths of our approach are its simplicity and its power to combine multiple description modalities.

For evaluation purposes in the training phase, we selected a validation set of 5,000 images from the development set of around 8.5 million images. First, each photo from the development set was assigned to a fixed cell of 1-by-1 degree based on its ground truth latitude and longitude. Then, the resulting grid was summarized by the total number of photos (density) in each cell relative to the dataset size. Finally, the evaluation images (5,000 photos) were randomly picked from each cell, taking its density into account.

2.1 Features

Textual

From the textual metadata, we used only the photo tags to compute similarities between the images. The tags were stemmed and stopwords were removed. The text similarity functions used were BM25 and TF-IDF, as implemented by the Lucene API.

Visual

Given the large dataset, we had to carefully select the descriptors to be used. Initially, we evaluated some of the descriptors provided with the dataset, such as the color and edge directivity descriptor (CEDD), scalable color (SCD), and Gabor filter descriptors. Using the validation set, we noticed that the best results were achieved by CEDD. Although SCD has shown the best results in [2], it did not perform well on our validation set for our geocoding approach.

In addition to CEDD, we used BIC (border/interior pixel classification). This descriptor was chosen due to its good results in large scale experiments [5]. For this, we downloaded the whole photo dataset, resized the images to have at most 100 thousand pixels, as suggested by [6] for large scale experiments, and extracted the 128-dimensional BIC feature vector of each image. The Manhattan distance (L1) was used for both BIC and CEDD.

2.2 Rank aggregation

As last year, we used a rank aggregation strategy to combine different descriptors [3]. This year, due to the size of the development set, we created a ranked list limited to the top 1,000 most similar photos for each test image.

We used an aggregation function similar to $sim_a$ proposed in [3], except that the numerator is $m$ instead of 2. When the intersection of the top-1000 lists computed by different features is small, the size of the final aggregated list tends to $m \times 1000$, where $m$ is the number of features combined. We select the top-1000 images that present the highest aggregated score as the output of the rank aggregation step.

2.3 Geocoding

For geocoding the test images, we used a nearest neighbor approach. We used the development set (∼8.5 million images) as geo-profiles, and each test image was compared to the whole development set. For comparing the images, we used each type of feature independently (textual or visual). For a given test image, the ranked list of each feature is produced. All the lists are then combined by our rank aggregation strategy and the final ranked list is generated. The lat/long of the first image (most similar) in this final list is assigned to the test image.

3. OUR SUBMISSIONS & RESULTS

Submitted runs

Our submissions for this year are:

run1: combines 2 textual descriptors: BM25 + TF-IDF;
run2: combines 2 visual descriptors: BIC + CEDD;
run3: one visual descriptor: BIC;
run4: combines 2 textual and 2 visual descriptors: BM25 + TF-IDF + BIC + CEDD;
run5: combines 4 textual descriptors: BM25 + TF-IDF¹.

¹ Non-English tags were translated to English using the Google Translate service and combined with the original tags.

Runs 1 and 5 used only textual features. Thus, for test images without tags, there was no way to apply our similarity ranked list approach. As post-processing, we randomly selected an item from the development set and transferred its latitude and longitude to the test image.

3.1 Results

Besides the organizers’ standard evaluation metric, we also applied the WAS score we proposed in [4]. This evaluation metric gives an overview of a method’s performance expressed by a score in [0, 1], 0 being very bad and 1 indicating a perfect estimate, with a higher weight assigned to more precise results. The WAS takes into account every single result of the whole test set to indicate and summarize the level of precision of an evaluated method as a whole.

Let $d(i)$ be the geographic distance between the predicted and the ground truth location of the image $i$. The proposed score for the result of a given test image $i$ is defined as:

$$\mathrm{score}(i) = 1 - \frac{\log(1 + d(i))}{\log(1 + R_{max})},$$

where $R_{max}$ is the maximum distance between any two points on the Earth’s surface (half of Earth’s circumference at the Equator is 20,027.5 km).

Let $D$ be a test dataset with $n$ images whose locations need to be predicted. The overall score for the predictions of a method $m$ is defined as:

$$WAS(m) = \frac{\sum_{i=1}^{n} \mathrm{score}(i)}{n}.$$

Table 1: Validation set results.

Precision          Run 1      Run 2      Run 3      Run 4      Run 5
1km               64.56%     16.86%     15.32%     68.82%     64.62%
10km              73.64%     17.68%     16.10%     75.90%     73.60%
100km             77.58%     18.64%     17.04%     78.94%     77.58%
500km             80.20%     22.86%     13.40%     81.10%     80.22%
1000km            82.18%     28.32%     20.12%     82.74%     82.32%
WAS score         0.7866     0.3053     0.2889     0.8019     0.7866
Distance distribution (km)
1st Quartile        0.00     698.40     885.30       0.00       0.00
Median              0.03   5,499.40   5,835.80       0.00       0.04

Table 2: Test results using test3 set (53,000 items).

Precision          Run 1      Run 2      Run 3      Run 4      Run 5
1km               20.14%      0.37%      0.28%     20.11%     18.82%
10km              37.60%      0.80%      0.67%     37.10%     35.93%
100km             47.66%      1.69%      1.51%     46.97%     45.97%
500km             56.62%      6.73%      6.25%     55.83%     55.74%
1000km            63.17%     14.32%     13.78%     62.26%     62.43%
WAS score         0.5240     0.1653     0.1623     0.5190     0.5128
Distance distribution (km)
1st Quartile        1.73   1,869.00   1,962.00       1.76       2.05
Median            168.22   6,632.00   6,729.00     196.79     225.67

As we can observe in Table 2, the test runs that used textual information (runs 1, 4, and 5) yielded the best results, while those based only on visual descriptors (runs 2 and 3) presented low accuracy. A possible reason is the semantic gap, as there might be many different places with similar visual appearance, especially in a large dataset like the one used for training. Another potential issue was the large number of ties in the first positions of the ranked lists of visual descriptors. Given our 1-nn geocoding approach, this probably degraded our results. However, we can see that by combining BIC+CEDD (run 2) we improve on the results of BIC alone (run 3). The combination of textual and visual descriptors (run 4) was slightly worse than the textual descriptors alone (run 1). One possible reason is the large difference between textual and visual results.

Observe that our results for the test set (Table 2) were quite different from those for our validation set (Table 1), mainly for the visual features. While BIC achieved less than 1% in the 1km radius on the test3 set, it reached 15.32% on the validation set. Because of this, in the validation set, the fusion (run 4) results improved over run 1. The huge difference between validation and test results might be due to a property of the test set not considered when building the validation set: the users who contributed the photos in the training set are different from those who contributed the photos in the test set.

Regarding the distribution of test results, for the visual descriptors (runs 2 and 3), the 1st Quartile shows that 25% of the items were geocoded within roughly 1,900km of the correct location (1,869km for run 2 and 1,962km for run 3). On the other hand, for the textual descriptors and their combinations (runs 1, 4, and 5), 25% of the items are very close to their correct locations (less than 3km).

4. CONCLUSIONS

Our best results were observed for the methods based only on textual description. For them, we could geocode around 20% of the testing set (test3) within a 1km radius. Considering visual descriptors, the main challenge this year was the large scale dataset, which poses time and space constraints on the descriptors to be used. Our rank aggregation strategy, for the test set, was only effective for combining textual descriptors. Combining textual and visual descriptors did not improve the results. As future work, we would like to evaluate a more elaborate geocoding approach, similar to the scheme used to create our validation set, for example.

Acknowledgments

We thank the support of FAPESP (2011/11171-5, 2009/10554-8), CNPq (306580/2012-8, 484254/2012-0), CAPES, FAPEMIG, Samsung, ACM SIGIR, and MediaEval organizers.

5. REFERENCES

[1] C. Hauff, B. Thomee, and M. Trevisiol. Working Notes for the Placing Task at MediaEval. In MediaEval 2013 Workshop, volume 1043, October 18-19 2013.
[2] P. Kelm, S. Schmiedeke, and T. Sikora. Multimodal geo-tagging in social media websites using hierarchical spatial segmentation. In International Workshop on Location-Based Social Networks, pages 32–39, 2012.
[3] L. T. Li, J. Almeida, D. C. G. Pedronette, O. A. B. Penatti, and R. da S. Torres. A multimodal approach for video geocoding. In Working Notes Proc. MediaEval Workshop, volume 927, 2012.
[4] L. T. Li, D. C. G. Pedronette, J. Almeida, O. A. Penatti, R. T. Calumby, and R. d. S. Torres. A rank aggregation framework for video multimodal geocoding. Mult. Tools and App., pages 1–37, 2013.
[5] O. A. B. Penatti, E. Valle, and R. da S. Torres. Comparative study of global color and texture descriptors for web image retrieval. J. Vis. Comm. and Image Repr., 23(2):359–380, 2012.
[6] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In CVPR, pages 3482–3489, 2012.
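
Illustrative sketch for the validation set construction of Section 2. The text only states that photos are assigned to 1-by-1 degree cells and that validation images are picked at random from each cell according to its density. The code below is one possible reading of that description, not the procedure actually used: the proportional per-cell quota, the function names `cell_of` and `sample_validation`, and the fixed random seed are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def cell_of(lat, lon):
    """Map a coordinate to its 1-by-1 degree grid cell (floor of lat and lon)."""
    return (math.floor(lat), math.floor(lon))


def sample_validation(photos, k, seed=0):
    """Pick about k photos, giving each cell a share proportional to its density.

    photos: list of (photo_id, lat, lon) tuples.
    Because each quota is rounded, the result may hold slightly more or
    fewer than k photos.
    """
    rng = random.Random(seed)

    # Step 1: assign every photo to its grid cell.
    cells = defaultdict(list)
    for photo_id, lat, lon in photos:
        cells[cell_of(lat, lon)].append(photo_id)

    # Steps 2 and 3: density = cell size / dataset size; sample per cell.
    total = len(photos)
    picked = []
    for cell in sorted(cells):
        members = cells[cell]
        quota = round(k * len(members) / total)  # assumed proportional allocation
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked
```

With 80 photos in one cell and 20 in another, `sample_validation(photos, 10)` draws 8 photos from the dense cell and 2 from the sparse one, so the sample mirrors the geographic skew of the collection.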
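
Illustrative sketch for the visual ranking and 1-nn geocoding steps of Sections 2.1 to 2.3, restricted to a single descriptor. It ranks a collection by Manhattan (L1) distance, keeps the top-k entries, and transfers the coordinates of the first one. The rank aggregation step is deliberately omitted, since its exact function is defined in [3] and not reproduced here. The function names, the in-memory tuple layout, and the tie-breaking by photo id are assumptions made for illustration, not the implementation used for the submitted runs.

```python
import heapq


def manhattan(u, v):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def ranked_list(query_vec, collection, k=1000):
    """Return the k nearest items as (distance, photo_id) pairs, closest first.

    collection: iterable of (photo_id, feature_vector, (lat, lon)) tuples.
    Ties in distance are broken by photo_id, which is an arbitrary choice.
    """
    scored = ((manhattan(query_vec, vec), photo_id) for photo_id, vec, _ in collection)
    return heapq.nsmallest(k, scored)


def geocode_1nn(query_vec, collection):
    """Assign the coordinates of the single most similar photo (1-nn)."""
    collection = list(collection)
    coords = {photo_id: latlon for photo_id, _, latlon in collection}
    top = ranked_list(query_vec, collection, k=1)
    return coords[top[0][1]] if top else None
```

The arbitrary tie-breaking makes the issue raised in Section 3.1 visible: when many photos share the smallest distance, the location transferred by a 1-nn rule depends on a choice unrelated to geography.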
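
Illustrative sketch for the WAS score of Section 3.1. The per-image score and the average follow the two definitions given there, and the constant for the maximum distance is the value stated in the text. The haversine formula and the mean Earth radius of 6,371 km are assumptions: the text does not say how the geographic distance d(i) was computed.

```python
import math

# R_max in km, as stated in Section 3.1.
R_MAX_KM = 20027.5
# Assumption: mean Earth radius used by the haversine formula below.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def score(distance_km, r_max=R_MAX_KM):
    """Per-image score: 1 - log(1 + d) / log(1 + R_max)."""
    return 1.0 - math.log1p(distance_km) / math.log1p(r_max)


def was(distances_km, r_max=R_MAX_KM):
    """WAS of a method: mean of the per-image scores over the test set."""
    if not distances_km:
        raise ValueError("need at least one distance")
    return sum(score(d, r_max) for d in distances_km) / len(distances_km)
```

A perfect estimate (distance 0) scores 1 and an estimate at the maximum distance scores 0, so `was([0.0, R_MAX_KM])` evaluates to 0.5. The logarithm is what gives more weight to precise results: moving an estimate from 10 km to 1 km raises its score far more than moving it from 1,010 km to 1,001 km.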