Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Giorgos Kordopatis-Zilos (1), Adrian Popescu (2), Symeon Papadopoulos (1), and Yiannis Kompatsiaris (1)
(1) Information Technologies Institute, CERTH, Greece. [georgekordopatis, papadop, ikom]@iti.gr
(2) CEA, LIST, 91190 Gif-sur-Yvette, France. adrian.popescu@cea.fr

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

We describe the participation of the CERTH/CEA-LIST team in the MediaEval 2016 Placing Task. We submitted five runs to the estimation-based sub-task: one based only on text, employing a Language Model-based approach with several refinements; one based on visual content, using geo-spatial clustering over the most visually similar images; and three based on a hybrid scheme exploiting both visual and textual cues from the multimedia items, trained on datasets of different size and origin. The best results were obtained by a hybrid approach trained with external training data and using two publicly available gazetteers.

1. INTRODUCTION

The goal of the task is to estimate the location of 1,497,464 photos and 29,934 videos using a set of ≈5M geotagged items and their metadata for training [1]. All submitted runs are built upon the scheme of our last year's participation [4], integrating several refinements. For the text-based runs, we focused on improving the pre-processing of the training set items' metadata and on refining the feature selection method. For the visual-based runs, we built a more generic deep neural network model for enhanced visual image representation. For the hybrid scheme, we devised a score for selecting between the text and visual estimations based on prediction confidence. To further improve performance, we built a model using all geotagged items of the YFCC dataset [8] (items uploaded by users in the test set are not included), and we leveraged structured information from open geographical resources, namely Geonames (http://www.geonames.org/) and OpenStreetMap (https://www.openstreetmap.org/).

2. APPROACH DESCRIPTION

2.1 Text-based location estimation

In the first step, the tags and titles of the training set items were pre-processed. We applied URL decoding (necessary because text in different languages was URL-encoded in the released dataset), lowercase transformation, tokenization and accent removal to generate a set of terms for every item. Multi-word tags were further split into their individual components, which were also included in the item's term set. Finally, symbols and punctuation were removed from the terms, and terms consisting only of numerals or of fewer than three characters were discarded.

The core of our approach is a probabilistic Language Model (LM) [5] built from the terms of the training set items. The earth surface was divided into (nearly) rectangular cells of size 0.01°×0.01° latitude/longitude, and the term-cell probabilities were computed based on the user count of each term in each cell. The most likely cell (mlc) of a query is derived from the summation of the respective term-cell probabilities. Query items with no textual information are placed at the centre of the cell with the most users.

For feature selection, we used a refined version of the locality metric [4]: in our last participation, we computed locality based on the neighbor users that used the same term in the same cell. To this end, we utilized a coarse grid (0.1°×0.1°) for the calculation, based on which the neighbor users were assigned to a unique cell, as depicted in Figure 1(a). In that setting, it was possible that a pair of users were not assigned to the same cell even if the geodesic distance between their items was small. To tackle this issue, we now used a grid of 0.01°×0.01° and modified the assignment of users to cells: instead of assigning a user to a unique cell, we assigned the user to an entire neighborhood, as illustrated in Figure 1(b). The area highlighted in orange corresponds to the cells where both users were assigned. The terms with non-negative locality score form the selected term set T. The contribution of each term was then weighted based on its locality and spatial entropy scores. Spatial entropy is a Gaussian weight function based on the term-cell entropy of the term [2]. The two measures are combined to generate a weight value for every term in T.

Figure 1: Locality examples: (a) initial, (b) refined.

To ensure more robust performance at fine granularity, we built an additional LM using a finer grid (0.001°×0.001°). Having computed the mlc for both coarse and fine granularities, we selected the most appropriate estimation: the mlc of the finer grid if it falls within the borders of the coarse one, otherwise the mlc of the coarse grid. Finally, we employed similarity search as in [9] to derive the location estimate from the k_t = 5 most textually similar images inside the selected mlc, computing textual similarity as the Jaccard similarity between the corresponding term sets. An error case analysis of the text method is presented in [3].
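To make the cell-based LM concrete, the Python sketch below illustrates the overall flow under simplifying assumptions: it omits URL decoding, multi-word splitting, the locality/entropy term weighting and the finer 0.001° grid, and derives term-cell probabilities from plain per-cell user counts. All function and variable names (preprocess, build_lm, most_likely_cell, cell_of) are ours, introduced only for illustration.

```python
import math
import re
import unicodedata
from collections import defaultdict

CELL = 0.01  # cell size in degrees of latitude/longitude

def preprocess(text):
    """Simplified term extraction: lowercase, strip accents, tokenize,
    drop purely numeric terms and terms shorter than three characters."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    terms = re.findall(r"[a-z0-9]+", text)
    return {t for t in terms if not t.isdigit() and len(t) >= 3}

def cell_of(lat, lon, size=CELL):
    """Map coordinates to a (nearly) rectangular grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))

def build_lm(training_items, size=CELL):
    """Count, for every term, the distinct users that used it in each cell,
    then normalise the counts into term-cell probabilities."""
    users = defaultdict(lambda: defaultdict(set))
    for item in training_items:  # items: dicts with 'user', 'lat', 'lon', 'text'
        c = cell_of(item["lat"], item["lon"], size)
        for term in preprocess(item["text"]):
            users[term][c].add(item["user"])
    lm = {}
    for term, cells in users.items():
        total = sum(len(u) for u in cells.values())
        lm[term] = {c: len(u) / total for c, u in cells.items()}
    return lm

def most_likely_cell(lm, query_text):
    """Sum the term-cell probabilities over the query terms and
    return the cell with the highest accumulated score."""
    scores = defaultdict(float)
    for term in preprocess(query_text):
        for c, p in lm.get(term, {}).items():
            scores[c] += p
    return max(scores, key=scores.get) if scores else None
```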
2.2 Visual-based location estimation

The employed method is a refined version of the one used in last year's participation [4]. The main objectives were (1) to ensure that the visual features are generic and transferable from a training set independent of YFCC to the subset of the collection used for the task, and (2) to provide a compact representation of the features in order to scale up the visual search process. To meet the first objective, the VGG architecture [7] was fine-tuned with over 5,000 diversified man-made and natural POIs, represented by over 7 million images. These were downloaded from Flickr using queries with (1) the POI name and a radius of 5 km around its coordinates and (2) the POI name and the associated city name. Following the conclusions of [6] regarding the uselessness of manual annotation for POI representation, no manual validation of the training set was performed. To meet the second objective, we used the same procedure as last year and compressed the initial features (VGG fc7, 4096 dimensions) to 128 dimensions using PCA. The PCA matrix was learned on a subset of 250,000 images of the training set.

Visual similarities between the query and the training images were then computed in this compressed feature space. Having calculated these similarities, we retrieved the top k_v most visually similar images (in our runs we set k_v = 20) and applied a simple spatial clustering scheme based on their geographical distance. We defined a confidence metric for our visual approach based on the size of the largest cluster:

    conf_v(i) = max((n(i) − n_t) / (k_v − n_t), 0)    (1)

where n(i) is the number of neighbors in the largest cluster for query image i, and n_t is a configuration parameter that determines the "strictness" of the confidence score. The confidence score takes values in the range [0, 1]. We empirically set n_t = 5. Our visual approach is not designed for video analysis, thus all videos were placed in the centre of London, which is the densest geotagged region in the world.
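As an illustration only, the sketch below shows one way to compress fc7 features with PCA and to evaluate the confidence of Equation 1 from the largest geographic cluster among the retrieved neighbors. The scikit-learn PCA stands in for whatever implementation was actually used, the greedy distance-threshold clustering with a 1 km radius is our assumption (the paper only states "a simple spatial clustering scheme"), and all names (compress_features, visual_confidence, haversine_km) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_features(train_feats, feats, dim=128):
    """Learn a PCA projection on a training subset of fc7 features (4096-d),
    e.g. 250,000 images, and project 'feats' down to 'dim' dimensions."""
    pca = PCA(n_components=dim).fit(train_feats)
    return pca.transform(feats)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def visual_confidence(neighbor_coords, k_v=20, n_t=5, radius_km=1.0):
    """Equation 1: conf_v = max((n - n_t) / (k_v - n_t), 0), where n is the
    size of the largest spatial cluster among the k_v retrieved neighbors.
    Clustering here is a simple greedy grouping by distance (assumed)."""
    clusters = []
    for lat, lon in neighbor_coords:
        for cluster in clusters:
            c_lat, c_lon = cluster[0]
            if haversine_km(lat, lon, c_lat, c_lon) <= radius_km:
                cluster.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])
    n = max((len(c) for c in clusters), default=0)
    return max((n - n_t) / (k_v - n_t), 0.0)
```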
2.3 Hybrid location estimation

The hybrid approach comprises a set of rules that determine the source of estimation between the text and visual approaches. First, for query images for which no estimation could be produced by the text-based approach, the location was estimated based on the visual approach. Otherwise, if the visual estimation fell inside the borders of the mlc calculated by the text-based approach, the visual estimation was selected. If not, the estimation was determined by comparing the confidence scores of the two approaches:

    G_h(i) = G_v(i)  if conf_t(i) ≤ conf_v(i)
             G_t(i)  otherwise                    (2)

where G_h, G_t and G_v are the estimated locations for query item i of the hybrid, textual and visual approach, respectively, conf_t is the confidence score of the text-based estimation, defined in [4], and conf_v is the confidence score of the visual-based estimation (Equation 1).
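A compact sketch of this rule set follows, assuming each estimator returns a (lat, lon) pair or None; the helper inside_mlc stands in for the cell-border test of the text-based mlc, and both it and hybrid_estimate are hypothetical names introduced here for illustration.

```python
def hybrid_estimate(text_est, visual_est, conf_t, conf_v, inside_mlc):
    """Rule-based fusion of the text and visual location estimates.
    text_est / visual_est: (lat, lon) tuples or None.
    inside_mlc: callable telling whether a point falls within the most
    likely cell of the text-based LM (hypothetical helper)."""
    if text_est is None:            # no textual estimation could be produced
        return visual_est
    if visual_est is None:          # nothing to compare against
        return text_est
    if inside_mlc(visual_est):      # visual estimate agrees with the text mlc
        return visual_est
    # Equation 2: keep the estimate with the higher confidence
    return visual_est if conf_t <= conf_v else text_est
```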
3. RUNS AND RESULTS

The submitted runs include one text-based run (RUN-1), one visual-based run (RUN-2) and three hybrid runs (RUN-3, RUN-4, RUN-5). For the first three runs, the system was trained on the set released by the organizers. In RUN-4 and RUN-5, the training set consisted of all YFCC items excluding those contributed by users appearing in the test set. We also report the results of an external run (RUN-E), based on the visual approach but using the full geotagged subset of YFCC. The results for RUN-E show that adding more training data significantly improves visual geolocation, especially at short ranges (10m and 100m), where this run outperforms even the best hybrid run.

To explore the impact of external data sources, in RUN-5 we further leveraged structured data from Geonames and OpenStreetMap. In particular, we used the geotagged entries of the two sources as additional training items for building the text-based LM: from Geonames we used a list of city names along with their alternative names, and from OpenStreetMap a list of nodes (points of interest), provided they were associated with an address. Since training items need to be associated with a contributor, we considered Geonames and OpenStreetMap as the two contributing users.

Table 1: Geotagging precision (%) and median error (km) for the five submitted runs (+RUN-E for images).

(a) Images
measure        RUN-1   RUN-2   RUN-3   RUN-4   RUN-5   RUN-E
P@10m           0.59    0.08    0.56    0.70    0.72    4.78
P@100m          6.42    1.84    6.58    7.96    8.27    8.41
P@1km          24.55    5.62   25.03   27.82   28.54   13.67
P@10km         43.32    8.16   43.73   46.52   46.45   16.60
P@100km        51.26   10.21   51.69   53.96   53.50   18.83
m. error (km)     65    5031      56      24      27    3432

(b) Videos
measure        RUN-1   RUN-2   RUN-3   RUN-4   RUN-5
P@10m           0.55    0.00    0.55    0.69    0.71
P@100m          6.86    0.06    6.86    7.89    8.19
P@1km          22.73    0.50   22.73   25.53   26.16
P@10km         40.60    2.48   40.60   43.89   43.62
P@100km        48.24    4.97   48.24   51.20   50.44
m. error (km)    161    6211     161      68      85

According to Table 1, the best performance at fine granularities (≤1km) was attained by RUN-5 for both images and videos. RUN-4 reported the best results in terms of median distance error and precision at coarse granularities (>1km). Comparing the two runs, one may conclude that leveraging structured geographic information improves geolocation precision at short ranges (reaching 8.27% and 28.54% in P@100m and P@1km respectively), with a minor increase in median error. Moreover, the combination of visual and textual features (RUN-3) improved the overall performance of the system in the case of images, but had no effect on video geotagging, since no visual information was used from videos.

4. ACKNOWLEDGMENTS

This work is supported by the REVEAL and USEMP projects, partially funded by the European Commission under contract numbers 610928 and 611596 respectively.

5. REFERENCES

[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The placing task at MediaEval 2016. In MediaEval 2016 Placing Task, 2016.
[2] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In PAISI 2015, pages 21-40, 2015.
[3] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. In-depth exploration of geotagging performance using sampling strategies on YFCC100M. In Proceedings of MMCommons 2016. ACM, 2016.
[4] G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. CERTH/CEA LIST at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015.
[5] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.
[6] A. Popescu, E. Gadeski, and H. Le Borgne. Scalable domain adaptation of convolutional neural networks. CoRR, abs/1512.02013, 2015.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[8] B. Thomee et al. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015.
[9] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1-48:8, New York, NY, USA, 2011. ACM.