=Paper=
{{Paper
|id=Vol-1436/Paper58
|storemode=property
|title=CERTH/CEA LIST at MediaEval Placing Task 2015
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper58.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Kordopatis-Zilos15
}}
==CERTH/CEA LIST at MediaEval Placing Task 2015==
CERTH/CEA LIST at MediaEval Placing Task 2015

Giorgos Kordopatis-Zilos¹, Adrian Popescu², Symeon Papadopoulos¹, and Yiannis Kompatsiaris¹
¹ Information Technologies Institute, CERTH, Greece. [georgekordopatis,papadop,ikom]@iti.gr
² CEA, LIST, 91190 Gif-sur-Yvette, France. adrian.popescu@cea.fr

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
We describe the participation of the CERTH/CEA LIST team in the Placing Task of MediaEval 2015. We submitted five runs in total to the Locale-based placing sub-task, providing estimated locations for the test set released by the organisers. Of the five runs, two are based solely on textual information, using feature selection and weighting methods on top of an existing language model-based approach. One is based on visual content, using geo-spatial clustering over the most visually similar images, and two are based on hybrid approaches, using both visual and textual cues from the images. The best results (median error 22km, 27.5% accuracy at 1km) were obtained when visual and textual features were combined, using external data for training.

1. INTRODUCTION
The goal of the task is to produce location estimates for a set of 931,573 photos and 18,316 videos using a set of 4.7M geotagged items and their metadata for training [1]. For the tag-based runs, we built upon the scheme of our 2014 participation [4] and a number of recent extensions to it [5], focusing on improved feature selection and feature weighting. For visual-based location estimation, we use a geospatial clustering scheme over the most visually similar images for every query image. A hybrid scheme combines the textual and visual approaches. To further improve the model, we constructed it using all geotagged metadata from the YFCC dataset [9], after removing all images from the users contained in the test set.

2. APPROACH DESCRIPTION

2.1 Tag-based location estimation
Following our last year's approach [4] (baseline), the earth's surface is divided into (nearly) rectangular cells of 0.01° latitude/longitude (approximately 1 km² near the equator). We construct a Language Model (LM) [6], i.e. a tag-cell probability map, by processing the tags and titles of the training set images. The tag-cell probabilities are computed based on the user count of each tag in each cell. The Most Likely Cell (MLC) of a query (test) image is then derived from the summation of the respective tag-cell probabilities. The contribution of each tag is weighted based on its spatial entropy through a Gaussian weight function [5], referred to as the Spatial Entropy (SE) function.

To ensure more reliable prediction at finer granularities, we built an additional LM using a finer grid (cell side length of 0.001°). Having computed the MLCs for both the coarse and the fine granularity, we apply an Internal Grid technique [4] to produce more accurate, yet equally reliable, location estimates. This is achieved by first selecting the most appropriate granularity (the finer grid cell if considered reliable, otherwise the coarser grid cell), and then producing the location estimate as the center-of-gravity of the k most textually similar images inside the selected MLC (k = 5), employing Similarity Search as in [10]. The textual similarity is computed using the Jaccard similarity of the corresponding sets of tags.
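To make the baseline concrete, the following is a minimal sketch of the tag-cell language model and MLC selection described above (not the authors' actual implementation), assuming training items have already been reduced to (user, tags, latitude, longitude) tuples; all function names are illustrative.

```python
from collections import defaultdict

def build_language_model(train_items, cell_size=0.01):
    """Tag-cell probability map: per-cell distinct-user counts, normalized per tag.
    train_items: iterable of (user_id, tags, lat, lon)."""
    cell_users = defaultdict(set)                 # (tag, cell) -> distinct users
    for user_id, tags, lat, lon in train_items:
        cell = (int(lat // cell_size), int(lon // cell_size))
        for tag in set(tags):
            cell_users[(tag, cell)].add(user_id)
    totals = defaultdict(int)                     # tag -> summed user counts
    for (tag, _), users in cell_users.items():
        totals[tag] += len(users)
    lm = defaultdict(dict)                        # tag -> {cell: probability}
    for (tag, cell), users in cell_users.items():
        lm[tag][cell] = len(users) / totals[tag]
    return lm

def most_likely_cell(query_tags, lm, weight=lambda tag: 1.0):
    """Sum the (weighted) tag-cell probabilities over the query tags and return
    the top-scoring cell; `weight` is a hook for the SE and locality-based
    weights of Sections 2.1.1-2.1.2."""
    scores = defaultdict(float)
    for tag in query_tags:
        for cell, p in lm.get(tag, {}).items():
            scores[cell] += weight(tag) * p
    return max(scores, key=scores.get) if scores else None
```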
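Similarly, the center-of-gravity refinement inside the selected MLC can be sketched as below; `cell_items`, a per-cell index of training images, is a hypothetical data layout introduced for illustration.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity of two tag sets."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def refine_within_cell(query_tags, cell_items, k=5):
    """Center-of-gravity of the k most textually similar images in the MLC.
    cell_items: list of (tags, lat, lon) for the training images in the cell."""
    top = sorted(cell_items, key=lambda it: jaccard(query_tags, it[0]),
                 reverse=True)[:k]
    if not top:
        return None
    return (sum(it[1] for it in top) / len(top),   # mean latitude
            sum(it[2] for it in top) / len(top))   # mean longitude
```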
2.1.1 Feature Selection
To increase the robustness of the model and reduce its size, feature selection was performed based on two measures: the accuracy and the locality of the tags.

Accuracy is computed using the cross-validation scheme proposed in [5]. The training set is partitioned into p folds (here, p = 10). One partition at a time is withheld, and the remaining p − 1 partitions are used to build the LM. Having built the LM, the location of every item of the withheld partition is estimated. The accuracy of a tag is computed based on Equation 1:

$$ t_{geo}(t) = \frac{N_r}{N_t} \qquad (1) $$

where t_geo(t) is the accuracy score of tag t, N_r is the total number of correctly geotagged items tagged with t, and N_t is the total number of items tagged with t. The tags with non-zero accuracy score form a tag set denoted as T_a.

Locality captures the spatial awareness of tags. For every individual tag, the locality score is calculated based on the tag frequency and the neighbor users that have used it in the various cells. Every time a user uses a given tag, he/she is assigned to the respective location cell. As a result, each cell has a set of users assigned to it, and all users assigned to the same cell are considered neighbors (for that particular cell). The locality score is then computed by Equation 2:

$$ loc(t) = N_t \cdot \frac{\sum_{c \in C} \sum_{u \in U_{t,c}} \left|\{u' \mid u' \in U_{t,c},\ u' \neq u\}\right|}{N_t^2} \qquad (2) $$

where loc(t) is the locality score of tag t, N_t is the total number of occurrences of t, C denotes the set of all cells, and U_{t,c} denotes the set of users that used tag t inside cell c. Since all users in U_{t,c} are neighbors, Equation 2 can be simplified to:

$$ loc(t) = \frac{\sum_{c \in C} \sum_{u \in U_{t,c}} \left(|U_{t,c}| - 1\right)}{N_t} = \frac{\sum_{c \in C} |U_{t,c}| \left(|U_{t,c}| - 1\right)}{N_t} $$

The tags with non-zero locality score form a tag set denoted as T_l. The final tag set T used by the approach is the intersection of the two tag sets: T = T_a ∩ T_l.

2.1.2 Feature Weighting
Since the locality metric is sensitive to tag frequency, we consider it inappropriate for weighting tags directly. Instead, having computed the locality scores for every tag in T, we sort the tags by score and calculate their weights from their position in the distribution:

$$ w_l = \frac{|T| - (j - 1)}{|T|} \qquad (3) $$

where w_l is the weight of the tag t at the j-th position in the distribution and |T| is the total number of tags contained in T. This weighting approach returns values in the range (0, 1]. To combine the two weighting functions, we normalize the values of the Spatial Entropy weighting function, denoted w_se, and use Equation 4 to compute the final weights:

$$ w = \omega \cdot w_{se} + (1 - \omega) \cdot w_l \qquad (4) $$

The value of ω was set to 0.2 through empirical assessment on a sample of 10K images.
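A compact sketch of the two selection measures of Section 2.1.1, under the assumption that the per-tag statistics (cross-validation hit counts and per-cell user sets) have already been collected; `select_tags` and its input layout are illustrative.

```python
def tag_accuracy(n_correct, n_tagged):
    """Eq. 1: t_geo(t) = N_r / N_t, with N_r counted over the withheld folds."""
    return n_correct / n_tagged if n_tagged else 0.0

def tag_locality(users_per_cell, n_occurrences):
    """Eq. 2 in its simplified form: sum_c |U_{t,c}| (|U_{t,c}| - 1) / N_t.
    users_per_cell: {cell: set of users who used the tag in that cell}."""
    pairs = sum(len(u) * (len(u) - 1) for u in users_per_cell.values())
    return pairs / n_occurrences if n_occurrences else 0.0

def select_tags(tag_stats):
    """Keep tags with non-zero scores under both measures (T = T_a ∩ T_l).
    tag_stats: {tag: (n_correct, n_tagged, users_per_cell, n_occurrences)}."""
    return {t for t, (nr, nt, upc, n) in tag_stats.items()
            if tag_accuracy(nr, nt) > 0 and tag_locality(upc, n) > 0}
```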
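The rank-based weighting of Section 2.1.2 (Equations 3-4) can be sketched as follows, assuming the locality scores and the normalized SE weights are given as dictionaries keyed by tag:

```python
def rank_weights(locality_scores):
    """Eq. 3: w_l = (|T| - (j - 1)) / |T| for the tag at (1-based) rank j,
    with tags sorted by descending locality score."""
    ranked = sorted(locality_scores, key=locality_scores.get, reverse=True)
    n = len(ranked)
    return {tag: (n - j) / n for j, tag in enumerate(ranked)}  # enumerate is 0-based

def final_weights(se_weights, locality_scores, omega=0.2):
    """Eq. 4: w = omega * w_se + (1 - omega) * w_l, with omega tuned to 0.2."""
    w_l = rank_weights(locality_scores)
    return {tag: omega * se_weights[tag] + (1 - omega) * w_l[tag] for tag in w_l}
```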
2.1.3 Confidence
To evaluate the confidence of the estimate for each query image, we use the confidence measure of Equation 5:

$$ conf(i) = \frac{\sum_{c \in C :\, dist(c,\, mlc) < l}\ p(c \mid i)}{\sum_{c \in C}\ p(c \mid i)} \qquad (5) $$

where conf(i) is the confidence for query image i, p(c|i) is the cell probability of cell c for image i, dist(c_1, c_2) is the distance between the centers of cells c_1 and c_2, mlc stands for the Most Likely Cell, and l is a distance threshold.

2.2 Visual-based location estimation
We compute visual-based location estimates with CNN features adapted to the tourist domain, using approximately 1000 Points Of Interest (POIs) for training, with approximately 1200 images per POI, fed directly to Caffe [3]. These features were computed by fine-tuning the VGG model proposed at ILSVRC 2014 [7]. The outputs of the fc7 layer (4096 dimensions) were compressed to 128 dimensions using a PCA matrix learned from a subset of 250,000 images of the CNN training set, and were used to compute image similarities. CNN features were selected after a favorable comparison against compact VLAD features of similar size [8] and SURF features of significantly larger size [2]. Having calculated these similarities, we retrieve the top k most visually similar images and use their locations to produce the estimate. In the visual-only run (RUN-2), k = 20 and we apply a simple incremental spatial clustering scheme: if the j-th image (out of the k most similar) is within 1km of the closest of the previous j − 1 images, it is assigned to that image's cluster; otherwise it forms its own cluster. In the end, the largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate.

2.3 Hybrid location estimation
For the hybrid approach, we build an LM using the scheme described in Section 2.1. To achieve further improvement at finer granularities with the Similarity Search approach, the similarity between two images is derived from the combination of the visual and textual similarities. To this end, we normalize the visual similarities to the range [0, 1]. The final similarity for a pair of images is computed as the arithmetic mean of the two similarities. We then retrieve the top k = 5 most similar images, within the borders specified by the Internal Grid technique [5], and use their center-of-gravity as the final location estimate.

For those test images where no estimate can be produced based on the LM, or where the confidence is lower than 0.02 (which together amount to approximately 10% of the test set), we use the visual approach to produce the estimate.
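Before moving to the results, here is a sketch of the incremental spatial clustering used for RUN-2 (Section 2.2), spelling out the distance computation and the tie-breaking rule; input locations are assumed to be ordered by decreasing visual similarity.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def incremental_clusters(locations, radius_km=1.0):
    """locations: (lat, lon) of the k most visually similar images, ordered by
    decreasing similarity. Each point joins the cluster of its nearest earlier
    point if that point lies within radius_km; otherwise it starts a cluster."""
    clusters, assignment = [], []      # assignment[i] = cluster index of point i
    for i, loc in enumerate(locations):
        if i == 0:
            clusters.append([loc])
            assignment.append(0)
            continue
        nearest = min(range(i), key=lambda m: haversine_km(loc, locations[m]))
        if haversine_km(loc, locations[nearest]) <= radius_km:
            clusters[assignment[nearest]].append(loc)
            assignment.append(assignment[nearest])
        else:
            clusters.append([loc])
            assignment.append(len(clusters) - 1)
    return clusters

def visual_estimate(locations):
    """Centroid of the largest cluster; max() keeps the first cluster on ties."""
    if not locations:
        return None
    best = max(incremental_clusters(locations), key=len)
    return (sum(p[0] for p in best) / len(best),
            sum(p[1] for p in best) / len(best))
```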
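The hybrid combination of Section 2.3 and the confidence-based fallback can then be sketched as below; `lm_estimate` and its (estimate, confidence) return interface are assumptions introduced for illustration, standing in for the tag-based pipeline above.

```python
def hybrid_similarity(visual_sim, textual_sim):
    """Arithmetic mean of the min-max-normalized visual similarity and the
    Jaccard textual similarity; both inputs are assumed to lie in [0, 1]."""
    return 0.5 * (visual_sim + textual_sim)

def confidence(cell_probs, mlc, dist_km, l_km):
    """Eq. 5: probability mass of the cells within l km of the MLC divided by
    the total mass. cell_probs: {cell: p(c|i)}; dist_km compares cell centers."""
    total = sum(cell_probs.values())
    near = sum(p for c, p in cell_probs.items() if dist_km(c, mlc) < l_km)
    return near / total if total else 0.0

def place(query, lm_estimate, visual_estimate, threshold=0.02):
    """Fall back to the visual estimate when the LM yields nothing or its
    confidence is below the threshold (about 10% of the test set)."""
    estimate, conf = lm_estimate(query)
    if estimate is None or conf < threshold:
        return visual_estimate(query)
    return estimate
```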
3. RUNS AND RESULTS
We prepared two tag-based runs (RUN-1, RUN-4), one visual run (RUN-2) and two hybrid runs (RUN-3, RUN-5). Runs 1-3 used the training set released by the organisers; in Runs 4-5, the entire YFCC dataset was used, excluding all images from users that appeared in the test set. All runs contained estimates for the full test set (949,889 items).

measure        RUN-1   RUN-2   RUN-3   RUN-4   RUN-5
acc(1m)         0.15    0.01    0.15    0.16    0.16
acc(10m)        0.61    0.08    0.62    0.75    0.76
acc(100m)       6.40    1.76    6.52    7.73    7.83
acc(1km)       24.33    5.19   24.61   27.30   27.54
acc(10km)      43.07    7.43   43.41   46.48   46.77
acc(100km)     51.08    9.07   51.45   54.02   54.33
acc(1000km)    63.81   23.98   64.18   65.81   66.06
m. error (km)     69    5663      61      24      22

Table 1: Geotagging accuracy (%) at different ranges and median error (km). RUN-1 and RUN-4 used only text, RUN-2 relied on visual features, and RUN-3 and RUN-5 used both visual and textual features.

According to Table 1, the best performance in terms of both median error and accuracy at all ranges was attained by RUN-5. Comparing the corresponding runs with different training sets, one may conclude that the use of an extended training set (one that does not contain user-specific information) had considerable impact on the accuracy results across all ranges. Furthermore, the combination of visual and textual features in RUN-5 further improved the overall performance, reaching 7.83% accuracy for the <100m range and minimizing the median error (22km). The visual-only run (RUN-2) obtained remarkable results, reaching 5.19% accuracy for the <1km range.

In the future, we plan to look deeper into different weighting schemes in pursuit of further improvements. Moreover, we plan to develop more sophisticated clustering models for the visual-only runs.

4. ACKNOWLEDGEMENTS
This work is supported by the REVEAL and USEMP projects, partially funded by the European Commission under contract numbers 610928 and 611596 respectively.

5. REFERENCES
[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The Placing Task at MediaEval 2015. In MediaEval 2015 Placing Task, 2015.
[2] J. Choi and X. Li. The 2014 ICSI/TU Delft location estimation system. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, October 16-17, 2014.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[4] G. Kordopatis-Zilos, G. Orfanidis, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2014. In MediaEval 2014 Placing Task, 2014.
[5] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21-40, 2015.
[6] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[8] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.
[9] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015.
[10] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1-48:8, New York, NY, USA, 2011. ACM.