=Paper=
{{Paper
|id=Vol-1263/paper44
|storemode=property
|title=SocialSensor at MediaEval Placing Task 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_44.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Kordopatis-ZilosOPK14
}}
==SocialSensor at MediaEval Placing Task 2014==
Giorgos Kordopatis-Zilos, Giorgos Orfanidis, Symeon Papadopoulos, Yiannis Kompatsiaris

Information Technologies Institute (CERTH-ITI), Thessaloniki, Greece

{georgekordopatis, g.orfanidis, papadop, ikom}@iti.gr

ABSTRACT

We describe the participation of the SocialSensor team in the Placing Task of MediaEval 2014. We submitted three runs based on tag information for the full test set, using extensions over an existing language modelling approach, and two runs (one on the full test set and the other on a 25,500-image subset) based on visual content, using geospatial clustering and supervised learning. Our best performance (median error 230km, 23% at 1km) was achieved with the use of tag features, using only internal training data.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. INTRODUCTION

The goal of the task is to produce location estimates for a set of 510K images, using a set of over 5M geotagged images and their metadata for training [1]. For the tag-based runs, we built upon the scheme of [4], extending it with the use of the Similarity Search method introduced in [6]. We also devised an internal grid technique and a Gaussian distribution model based on the spatial entropy of tags to adjust the corresponding probabilities. For the visual-based location estimation, we attempted to build visual location models, though with limited success. All models were built solely on the training data provided by the organizers (i.e. no external gazetteers or Internet data were used).
2. APPROACHES

2.1 Tag-based location estimation

Baseline approach: The baseline method relies on an offline step, in which a complex geographical-tag model is built from the tags and locations of the approximately 5M images of the training set. The metadata used to build the model and to estimate the location of a query image are the tags, the title and the description. A pre-processing step was first applied to remove all punctuation and symbols and to transform all characters to lower case. After the pre-processing, all training images left with empty tags and title are removed, resulting in a training set of approximately 4.1M images. Note that the same pre-processing is applied to the test images before the actual location estimation process.

In contrast to last year's clustering [3], we divide the earth surface into rectangular cells with a side length of 0.01° for both latitude and longitude (approximately 1km near the equator). Consequently, a grid of cells is created, which we use to build our language model following the approach described in [4]. More specifically, we estimate the most probable cell for a query (test) image based on the respective tag probabilities. A tag probability in a particular cell is calculated as the total number of different Flickr users that used the tag inside the cell, divided by the total count of different users over all cells. Note that in this way a user can be counted in the total count over all cells more than once.
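The per-cell tag probability described above can be sketched as follows. This is a minimal illustration assuming training images arrive as (user_id, tags, lat, lon) tuples — a hypothetical input format, not the task's actual file layout.

```python
from collections import defaultdict

def build_language_model(images, cell_side=0.01):
    """Build per-cell tag probabilities from geotagged training images.

    `images`: iterable of (user_id, tags, lat, lon) tuples (assumed format).
    Cell side length defaults to 0.01 degrees, as in the paper.
    """
    cell_tag_users = defaultdict(set)  # (cell, tag) -> set of user ids
    tag_cells = defaultdict(set)       # tag -> cells where the tag occurs

    for user, tags, lat, lon in images:
        cell = (int(lat // cell_side), int(lon // cell_side))
        for tag in tags:
            cell_tag_users[(cell, tag)].add(user)
            tag_cells[tag].add(cell)

    # p(tag|cell) = #users of the tag in the cell / total user count of the
    # tag over all cells. A user contributes once per cell, so the same user
    # may be counted several times in the denominator, as the paper notes.
    model = {}
    for (cell, tag), users in cell_tag_users.items():
        total = sum(len(cell_tag_users[(c, tag)]) for c in tag_cells[tag])
        model[(cell, tag)] = len(users) / total
    return model
```

The returned dictionary maps a (cell, tag) pair to P(t_k|c_i), which the cell estimation step then aggregates over a query image's tags.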
In order to assign a query image to a cell, we calculate the probability of each cell by summing up the contributions of individual tags and title words. The cell with the greatest probability is selected as the image cell. If during this process there is no outcome (i.e. the probability for all cells is zero), we use the description of the query image. For the test images where there is still no result (e.g. complete lack of text), we set their location equal to the center of the most populated cell of a coarse-granularity grid (100km×100km), a kind of maximum likelihood estimation.

Extensions: We devised the following extensions.

Similarity Search: Having assigned a query image to a cell, we then employ the location estimation technique of [6]: we first determine the k most similar training images (using Jaccard similarity on the corresponding sets of tags) and use their center of gravity (weighted by the similarity values) as the location estimate for the test image.

Internal Grid: In order to ensure more reliable prediction at finer granularities, we also built the language model using a finer grid (cell side length of 0.001° for both latitude and longitude, corresponding to a square of ≈100m×100m). Having computed the result for both the coarse and the fine granularity, we use an internal grid technique: for a query image, if the estimate based on the finer granularity falls within the borders of the estimated cell of the coarser granularity, then we consider the fine granularity trustworthy and apply similarity search inside the fine cell. Otherwise, we perform similarity search inside the coarser-granularity cell, since coarser-granularity language models are by default more trustworthy (due to the use of more data for building them).
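The Similarity Search refinement of [6] can be sketched as below: Jaccard similarity over tag sets, then a similarity-weighted center of gravity of the top-k matches. The (tags, lat, lon) tuple format and the value of k are assumptions for illustration; the paper does not report the k it used.

```python
def estimate_location(query_tags, cell_images, k=10):
    """Similarity-search refinement: rank the training images of the
    selected cell by Jaccard similarity on tag sets and return the
    similarity-weighted center of gravity of the k best matches.

    `cell_images`: list of (tags, lat, lon) tuples (assumed structure).
    The weighted mean of lat/lon is a flat-earth approximation, which is
    reasonable inside a single small grid cell.
    """
    q = set(query_tags)
    scored = []
    for tags, lat, lon in cell_images:
        t = set(tags)
        union = len(q | t)
        sim = len(q & t) / union if union else 0.0
        scored.append((sim, lat, lon))
    scored.sort(reverse=True)                     # most similar first
    top = [s for s in scored[:k] if s[0] > 0]     # drop zero-overlap images
    if not top:
        return None                               # no usable neighbours
    w = sum(sim for sim, _, _ in top)
    return (sum(sim * lat for sim, lat, _ in top) / w,
            sum(sim * lon for sim, _, lon in top) / w)
```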
Spatial Entropy: In order to adjust the original language model tag probabilities for each cell, we built a Gaussian weight function based on the values of the spatial tag entropy. The spatial entropy for each tag t_k is calculated based on its probabilities over all m cells of the grid:

e(t_k) = − Σ_{i=1}^{m} p(t_k|c_i) log p(t_k|c_i)    (1)

We chose a Gaussian model because tags with either too high or too low entropy values typically carry no geographic cues, and we therefore need to suppress their influence on the location estimation process. Equation 2 presents the entropy-based cell estimation:

p(c_i|j) = Σ_{k=1}^{T} P(t_k|c_i) · N(e(t_k), µ, σ)    (2)

where p(c_i|j) is the probability of cell c_i for image j, T is the number of tags of image j, P(t_k|c_i) is the probability of tag t_k for cell c_i, and e(t_k) is the entropy of tag t_k. N is the Gaussian function, and the parameters µ, σ are estimated using the entropy distribution over the training set.
2.2 Visual-based location estimation

To build the visual location models, we relied on two features, SURF+VLAD and CS-LBP+VLAD, concatenated in a single vector. In particular, we first calculated the interest points of each image, and then extracted both SURF and CS-LBP descriptors corresponding to them. The parameters used for CS-LBP [2] were P = 8, R = 2, and the number of bins N = 16. L2 normalization was applied for SURF and L1 for CS-LBP. For both features we used distinct multiple vocabularies learned on independent collections (four visual vocabularies with k = 128 centroids each) and applied dimensionality reduction using PCA separately to each VLAD vector, keeping more principal components for the SURF+VLAD vector at a ratio of 3:1 (due to the correspondingly higher dimensionality of the non-reduced SURF+VLAD). The final VLAD vectors had a concatenated length of 1024 and were L2-normalized. For VLAD, we used the implementation of [5].

The main part of the model building comprised the training of linear SVMs to separate the samples into a predefined number of spatial clusters and subclusters (we used 50 clusters and up to 50 subclusters for each cluster). The clusters/subclusters were created using k-means on the coordinates of the training images, while the number of subclusters was determined by the number of samples N assigned to each cluster (min(round(N/3000), 50)).

Subcluster Selection: For each cluster a one-vs-rest approach was applied, resulting in 50-dimensional prediction score vectors; for the subclusters a similar approach was used, but only on intra-cluster samples (resulting in better performance than using both intra- and inter-cluster samples). The decision about the cluster membership of each sample was a combination of the estimation scores provided by the cluster prediction score vectors and the scores corresponding to the best subcluster in each cluster. Finally, priors (based on the number of images per cluster/subcluster) were applied to the respective scores, since they were found to lead to some improvement.

Similarity Search: To achieve location estimates of finer granularity, we applied a similarity search step at the subcluster level. In particular, the query image was compared to 1000 samples from the selected subcluster (sampling was necessary for efficiency reasons), and the location of the most similar of those was returned. Similarity was computed based on a low-dimensional concept-based representation, using the 94 concepts of ImageCLEF 2012 (i.e. each image was represented by 94 prediction scores coming from a set of corresponding pre-trained concept models).
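The subcluster sizing rule from subsection 2.2, min(round(N/3000), 50), can be sketched as follows; the input is assumed to be the list of k-means cluster labels assigned to the training images.

```python
from collections import Counter

def subcluster_counts(cluster_labels, max_sub=50, samples_per_sub=3000):
    """Number of subclusters per spatial cluster, following the paper's
    rule min(round(N/3000), 50), with N the number of training samples
    assigned to the cluster. The paper does not say how clusters smaller
    than ~1500 samples are handled; this follows the formula literally,
    which would yield 0 subclusters for such clusters.
    """
    sizes = Counter(cluster_labels)
    return {c: min(round(n / samples_per_sub), max_sub)
            for c, n in sizes.items()}
```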
3. RUNS AND RESULTS

As described above, we prepared three tag-based runs and two visual runs. The tag-based runs are Run 1, using the language model, similarity search, internal grid and spatial entropy; Run 4, using the language model and the center of cells as estimated location; and Run 5, using the language model and similarity search. Run 2 was based on the Subcluster Selection step of subsection 2.2, using the center of the subcluster as location estimate. Run 3 was based on the combination of Subcluster Selection with Similarity Search (according to subsection 2.2). For Run 3 we used a subset of 25,500 images due to lack of time; for the rest of the runs we used the full test set of 510K images.

{| class="wikitable"
|+ Table 1: Geotagging accuracy (%) and median error (km). Runs 1, 4 and 5 used text metadata, while Runs 2 and 3 relied on visual features.
! measure !! Run 1 !! Run 2 !! Run 3 !! Run 4 !! Run 5
|-
| acc(10m) || 0.5 || 0 || 0 || 0.03 || 0.31
|-
| acc(100m) || 5.85 || 0 || 0.01 || 0.65 || 4.36
|-
| acc(1km) || 23.02 || 0.03 || 0.16 || 21.87 || 22.24
|-
| acc(10km) || 39.92 || 0.76 || 1.27 || 38.96 || 38.98
|-
| acc(100km) || 46.87 || 2.18 || 3.00 || 46.13 || 46.13
|-
| acc(1000km) || 60.11 || 17.35 || 17.72 || 59.87 || 59.87
|-
| median error || 230 || 6232 || 6086 || 258 || 259
|}

According to Table 1, the best performance in terms of both median error and accuracy at all ranges was attained by Run 1. Comparing Runs 4 and 5, it can be seen that similarity search had considerable impact on the accuracy at the lower ranges. The combination of all extensions in Run 1 further improves the overall performance (reaching 5.85% accuracy at the <100m range), but the median error remains quite high (230km), which means that further improvements can be achieved. The visual runs yielded very poor results.

In the future, we plan to look into utilizing external data for training, in particular the Flickr 100M Creative Commons dataset and gazetteers. Furthermore, we will look into alternative ways of utilizing visual information for geotagging.

Acknowledgements: This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.
of corresponding pre-trained concept models).