
Giorgos Kordopatis-Zilos

Adrian Popescu

adrian.popescu@cea.fr

Symeon Papadopoulos

papadop@iti.gr

Yiannis Kompatsiaris

Information Technologies Institute, CERTH, Greece

2015

14 15

We describe the participation of the CERTH/CEA LIST team in the Placing Task of MediaEval 2015. We submitted five runs in total to the Locale-based placing sub-task, providing the estimated locations for the test set released by the organisers. Of the five runs, two are based solely on textual information, using feature selection and weighting methods over an existing language model-based approach. One is based on visual content, using geo-spatial clustering over the most visually similar images, and two are based on hybrid approaches, using both visual and textual cues from the images. The best results (median error 22 km, 27.5% at 1 km) were obtained when visual and textual features were combined, using external data for training.

1. INTRODUCTION

The goal of the task is to produce location estimates for a set of 931,573 photos and 18,316 videos, using a set of 4.7M geotagged items and their metadata for training [1]. For the tag-based runs, we built upon the scheme of our 2014 participation [4] and a number of recent extensions to it [5], focusing on improved feature selection and feature weighting. For the visual-based location estimation, we use a geospatial clustering scheme over the most visually similar images for every query image. A hybrid scheme combines the textual and visual approaches. To further improve the model, we also built it using all geotagged metadata from the YFCC dataset [9], after removing all images from the users contained in the test set.

2. APPROACH DESCRIPTION

2.1 Tag-based location estimation

Following our approach from last year [4] (the baseline), the earth's surface is divided into (nearly) rectangular cells with a side length of 0.01° latitude/longitude (approximately 1 km² near the equator). We construct a Language Model (LM) [6], i.e. a tag-cell probability map, by processing the tags and titles of the training set images. The tag-cell probabilities are computed based on the user count of each tag in each cell. Then, the Most Likely Cell (MLC) of a query (test) image is derived from the summation of the respective tag-cell probabilities. The contribution of each tag is weighted based on its spatial entropy through a Gaussian weight function [5], which is referred to as the Spatial Entropy (SE) function.
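The construction of the LM and the derivation of the MLC can be illustrated with a minimal sketch. It is not the implementation used for the runs: the function names, the input layout `(user_id, lat, lon, tags)` and the exact probability definition (distinct users of a tag in a cell, divided by the sum of that count over all cells) are assumptions made for illustration, and the optional `weights` argument merely stands in for the SE weighting.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, side=0.01):
    """Map a coordinate to a grid cell id (integer row/column indices)."""
    return (math.floor(lat / side), math.floor(lon / side))


def build_language_model(items, side=0.01):
    """items: iterable of (user_id, lat, lon, tags).

    Returns {tag: {cell: probability}}, where the probability is assumed
    to be the share of the tag's per-cell user counts falling in that cell.
    """
    users = defaultdict(lambda: defaultdict(set))
    for user, lat, lon, tags in items:
        cell = cell_of(lat, lon, side)
        for tag in set(tags):
            users[tag][cell].add(user)
    model = {}
    for tag, cells in users.items():
        total = sum(len(u) for u in cells.values())
        model[tag] = {cell: len(u) / total for cell, u in cells.items()}
    return model


def most_likely_cell(model, tags, weights=None):
    """Sum the (optionally weighted) tag-cell probabilities of the query tags.

    Returns (best_cell, scores), or (None, {}) if no query tag is in the model.
    """
    scores = defaultdict(float)
    for tag in set(tags):
        w = 1.0 if weights is None else weights.get(tag, 1.0)
        for cell, p in model.get(tag, {}).items():
            scores[cell] += w * p
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)
```

A query whose tags are all unknown to the model yields no estimate, which corresponds to the case where the LM cannot place an item.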

To ensure more reliable predictions at finer granularities, we built an additional LM using a finer grid (cell side length of 0.001°). Having computed the MLCs for both the coarse and fine granularities, we apply an Internal Grid technique [4] to produce more accurate, yet equally reliable, location estimates. This is achieved by first selecting the most appropriate granularity (the finer grid cell if it is considered reliable, otherwise the coarser grid cell), and then producing the location estimate as the center-of-gravity of the k most textually similar images inside the selected MLC (k = 5), by employing Similarity Search as in [10]. The textual similarity is computed as the Jaccard similarity of the corresponding sets of tags.
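The Similarity Search step inside the selected cell can be sketched as follows. The function names and the candidate layout `(lat, lon, tags)` are hypothetical, and the center-of-gravity is approximated here by the plain mean of latitudes and longitudes, which is an assumption that is only reasonable because all candidates lie in one small cell. The reliability test that chooses between the fine and coarse cell is not specified in the text and is therefore left out.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_in_cell(query_tags, candidates, k=5):
    """Estimate a location from the k most textually similar candidates.

    candidates: list of (lat, lon, tags) for training images in the selected MLC.
    Returns (lat, lon), or None if the cell holds no candidates.
    """
    ranked = sorted(candidates, key=lambda c: jaccard(query_tags, c[2]), reverse=True)
    top = ranked[:k]
    if not top:
        return None
    lat = sum(c[0] for c in top) / len(top)
    lon = sum(c[1] for c in top) / len(top)
    return lat, lon
```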

2.1.1 Feature Selection

To increase the robustness of the model and reduce its size, feature selection was performed based on two measures: the accuracy and the locality of the tags.

Accuracy is computed using the cross-validation scheme proposed in [5]. The training set is partitioned into p folds (here, p = 10). Subsequently, one partition at a time is withheld, and the remaining p - 1 partitions are used to build the LM. Having built the LM, the location of every item of the withheld partition is estimated. The accuracy of a tag is computed based on Equation 1.

$$\mathrm{tgeo}(t) = \frac{N_r}{N_t} \qquad (1)$$

where $\mathrm{tgeo}(t)$ is the accuracy score of tag $t$, $N_r$ is the total number of correctly geotagged items tagged with $t$, and $N_t$ is the total number of items tagged with $t$. The tags with a non-zero accuracy score form a tag set denoted as $T_a$.
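As an illustration of Equation 1, the per-tag accuracy can be accumulated over the withheld items once their locations have been estimated. In this sketch the criterion for a "correctly geotagged" item is not modelled: it is assumed to be supplied by the caller as a boolean per item, and the names `tag_accuracy` and `is_correct` are hypothetical.

```python
from collections import Counter


def tag_accuracy(items, is_correct):
    """Compute tgeo(t) = N_r / N_t for every tag.

    items: iterable of (item_id, tags) from the withheld folds.
    is_correct: {item_id: bool}, True if the item was geotagged correctly.
    """
    n_t, n_r = Counter(), Counter()
    for item_id, tags in items:
        for tag in set(tags):
            n_t[tag] += 1
            if is_correct[item_id]:
                n_r[tag] += 1
    return {tag: n_r[tag] / n_t[tag] for tag in n_t}


def accurate_tags(scores):
    """Tags with a non-zero accuracy score (the set T_a)."""
    return {tag for tag, s in scores.items() if s > 0}
```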

Locality captures the spatial awareness of tags. For every individual tag, the locality score is calculated based on the tag frequency and the neighboring users that have used it in the various cells. Every time a user uses a given tag, that user is assigned to the respective location cell. As a result, each cell has a set of users assigned to it. All users assigned to the same cell are considered neighbors (for that particular cell). The locality score is then computed by Equation 2.

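$$\mathrm{loc}(t) = N_t \cdot \frac{\sum_{c \in C} \sum_{u \in U_{t,c}} \left| \{ u' \mid u' \in U_{t,c},\ u' \neq u \} \right|}{N_t^2} \qquad (2)$$

where $\mathrm{loc}(t)$ is the locality score of tag $t$, $N_t$ is the total number of occurrences of $t$, $C$ denotes the set of all cells, and $U_{t,c}$ denotes the set of users that used tag $t$ inside cell $c$. Since all users in $U_{t,c}$ are neighbors, every user in a cell has $|U_{t,c}| - 1$ neighbors, and Equation 2 can be simplified to:

$$\mathrm{loc}(t) = \frac{\sum_{c \in C} |U_{t,c}| \left( |U_{t,c}| - 1 \right)}{N_t}$$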
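The locality score can be checked numerically with a short sketch that evaluates Equation 2 literally and in its simplified form, in which each user of a cell contributes one count per other user of the same cell. The function names and the input layout (a mapping from cell to the set of users of the tag in that cell) are assumptions made for illustration.

```python
def locality_direct(cell_users, n_t):
    """Literal evaluation of Equation 2.

    cell_users: {cell: set of user ids that used the tag in that cell}.
    n_t: total number of occurrences of the tag.
    """
    total = 0
    for users in cell_users.values():
        for u in users:
            # number of neighbors of u in this cell
            total += sum(1 for v in users if v != u)
    return n_t * total / n_t ** 2


def locality_simplified(cell_users, n_t):
    """Simplified form: sum over cells of |U|(|U| - 1), divided by N_t."""
    return sum(len(u) * (len(u) - 1) for u in cell_users.values()) / n_t
```

A tag used by a single user per cell gets a locality of zero under either form, since no cell contains a pair of neighbors.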

2.1.2 Feature Weighting

Since the locality metric is sensitive to tag frequency, we consider it inappropriate for directly weighting tags. Instead, having computed the locality scores for every tag in $T$, we sort the tags by score and calculate their weights from their position in the sorted distribution:

$$w_l = \frac{|T| - (j - 1)}{|T|} \qquad (3)$$

where $w_l$ is the weight of the tag $t$ at the $j$-th position in the distribution and $|T|$ is the total number of tags contained in $T$. This weighting approach returns values in the range $(0, 1]$. To combine the two weighting functions, we normalize the values of the Spatial Entropy weighting function, denoted by $w_{se}$, and use Equation 4 to compute the final weights.

$$w = \omega \, w_{se} + (1 - \omega) \, w_l \qquad (4)$$

The value of $\omega$ was set to 0.2 through empirical assessment on a sample of 10K images.
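The rank-based locality weight and its combination with the Spatial Entropy weight can be sketched as below. Two points are assumptions rather than statements from the text: tags are ranked in decreasing order of locality, so that the most local tag receives weight 1, and the SE weights passed to `combine` are taken to be already normalized to [0, 1]. The function names are hypothetical.

```python
def rank_weights(locality_scores):
    """Return {tag: w_l} with w_l = (|T| - (j - 1)) / |T|.

    j is the 1-based rank of the tag when tags are sorted by decreasing
    locality score (assumed ordering).
    """
    ranked = sorted(locality_scores, key=locality_scores.get, reverse=True)
    n = len(ranked)
    return {tag: (n - (j - 1)) / n for j, tag in enumerate(ranked, start=1)}


def combine(w_se, w_l, omega=0.2):
    """Final weight w = omega * w_se + (1 - omega) * w_l for every tag in w_l."""
    return {t: omega * w_se[t] + (1 - omega) * w_l[t] for t in w_l}
```

With four tags, for example, the ranks map to the weights 1.0, 0.75, 0.5 and 0.25, which stay inside the stated range (0, 1].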

2.1.3 Confidence

To evaluate the confidence of the estimate for each query image, we use the confidence measure of Equation 5:

$$\mathrm{conf}(i) = \frac{\sum_{c \in C} \{ p(c|i) \mid \mathrm{dist}(c, mlc) < l \}}{\sum_{c \in C} p(c|i)} \qquad (5)$$

where $\mathrm{conf}(i)$ is the confidence for query image $i$, $p(c|i)$ is the probability of cell $c$ for image $i$, $\mathrm{dist}(c_1, c_2)$ is the distance between the centers of cells $c_1$ and $c_2$, and $mlc$ stands for the Most Likely Cell.
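A minimal sketch of the confidence measure follows. It assumes that the distance between cell centers is the great-circle (haversine) distance in kilometres and treats the threshold l as a caller-supplied parameter `l_km`, since its value is not given here. The names `haversine_km` and `confidence` and the dictionary layouts are hypothetical.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def confidence(cell_probs, centers, mlc, l_km):
    """Share of the probability mass lying within l_km of the Most Likely Cell.

    cell_probs: {cell: p(c|i)}; centers: {cell: (lat, lon)} of cell centers.
    """
    total = sum(cell_probs.values())
    if total == 0:
        return 0.0
    near = sum(p for c, p in cell_probs.items()
               if haversine_km(centers[c], centers[mlc]) < l_km)
    return near / total
```

A value close to 1 indicates that nearly all probability mass is concentrated around the MLC, while a low value indicates that the mass is spread over distant cells.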

2.2 Visual-based location estimation

We compute visual-based location estimates with CNN features adapted to the tourist domain, using approximately 1000 Points Of Interest (POIs) for training, with approximately 1200 images per POI, that were fed directly to Caffe [3]. These features were computed by fine-tuning the VGG model proposed at ILSVRC 2014 [7]. The outputs of the fc7 layer (4096 dimensions) were compressed to 128 dimensions using a PCA matrix learned from a subset of 250,000 images of the CNN training set, and were used to compute image similarities. CNN features were selected after a favorable comparison against compact VLAD features of similar size [8] and SURF features of significantly larger size [2].

Having calculated these similarities, we retrieve the top k most visually similar images and use their locations to perform the estimate. In the visual-only run (RUN-2), k = 20 and we apply a simple incremental spatial clustering scheme: if the j-th image (out of the k most similar) is within 1 km of the closest of the previous j - 1 images, it is assigned to that image's cluster; otherwise it forms its own cluster. In the end, the largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate.
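The incremental spatial clustering of the visual-only run can be sketched as follows. The sketch assumes that the input locations are already ordered by decreasing visual similarity, that distances are great-circle distances, and that the centroid is the plain mean of latitudes and longitudes. The function names are hypothetical.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cluster_estimate(locations, radius_km=1.0):
    """Incrementally cluster the top-k locations and return the centroid
    of the largest cluster (the earliest created one in case of a tie)."""
    if not locations:
        return None
    clusters = []  # clusters in creation order, each a list of points
    labels = []    # cluster index of each processed location
    for j, p in enumerate(locations):
        best, best_d = None, float("inf")
        for i in range(j):  # find the closest previously processed image
            d = haversine_km(p, locations[i])
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= radius_km:
            idx = labels[best]
            clusters[idx].append(p)
        else:
            idx = len(clusters)
            clusters.append([p])
        labels.append(idx)
    largest = max(clusters, key=len)  # max() keeps the first maximal cluster
    lat = sum(pt[0] for pt in largest) / len(largest)
    lon = sum(pt[1] for pt in largest) / len(largest)
    return lat, lon
```

Because each image joins the cluster of its single closest predecessor, clusters can grow into chains longer than 1 km; this follows the description above and is not a strict diameter bound.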

[Table 1: measures reported per run: acc(1m), acc(10m), acc(100m), acc(1km), acc(10km), acc(100km), acc(1000km), m. error (km); table values not recovered]

2.3 Hybrid location estimation

For the hybrid approach, we build an LM using the scheme described in Section 2.1. To achieve further improvement at finer granularities with the Similarity Search approach, the similarity between two images is derived from the combination of their visual and textual similarities. To this end, we normalize the visual similarities to the range [0, 1]. The final similarity for a pair of images is computed as the arithmetic mean of the two similarities. We then retrieve the top k = 5 most similar images within the borders specified by the Internal Grid technique [5], and we use their center-of-gravity as the final location estimate.

For those test images where no estimate can be produced based on the LM, or where the confidence is lower than 0.02 (which together amount to approximately 10% of the test set), we use the visual approach to produce the estimate.
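The combination of similarities and the fallback rule can be sketched as follows. The normalization method is not specified in the text, so min-max scaling is used here as an assumption; the function names and the 0.02 default for `threshold` (taken from the paragraph above) are the only other choices made.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all similarities are equal: no ranking information, map to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_similarity(textual, visual_norm):
    """Arithmetic mean of the textual and the normalized visual similarity."""
    return 0.5 * (textual + visual_norm)


def choose_estimate(lm_estimate, conf, visual_estimate, threshold=0.02):
    """Fall back to the visual estimate when the LM gives no estimate
    or its confidence is below the threshold."""
    if lm_estimate is None or conf < threshold:
        return visual_estimate
    return lm_estimate
```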

3. RUNS AND RESULTS

We prepared two tag-based (RUN-1, RUN-4), one visual (RUN-2) and two hybrid runs (RUN-3, RUN-5). Runs 1-3 used the training set released by the organisers; in Runs 4-5, the entire YFCC dataset was used, excluding all images from users that appeared in the test set. All runs contained estimates for the full test set (949,889 items).

According to Table 1, the best performance in terms of both median error and accuracy in all ranges was attained by RUN-5. Comparing the corresponding runs with different training sets, one may conclude that the use of an extended training set (that does not contain user-specific information) had considerable impact on the accuracy results across all ranges. Furthermore, the combination of visual and textual features in RUN-5 further improved the overall performance (reaching 7.83% accuracy for the <100 m range) and minimized the median error (22 km). The visual-only run (RUN-2) obtained remarkable results (reaching 5.19% accuracy for the <1 km range).

In the future, we plan to look deeper into different weighting schemes in order to achieve further improvements. Moreover, we plan to develop more sophisticated clustering models for the visual-only runs.

ACKNOWLEDGEMENTS

This work is supported by the REVEAL and USEMP projects, partially funded by the European Commission under contract numbers 610928 and 611596 respectively.

[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The Placing Task at MediaEval 2015. In MediaEval 2015 Placing Task, 2015.

[2] J. Choi and X. Li. The 2014 ICSI/TU Delft location estimation system. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, October 16-17, 2014.

[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[4] G. Kordopatis-Zilos, G. Orfanidis, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2014. In MediaEval 2014 Placing Task, 2014.

[5] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21-40, 2015.

[6] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.

[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[8] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.

[9] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015.

[10] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1-48:8, New York, NY, USA, 2011. ACM.