           CERTH/CEA LIST at MediaEval Placing Task 2015

         Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1, and Yiannis Kompatsiaris1
              1 Information Technologies Institute, CERTH, Greece. [georgekordopatis,papadop,ikom]@iti.gr
              2 CEA, LIST, 91190 Gif-sur-Yvette, France. adrian.popescu@cea.fr

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany


ABSTRACT
   We describe the participation of the CERTH/CEA LIST team in the Placing Task of MediaEval 2015. We submitted five runs in total to the Locale-based placing sub-task, providing estimated locations for the test set released by the organisers. Of the five runs, two are based solely on textual information, using feature selection and weighting methods over an existing language model-based approach. One is based on visual content, using geospatial clustering over the most visually similar images, and two are based on hybrid approaches, using both visual and textual cues from the images. The best results (median error 22km, 27.5% accuracy at 1km) were obtained when visual and textual features were combined and external data was used for training.

1.    INTRODUCTION
   The goal of the task is to produce location estimates for a set of 931,573 photos and 18,316 videos, using a set of 4.7M geotagged items and their metadata for training [1]. For the tag-based runs, we built upon the scheme of our 2014 participation [4] and a number of recent extensions to it [5], focusing on improved feature selection and feature weighting. For the visual-based location estimation, we use a geospatial clustering scheme over the most visually similar images for every query image. A hybrid scheme combines the textual and visual approaches. To further improve the model, we also constructed it using all geotagged metadata from the YFCC dataset [9], after removing all images from the users contained in the test set.
2.    APPROACH DESCRIPTION

2.1    Tag-based location estimation
   Following our last year's approach [4] (baseline), the earth surface is divided into (nearly) rectangular cells of size 0.01° latitude/longitude (approximately 1km² near the equator). We construct a Language Model (LM) [6], i.e. a tag-cell probability map, by processing the tags and titles of the training set images. The tag-cell probabilities are computed based on the user count of each tag in each cell. Then, the Most Likely Cell (MLC) of a query (test) image is derived from the summation of the respective tag-cell probabilities. The contribution of each tag is weighted based on its spatial entropy through a Gaussian weight function [5], which is referred to as the Spatial Entropy (SE) function.
   To ensure more reliable prediction at finer granularities, we built an additional LM using a finer grid (cell side length of 0.001°). Having computed the MLCs for both the coarse and the fine granularity, we apply an Internal Grid technique [4] as a means to produce more accurate, yet equally reliable, location estimates. This is achieved by first selecting the most appropriate granularity (the finer grid cell if considered reliable, otherwise the coarser grid cell), and then producing the location estimate as the center-of-gravity of the k most textually similar images inside the selected MLC (k = 5), employing Similarity Search as in [10]. The textual similarity is computed using the Jaccard similarity of the corresponding sets of tags.
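
   As an illustration of this final step, the following is a minimal Python sketch that ranks the images inside the selected cell by Jaccard tag similarity and returns the center-of-gravity of the top k. The data layout (images as (tag set, latitude, longitude) tuples) and all names are our own illustrative assumptions, not the actual implementation.

    # Sketch of the Similarity Search step: rank images inside the
    # selected MLC by Jaccard tag similarity, then take the
    # center-of-gravity of the top k.

    def jaccard(tags_a, tags_b):
        """Jaccard similarity between two sets of tags."""
        union = tags_a | tags_b
        return len(tags_a & tags_b) / len(union) if union else 0.0

    def estimate_within_cell(query_tags, cell_images, k=5):
        """cell_images: (tag_set, lat, lon) tuples located inside the MLC."""
        top = sorted(cell_images,
                     key=lambda img: jaccard(query_tags, img[0]),
                     reverse=True)[:k]
        if not top:
            return None
        # Center-of-gravity approximated by the mean of the coordinates,
        # which is adequate within a single small cell.
        lat = sum(img[1] for img in top) / len(top)
        lon = sum(img[2] for img in top) / len(top)
        return lat, lon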
2.1.1    Feature Selection
   To increase the robustness of the model and to reduce its size, feature selection was performed based on two measures: the accuracy and the locality of the tags.
   Accuracy is computed using the cross-validation scheme proposed in [5]. The training set is partitioned into p folds (here, p = 10). Subsequently, one partition at a time is withheld, and the remaining p - 1 partitions are used to build the LM. Having built the LM, the location of every item of the withheld partition is estimated. The accuracy of a tag is then computed based on Equation 1:

        tgeo(t) = \frac{N_r}{N_t},    (1)

where tgeo(t) is the accuracy score of tag t, N_r is the total number of correctly geotagged items tagged with t, and N_t is the total number of items tagged with t. The tags with non-zero accuracy score form a tag set denoted as T_a.
   Locality captures the spatial awareness of tags. For every individual tag, the locality score is calculated based on the tag frequency and the neighbor users that have used it in the various cells. Every time a user uses a given tag, he/she is assigned to the respective location cell. As a result, each cell has a set of users assigned to it, and all users assigned to the same cell are considered neighbors (for that particular cell). The locality score is then computed by Equation 2:

        loc(t) = N_t \cdot \frac{\sum_{c \in C} \sum_{u \in U_{t,c}} |\{u' \mid u' \in U_{t,c}, u' \neq u\}|}{N_t^2},    (2)

where loc(t) is the locality score of tag t, N_t is the total number of occurrences of t, C denotes the set of all cells, and U_{t,c} denotes the set of users that used tag t inside cell c. Since all users in U_{t,c} are neighbors, Equation 2 can be simplified to:

        loc(t) = \frac{\sum_{c \in C} \sum_{u \in U_{t,c}} (|U_{t,c}| - 1)}{N_t} = \frac{\sum_{c \in C} |U_{t,c}| (|U_{t,c}| - 1)}{N_t}.

The tags with non-zero locality score form a tag set denoted as T_l. The final tag set T used by the approach is the intersection of the two: T = T_a \cap T_l.
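
   The two scores translate directly into code. The sketch below assumes a tag's usage is available as a list of (user, cell) occurrence pairs; the names are illustrative, and the locality function implements the simplified form of Equation 2.

    from collections import defaultdict

    def tgeo_score(n_correct, n_total):
        """Equation 1: share of correctly geotagged items among all
        items carrying the tag."""
        return n_correct / n_total if n_total else 0.0

    def locality_score(occurrences):
        """Simplified Equation 2. occurrences: (user, cell) pairs,
        one per use of the tag; N_t is the total number of occurrences."""
        n_t = len(occurrences)
        if n_t == 0:
            return 0.0
        users_per_cell = defaultdict(set)
        for user, cell in occurrences:
            users_per_cell[cell].add(user)
        # Sum over cells of |U_{t,c}| * (|U_{t,c}| - 1), divided by N_t.
        pairs = sum(len(users) * (len(users) - 1)
                    for users in users_per_cell.values())
        return pairs / n_t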
2.1.2    Feature Weighting
   Since the locality metric is sensitive to tag frequency, we consider it inappropriate for weighting tags directly. Instead, having computed the locality scores for every tag in T, we sort the tags by their scores and calculate their weights from their position in the resulting distribution:

        w_l = \frac{|T| - (j - 1)}{|T|},    (3)

where w_l is the weight value of the tag t at the j-th position in the distribution and |T| is the total number of tags contained in T. This weighting approach returns values in the range (0, 1]. To combine the two weighting functions, we normalize the values of the Spatial Entropy weighting function, denoted with w_se, and use Equation 4 to compute the final weights:

        w = \omega \cdot w_{se} + (1 - \omega) \cdot w_l.    (4)

The value of ω was set to 0.2 through empirical assessment on a sample of 10K images.
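
   For illustration, a minimal sketch of this weighting scheme follows. Min-max normalization is assumed for the SE weights (only the fact that they are normalized matters here), and all names are illustrative.

    def combined_weights(locality_scores, se_weights, omega=0.2):
        """Equations 3 and 4: rank-based locality weights blended with
        normalized Spatial Entropy weights. Both arguments map tag -> score."""
        # Equation 3: the tag at (1-indexed) position j in the descending
        # locality ranking gets weight (|T| - (j - 1)) / |T|, in (0, 1].
        ranked = sorted(locality_scores, key=locality_scores.get, reverse=True)
        size = len(ranked)
        w_l = {tag: (size - j) / size for j, tag in enumerate(ranked)}

        # Min-max normalization of the SE weights to [0, 1] (assumed).
        lo = min(se_weights.values())
        hi = max(se_weights.values())
        span = (hi - lo) or 1.0
        w_se = {tag: (se_weights[tag] - lo) / span for tag in se_weights}

        # Equation 4: w = omega * w_se + (1 - omega) * w_l.
        return {tag: omega * w_se.get(tag, 0.0) + (1 - omega) * w_l[tag]
                for tag in w_l}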
2.1.3    Confidence
   To evaluate the confidence of the estimate for each query image, we use the confidence measure of Equation 5:

        conf(i) = \frac{\sum_{c \in C} \{ p(c|i) \mid dist(c, mlc) < l \}}{\sum_{c \in C} p(c|i)},    (5)

where conf(i) is the confidence for query image i, p(c|i) is the cell probability of cell c for image i, dist(c_1, c_2) is the distance between the centers of cells c_1 and c_2, l is a distance threshold, and mlc stands for the Most Likely Cell.
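
   Equation 5 maps directly onto a few lines of code. The sketch below assumes the cell probabilities of an image are available as a mapping and that a cell-center distance function is supplied; all names are illustrative.

    def confidence(cell_probs, mlc, dist, l):
        """Equation 5: the fraction of the total probability mass that
        lies within distance l of the Most Likely Cell.

        cell_probs: mapping cell -> p(c|i); dist(c1, c2): distance
        between the centers of two cells; l: distance threshold.
        """
        total = sum(cell_probs.values())
        if total == 0.0:
            return 0.0
        near = sum(p for cell, p in cell_probs.items() if dist(cell, mlc) < l)
        return near / total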
2.2    Visual-based location estimation
   We compute visual-based location estimates with CNN features adapted to the tourist domain, using approximately 1,000 Points Of Interest (POIs) for training, with approximately 1,200 images per POI, fed directly to Caffe [3]. These features were computed by fine-tuning the VGG model proposed at ILSVRC 2014 [7]. The outputs of the fc7 layer (4096 dimensions) were compressed to 128 dimensions using a PCA matrix learned from a subset of 250,000 images of the CNN training set, and were used to compute image similarities. CNN features were selected after a favorable comparison against compact VLAD features of similar size [8] and SURF features of significantly larger size [2]. Having calculated these similarities, we retrieve the top k most visually similar images and use their locations to produce the estimate. In the visual-only run (RUN-2), k = 20 and we apply a simple incremental spatial clustering scheme: if the j-th image (out of the k most similar) is within 1km of the closest of the previous j - 1 images, it is assigned to that image's cluster; otherwise it forms its own cluster. In the end, the largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate.
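
   The incremental clustering scheme can be sketched as below, assuming the retrieved locations arrive ordered by decreasing visual similarity; the haversine distance and all names are our own illustrative choices, not the actual implementation.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(p, q):
        """Great-circle distance in km between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * asin(sqrt(a))

    def cluster_and_estimate(locations, radius_km=1.0):
        """Incremental spatial clustering of the top-k neighbor locations:
        each point joins the cluster of its closest predecessor if that
        predecessor lies within radius_km, otherwise it opens a new cluster.
        Returns the centroid of the largest (first on ties) cluster."""
        clusters = []      # each cluster is a list of (lat, lon) points
        assignment = []    # cluster index of every processed point
        for j, point in enumerate(locations):
            if j > 0:
                closest = min(range(j),
                              key=lambda i: haversine_km(point, locations[i]))
                if haversine_km(point, locations[closest]) <= radius_km:
                    idx = assignment[closest]
                    clusters[idx].append(point)
                    assignment.append(idx)
                    continue
            clusters.append([point])
            assignment.append(len(clusters) - 1)
        largest = max(clusters, key=len)   # max() keeps the first on ties
        lat = sum(p[0] for p in largest) / len(largest)
        lon = sum(p[1] for p in largest) / len(largest)
        return lat, lon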
    measure         RUN-1   RUN-2   RUN-3   RUN-4   RUN-5
    acc(1m)          0.15    0.01    0.15    0.16    0.16
    acc(10m)         0.61    0.08    0.62    0.75    0.76
    acc(100m)        6.40    1.76    6.52    7.73    7.83
    acc(1km)        24.33    5.19   24.61   27.30   27.54
    acc(10km)       43.07    7.43   43.41   46.48   46.77
    acc(100km)      51.08    9.07   51.45   54.02   54.33
    acc(1000km)     63.81   23.98   64.18   65.81   66.06
    m. error (km)      69    5663      61      24      22

Table 1: Geotagging accuracy (%) at different ranges and median error (km). RUN-1 and RUN-4 used only text, RUN-2 relied on visual features, and RUN-3 and RUN-5 used both visual and textual features.

2.3    Hybrid location estimation
   For the hybrid approach, we build an LM using the scheme described in Section 2.1. To achieve further improvement at finer granularities with the use of the Similarity Search approach, the similarity between two images is derived from the combination of their visual and textual similarities. To this end, we normalize the visual similarities to the range [0, 1]. The final similarity for a pair of images is computed as the arithmetic mean of the two similarities. We then retrieve the top k = 5 most similar images, within the borders specified by the Internal Grid technique [5], and use their center-of-gravity as the final location estimate.
   For those test images for which no estimate can be produced based on the LM, or for which the confidence is lower than 0.02 (together approximately 10% of the test set), we use the visual approach to produce the estimate.
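
   The fusion and fallback logic amounts to a few lines; the sketch below assumes visual similarities have already been normalized to [0, 1], and all names are illustrative.

    def hybrid_similarity(visual_sim, textual_sim):
        """Arithmetic mean of the normalized visual similarity and the
        Jaccard textual similarity for a pair of images."""
        return 0.5 * (visual_sim + textual_sim)

    def final_estimate(lm_estimate, lm_confidence, visual_estimate,
                       threshold=0.02):
        """Fall back to the visual-only estimate when the LM produces no
        estimate or its confidence is below the threshold."""
        if lm_estimate is None or lm_confidence < threshold:
            return visual_estimate
        return lm_estimate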
3.    RUNS AND RESULTS
   We prepared two tag-based runs (RUN-1, RUN-4), one visual run (RUN-2), and two hybrid runs (RUN-3, RUN-5). Runs 1-3 used the training set released by the organisers; in Runs 4-5, the entire YFCC dataset was used, excluding all images from users that appeared in the test set. All runs contained estimates for the full test set (949,889 items).
   According to Table 1, the best performance in terms of both median error and accuracy at all ranges was attained by RUN-5. Comparing the corresponding runs with different training sets, one may conclude that the use of an extended training set (that does not contain user-specific information) had considerable impact on the accuracy results across all ranges. Furthermore, the combination of visual and textual features in RUN-5 further improved the overall performance, reaching 7.83% accuracy for the <100m range and minimizing the median error (22km). The visual-only run (RUN-2) obtained noteworthy results for a purely content-based approach, reaching 5.19% accuracy for the <1km range.
   In the future, we plan to look deeper into different weighting schemes in order to achieve further improvements. Moreover, we plan to develop more sophisticated clustering models for the visual-only runs.

4.    ACKNOWLEDGEMENTS
   This work is supported by the REVEAL and USEMP projects, partially funded by the European Commission under contract numbers 610928 and 611596 respectively.
5.    REFERENCES
[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The Placing Task at MediaEval 2015. In MediaEval 2015 Placing Task, 2015.
[2] J. Choi and X. Li. The 2014 ICSI/TU Delft location estimation system. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, October 16-17, 2014.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[4] G. Kordopatis-Zilos, G. Orfanidis, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2014. In MediaEval 2014 Placing Task, 2014.
[5] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015.
[6] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[8] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.
[9] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015.
[10] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.