Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Giorgos Kordopatis-Zilos (1), Adrian Popescu (2), Symeon Papadopoulos (1), and Yiannis Kompatsiaris (1)
(1) Information Technologies Institute, CERTH, Greece. [georgekordopatis, papadop, ikom]@iti.gr
(2) CEA, LIST, 91190 Gif-sur-Yvette, France. adrian.popescu@cea.fr

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

We describe the participation of the CERTH/CEA-LIST team in the MediaEval 2016 Placing Task. We submitted five runs to the estimation-based sub-task: one based only on text, employing a Language Model-based approach with several refinements; one based on visual content, using geo-spatial clustering over the most visually similar images; and three based on a hybrid scheme exploiting both visual and textual cues from the multimedia items, trained on datasets of different size and origin. The best results were obtained by a hybrid approach trained with external training data and using two publicly available gazetteers.

1. INTRODUCTION

The goal of the task is to estimate the location of 1,497,464 photos and 29,934 videos using a set of ≈5M geotagged items and their metadata for training [1]. All submitted runs are built upon the scheme of our last year's participation [4], integrating several refinements. For the text-based runs, we focused on improving the pre-processing of the training set items' metadata and on refining the feature selection method. For the visual-based runs, we built a more generic deep neural network model for enhanced visual image representation. For the hybrid scheme, we devised a score for selecting between the text and visual estimations based on prediction confidence. To further improve performance, we built a model using all geotagged items of the YFCC dataset [8] (items uploaded by users in the test set are not included), and we leveraged structured information from open geographical resources, namely Geonames (http://www.geonames.org/) and OpenStreetMap (https://www.openstreetmap.org/).

2. APPROACH DESCRIPTION

2.1 Text-based location estimation

In the first step, the tags and titles of the training set items were pre-processed. We applied URL decoding (necessary because text in different languages was URL-encoded in the released dataset), lowercase transformation, tokenization and accent removal to generate a set of terms for every item. Multi-word tags were further split into their individual components, which were also included in the item's term set. Finally, symbols and punctuation were removed from the terms, and terms consisting only of numerals or of fewer than three characters were discarded.

The core of our approach is a probabilistic Language Model (LM) [5] built from the terms of the training set items. The earth surface was divided into (nearly) rectangular cells of size 0.01°×0.01° latitude/longitude, and the term-cell probabilities were computed based on the user count of each term in each cell. The most likely cell (mlc) of a query is derived from the summation of the respective term-cell probabilities. Query items with no textual information are placed at the centre of the cell with the most users.

For feature selection, we used a refined version of the locality metric [4]: in our last participation, we computed locality based on the neighbor users that used the same term in the same cell. To this end, we utilized a coarse grid (0.1°×0.1°) for the calculation, based on which the neighbor users were assigned to a unique cell, as depicted in Figure 1(a). In that setting, it was possible that a pair of users were not assigned to the same cell even if the geodesic distance between their items was small. To tackle this issue, we now used a grid of 0.01°×0.01° and modified the assignment of users to cells: instead of assigning a user to a unique cell, we assigned the user to an entire neighborhood, as illustrated in Figure 1(b). The area highlighted in orange corresponds to the cells where both users were assigned. The terms with non-negative locality score form the selected term set T. The contribution of each term was then weighted based on its locality and spatial entropy scores. Spatial entropy is a Gaussian weight function based on the term-cell entropy of the term [2]. The two measures are combined to generate a weight value for every term in T.

Figure 1: Locality examples: (a) initial, (b) refined.

To ensure more robust performance at fine granularity, we built an additional LM using a finer grid (0.001°×0.001°). Having computed the mlc for both coarse and fine granularities, we selected the most appropriate estimation: the mlc of the finer grid if it falls within the borders of the coarse one, otherwise the mlc of the coarse grid. Finally, we employed similarity search as in [9] to derive the location estimate from the k_t = 5 most textually similar images inside the selected mlc, computing textual similarity as the Jaccard similarity between the corresponding term sets. An error case analysis of the text method is presented in [3].
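To make the cell-based LM concrete, the Python sketch below illustrates the overall flow under simplifying assumptions: it omits URL decoding, multi-word splitting, the locality/entropy term weighting and the finer 0.001° grid, and derives term-cell probabilities from plain per-cell user counts. All function and variable names (preprocess, build_lm, most_likely_cell, cell_of) are ours, introduced only for illustration.

```python
import math
import re
import unicodedata
from collections import defaultdict

CELL = 0.01  # cell size in degrees of latitude/longitude

def preprocess(text):
    """Simplified term extraction: lowercase, strip accents, tokenize,
    drop purely numeric terms and terms shorter than three characters."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    terms = re.findall(r"[a-z0-9]+", text)
    return {t for t in terms if not t.isdigit() and len(t) >= 3}

def cell_of(lat, lon, size=CELL):
    """Map coordinates to a (nearly) rectangular grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))

def build_lm(training_items, size=CELL):
    """Count, for every term, the distinct users that used it in each cell,
    then normalise the counts into term-cell probabilities."""
    users = defaultdict(lambda: defaultdict(set))
    for item in training_items:  # items: dicts with 'user', 'lat', 'lon', 'text'
        c = cell_of(item["lat"], item["lon"], size)
        for term in preprocess(item["text"]):
            users[term][c].add(item["user"])
    lm = {}
    for term, cells in users.items():
        total = sum(len(u) for u in cells.values())
        lm[term] = {c: len(u) / total for c, u in cells.items()}
    return lm

def most_likely_cell(lm, query_text):
    """Sum the term-cell probabilities over the query terms and
    return the cell with the highest accumulated score."""
    scores = defaultdict(float)
    for term in preprocess(query_text):
        for c, p in lm.get(term, {}).items():
            scores[c] += p
    return max(scores, key=scores.get) if scores else None
```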
2.2 Visual-based location estimation

The employed method is a refined version of the one used in last year's participation [4]. The main objectives were (1) to ensure that the visual features are generic and transferable from a training set independent of YFCC to the subset of the collection used for the task, and (2) to provide a compact representation of the features in order to scale up the visual search process. To meet the first objective, the VGG architecture [7] was fine-tuned with over 5,000 diversified man-made and natural POIs, represented by over 7 million images. These were downloaded from Flickr using queries with (1) the POI name and a radius of 5 km around its coordinates and (2) the POI name and the associated city name. Following the conclusions of [6] regarding the uselessness of manual annotation for POI representation, no manual validation of the training set was performed. To meet the second objective, we used the same procedure as last year and compressed the initial features (VGG fc7, 4096 dimensions) to 128 dimensions using PCA. The PCA matrix was learned on a subset of 250,000 images of the training set.

Visual similarities between the query and the training images were then computed in this compressed feature space. Having calculated these similarities, we retrieved the top k_v most visually similar images (in our runs we set k_v = 20) and applied a simple spatial clustering scheme based on their geographical distance. We defined a confidence metric for our visual approach based on the size of the largest cluster:

    conf_v(i) = max((n(i) − n_t) / (k_v − n_t), 0)    (1)

where n(i) is the number of neighbors in the largest cluster for query image i, and n_t is a configuration parameter that determines the "strictness" of the confidence score. The confidence score takes values in the range [0, 1]. We empirically set n_t = 5. Our visual approach is not designed for video analysis, thus all videos were placed in the centre of London, which is the densest geotagged region in the world.
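As an illustration only, the sketch below shows one way to compress fc7 features with PCA and to evaluate the confidence of Equation 1 from the largest geographic cluster among the retrieved neighbors. The scikit-learn PCA stands in for whatever implementation was actually used, the greedy distance-threshold clustering with a 1 km radius is our assumption (the paper only states "a simple spatial clustering scheme"), and all names (compress_features, visual_confidence, haversine_km) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_features(train_feats, feats, dim=128):
    """Learn a PCA projection on a training subset of fc7 features (4096-d),
    e.g. 250,000 images, and project 'feats' down to 'dim' dimensions."""
    pca = PCA(n_components=dim).fit(train_feats)
    return pca.transform(feats)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def visual_confidence(neighbor_coords, k_v=20, n_t=5, radius_km=1.0):
    """Equation 1: conf_v = max((n - n_t) / (k_v - n_t), 0), where n is the
    size of the largest spatial cluster among the k_v retrieved neighbors.
    Clustering here is a simple greedy grouping by distance (assumed)."""
    clusters = []
    for lat, lon in neighbor_coords:
        for cluster in clusters:
            c_lat, c_lon = cluster[0]
            if haversine_km(lat, lon, c_lat, c_lon) <= radius_km:
                cluster.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])
    n = max((len(c) for c in clusters), default=0)
    return max((n - n_t) / (k_v - n_t), 0.0)
```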
2.3 Hybrid location estimation

The hybrid approach comprises a set of rules that determine the source of estimation between the text and visual approaches. First, for query images for which no estimation could be produced by the text-based approach, the location was estimated based on the visual approach. Otherwise, if the visual estimation fell inside the borders of the mlc calculated by the text-based approach, the visual estimation was selected. If not, the estimation was determined by comparing the confidence scores of the two approaches:

    G_h(i) = G_v(i)  if conf_t(i) ≤ conf_v(i)
             G_t(i)  otherwise                    (2)

where G_h, G_t and G_v are the estimated locations for query item i of the hybrid, textual and visual approach, respectively, conf_t is the confidence score of the text-based estimation, defined in [4], and conf_v is the confidence score of the visual-based estimation (Equation 1).
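A compact sketch of this rule set follows, assuming each estimator returns a (lat, lon) pair or None; the helper inside_mlc stands in for the cell-border test of the text-based mlc, and both it and hybrid_estimate are hypothetical names introduced here for illustration.

```python
def hybrid_estimate(text_est, visual_est, conf_t, conf_v, inside_mlc):
    """Rule-based fusion of the text and visual location estimates.
    text_est / visual_est: (lat, lon) tuples or None.
    inside_mlc: callable telling whether a point falls within the most
    likely cell of the text-based LM (hypothetical helper)."""
    if text_est is None:            # no textual estimation could be produced
        return visual_est
    if visual_est is None:          # nothing to compare against
        return text_est
    if inside_mlc(visual_est):      # visual estimate agrees with the text mlc
        return visual_est
    # Equation 2: keep the estimate with the higher confidence
    return visual_est if conf_t <= conf_v else text_est
```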
3. RUNS AND RESULTS

The submitted runs include one text-based run (RUN-1), one visual-based run (RUN-2) and three hybrid runs (RUN-3, RUN-4, RUN-5). For the first three runs, the system was trained on the set released by the organizers. In RUN-4 and RUN-5, the training set consisted of all YFCC items excluding those contributed by users appearing in the test set. We also report the results of an external run (RUN-E), based on the visual approach but using the full geotagged subset of YFCC. The results for RUN-E show that adding more training data significantly improves visual geolocation, especially at short ranges (10m and 100m), where this run outperforms even the best hybrid run.

To explore the impact of external data sources, in RUN-5 we further leveraged structured data from Geonames and OpenStreetMap. In particular, we used the geotagged entries of the two sources as additional training items for building the text-based LM: from Geonames we used a list of city names along with their alternative names, and from OpenStreetMap a list of nodes (points of interest), provided they were associated with an address. Since training items need to be associated with a contributor, we considered Geonames and OpenStreetMap as the two contributing users.

Table 1: Geotagging precision (%) and median error (km) for the five submitted runs (+RUN-E for images).

(a) Images
measure        RUN-1   RUN-2   RUN-3   RUN-4   RUN-5   RUN-E
P@10m           0.59    0.08    0.56    0.70    0.72    4.78
P@100m          6.42    1.84    6.58    7.96    8.27    8.41
P@1km          24.55    5.62   25.03   27.82   28.54   13.67
P@10km         43.32    8.16   43.73   46.52   46.45   16.60
P@100km        51.26   10.21   51.69   53.96   53.50   18.83
m. error (km)     65    5031      56      24      27    3432

(b) Videos
measure        RUN-1   RUN-2   RUN-3   RUN-4   RUN-5
P@10m           0.55    0.00    0.55    0.69    0.71
P@100m          6.86    0.06    6.86    7.89    8.19
P@1km          22.73    0.50   22.73   25.53   26.16
P@10km         40.60    2.48   40.60   43.89   43.62
P@100km        48.24    4.97   48.24   51.20   50.44
m. error (km)    161    6211     161      68      85

According to Table 1, the best performance at fine granularities (≤1km) was attained by RUN-5 for both images and videos. RUN-4 reported the best results in terms of median distance error and precision at coarse granularities (>1km). Comparing the two runs, one may conclude that leveraging structured geographic information improves geolocation precision at short ranges (reaching 8.27% and 28.54% in P@100m and P@1km respectively), with a minor increase in median error. Moreover, the combination of visual and textual features (RUN-3) improved the overall performance of the system in the case of images, but had no effect on video geotagging, since no visual information was used from videos.

4. ACKNOWLEDGMENTS

This work is supported by the REVEAL and USEMP projects, partially funded by the European Commission under contract numbers 610928 and 611596 respectively.

5. REFERENCES

[1] J. Choi, C. Hauff, O. Van Laere, and B. Thomee. The placing task at MediaEval 2016. In MediaEval 2016 Placing Task, 2016.
[2] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In PAISI 2015, pages 21-40, 2015.
[3] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. In-depth exploration of geotagging performance using sampling strategies on YFCC100M. In Proceedings of MMCommons 2016. ACM, 2016.
[4] G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. CERTH/CEA LIST at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015.
[5] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.
[6] A. Popescu, E. Gadeski, and H. Le Borgne. Scalable domain adaptation of convolutional neural networks. CoRR, abs/1512.02013, 2015.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[8] B. Thomee et al. The new data and new challenges in multimedia research. CoRR, abs/1503.01817, 2015.
[9] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1-48:8, New York, NY, USA, 2011. ACM.