Multimodal image geocoding: the 2013 RECOD’s approach
Lin Tzy Li (1,2), Jurandy Almeida (1,3), Otávio A. B. Penatti (1), Rodrigo T. Calumby (1,4), Daniel C. G. Pedronette (1,5), Marcos A. Gonçalves (6), and Ricardo da S. Torres (1)

(1) RECOD Lab, Institute of Computing, University of Campinas (UNICAMP), Campinas, SP – Brazil, 13083-852
(2) Telecommunications Res. & Dev. Center, CPqD Foundation, Campinas, SP – Brazil, 13086-902
(3) Institute of Science and Technology, Federal University of Sao Paulo (UNIFESP), Sao Jose dos Campos, SP – Brazil, 12231-280
(4) Dept. of Exact Sciences, University of Feira de Santana (UEFS), Feira de Santana, BA – Brazil, 44036-900
(5) Dept. of Stat., Applied Math. and Computing, Universidade Estadual Paulista (UNESP), Rio Claro, SP – Brazil, 13506-900
(6) Dept. of Computer Science, Federal University of Minas Gerais (UFMG), Belo Horizonte, MG – Brazil, 31270-010

{lintzyli, jurandy, penatti, rtripodi, rtorres}@ic.unicamp.br, daniel@rc.unesp.br, mgoncalv@dcc.ufmg.br


ABSTRACT
This work describes the approach used by the RECOD team in the MediaEval Placing Task of 2013, in which we were required to develop an automatic scheme to assign geographical locations to images. Our approach is multimodal, considering textual and visual descriptors, which are combined by a rank aggregation strategy. We estimate the location of test images based on the coordinates of top-ranked images in the list of combined results.

1.    INTRODUCTION
Geocoding multimedia material has gained great attention in recent years, given the importance of providing richer services for users, such as placing information on maps. Image geocoding is the objective of the 2013 Placing Task, i.e., it requires participants to assign geographical locations to images. Details about the Placing Task, its dataset, and the evaluation protocol can be found in [1].

In this paper, we present our multimodal approach, which combines different textual and visual descriptors uniformly. We combine them using a rank aggregation strategy previously introduced in [4].

2.    PROPOSED APPROACH
We handled the task of automatically assigning a geographical location to images using nearest neighbor searches on aggregated ranked lists, which combine textual and visual features. The strengths of our approach are its simplicity and its power to combine multiple description modalities.

For evaluation purposes in the training phase, we selected a validation set of 5,000 images from the development set of around 8.5 million images. First, each photo from the development set was assigned to a fixed cell of 1-by-1 degree based on its ground truth latitude and longitude. Then, the resulting grid was summarized by the number of photos in each cell relative to the dataset size (density). Finally, the 5,000 evaluation images were randomly picked from the cells, taking into account the density of each cell.

2.1    Features

Textual

From textual metadata, we used only the photo tags to compute similarities between the images. The tags were stemmed and stopwords were removed. The text similarity functions used were BM25 and TF-IDF, as implemented by the Lucene API.

Visual

Given the large dataset, we had to carefully select the descriptors to be used. Initially, we evaluated some of the descriptors provided with the dataset, such as color and edge directivity descriptor (CEDD), scalable color (SCD), and Gabor filter. Using the validation set, we noticed that the best results were achieved by CEDD. Although SCD showed the best results in [2], it did not perform well on our validation set for our geocoding approach.

In addition to CEDD, we used BIC (border/interior pixel classification). This descriptor was chosen due to its good results in large scale experiments [5]. For this, we downloaded the whole photo dataset, resized the images to have at most 100 thousand pixels, as suggested by [6] for large scale experiments, and extracted the 128-dimensional BIC feature vector of each image. The Manhattan distance (L1) was used for both BIC and CEDD.

2.2     Rank aggregation
As last year, we used a rank aggregation strategy to combine different descriptors [3]. This year, due to the size of the development set, we created a ranked list limited to the top 1,000 most similar photos for each test image.

We used an aggregation function similar to sim_a proposed in [3] (the numerator is m instead of 2). When the intersection of the top-1000 lists computed by different features is small, the size of the final aggregated list tends to m×1000, where m is the number of features combined. We select the top-1000 images with the highest aggregated score as the output of the rank aggregation step.

2.3     Geocoding
For geocoding the test images, we used a nearest neighbor approach. We used the development set (∼8.5 million images) as geo-profiles, and each test image was compared to the whole development set. For comparing the images, we used each type of feature independently (textual or visual). For a given test image, the ranked list of each feature is produced. All the lists are then combined by our rank aggregation strategy and the final ranked list is generated. The lat/long of the first image (most similar) in this final list is assigned to the test image.
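The flow described in Sections 2.2 and 2.3 (per-feature top-1000 lists, fusion, then 1-NN location transfer) can be sketched as follows. This is a minimal illustration and not the implementation used for the submitted runs: the function names `aggregate` and `geocode`, the photo ids, and the coordinates are hypothetical, and the reciprocal-rank score is only a stand-in for the modified sim_a function of [3], whose definition is not reproduced in this paper.

```python
from collections import defaultdict

TOP_K = 1000  # length of each per-feature ranked list, as in Section 2.2


def aggregate(ranked_lists, top_k=TOP_K):
    """Fuse m per-feature ranked lists of photo ids into one ranked list.

    ASSUMPTION: the score (sum of reciprocal ranks) is a placeholder,
    not the modified sim_a aggregation function from [3].
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, photo_id in enumerate(ranked[:top_k], start=1):
            scores[photo_id] += 1.0 / rank
    # The union holds at most m * top_k distinct photos; sort by score
    # (ties broken by id for determinism) and keep the top_k best.
    fused = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return fused[:top_k]


def geocode(ranked_lists, coordinates):
    """1-NN geocoding: return the (lat, lon) of the top fused photo."""
    fused = aggregate(ranked_lists)
    return coordinates[fused[0]] if fused else None


if __name__ == "__main__":
    # Hypothetical ranked lists for one test image (most similar first).
    textual = ["p3", "p1", "p7"]
    visual = ["p1", "p9", "p3"]
    coords = {"p1": (-22.8, -47.1), "p3": (41.4, 2.2),
              "p7": (0.0, 0.0), "p9": (10.0, 10.0)}
    print(geocode([textual, visual], coords))
```

With a 1-NN rule, only the first position of the fused list matters, so any ties at the top of the list directly affect the estimated location.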
3.    OUR SUBMISSIONS & RESULTS

Submitted runs

Our submissions for this year are:
run1: combines 2 textual descriptors: BM25 + TF-IDF;
run2: combines 2 visual descriptors: BIC + CEDD;
run3: one visual descriptor: BIC;
run4: combines 2 textual and 2 visual descriptors: BM25 + TF-IDF + BIC + CEDD;
run5: combines 4 textual descriptors: BM25 + TF-IDF (footnote 1).

Footnote 1: Non-English tags were translated to English using the Google Translate service and combined with the original tags.

Runs 1 and 5 used only textual features. Thus, for test images without tags, there was no way to apply our similarity ranked list approach. As post-processing, we randomly selected an item from the development set and transferred its latitude and longitude to the test image.

3.1      Results
Besides the organizers' standard evaluation metric, we also applied the WAS score we proposed in [4]. This evaluation metric gives an overview of a method's performance expressed by a score in [0,1], with 0 being very bad and 1 indicating a perfect estimate, and with a higher weight assigned to more precise results. The WAS takes into account every single result of the whole test set to indicate and summarize the level of precision of an evaluated method as a whole.

Let d(i) be the geographic distance between the predicted and the ground truth location of image i. The proposed score for the result of a given test image i is defined as:

\[ \mathrm{score}(i) = 1 - \frac{\log(1 + d(i))}{\log(1 + R_{max})}, \]

where $R_{max}$ is the maximum distance between any two points on the Earth's surface (half of Earth's circumference at the Equator, 20,027.5 km).

Let D be a test dataset with n images whose locations need to be predicted. The overall score for the predictions of a method m is defined as:

\[ \mathrm{WAS}(m) = \frac{\sum_{i=1}^{n} \mathrm{score}(i)}{n}. \]

Table 1: Validation set results.
Precision          Run 1     Run 2     Run 3     Run 4     Run 5
1km               64.56%    16.86%    15.32%    68.82%    64.62%
10km              73.64%    17.68%    16.10%    75.90%    73.60%
100km             77.58%    18.64%    17.04%    78.94%    77.58%
500km             80.20%    22.86%    13.40%    81.10%    80.22%
1000km            82.18%    28.32%    20.12%    82.74%    82.32%
WAS score         0.7866    0.3053    0.2889    0.8019    0.7866
Distance distribution (km)
1st Quartile        0.00    698.40    885.30      0.00      0.00
Median              0.03  5,499.40  5,835.80      0.00      0.04

Table 2: Test results using test3 set (53,000 items).
Precision          Run 1     Run 2     Run 3     Run 4     Run 5
1km               20.14%     0.37%     0.28%    20.11%    18.82%
10km              37.60%     0.80%     0.67%    37.10%    35.93%
100km             47.66%     1.69%     1.51%    46.97%    45.97%
500km             56.62%     6.73%     6.25%    55.83%    55.74%
1000km            63.17%    14.32%    13.78%    62.26%    62.43%
WAS score         0.5240    0.1653    0.1623    0.5190    0.5128
Distance distribution (km)
1st Quartile        1.73  1,869.00  1,962.00      1.76      2.05
Median            168.22  6,632.00  6,729.00    196.79    225.67

As we can observe in Table 2, the test runs that include textual information (runs 1, 4, and 5) yielded the best results, while those based only on visual descriptors presented low accuracy. A possible reason is the semantic gap, as there might be many different places with similar visual appearance, especially in a large dataset like the one used for training. Another potential issue was the large number of ties in the first positions of the ranked lists of visual descriptors. Given our 1-NN geocoding approach, this probably degraded our results. However, we can see that by combining BIC+CEDD (run 2) we improve on the results of BIC alone (run 3). The combination of textual and visual descriptors (run 4) was slightly worse than the textual descriptors alone (run 1). One possible reason is the large difference between textual and visual results.

Observe that our results for the test set (Table 2) were quite different from those for our validation set (Table 1), mainly for the visual features. While BIC achieved less than 1% in the 1km radius on the test3 set, it reached 15.32% on the validation set. Because of this, in the validation set, the fusion (run 4) results improved over run 1. The huge difference between validation and test results might be due to a property of the test set not considered when building the validation set: the users who contributed the photos in the training set are different from those who contributed the photos in the test set.

Regarding the distribution of test results, for the visual descriptors (runs 2 and 3), the 1st Quartile shows that 25% of the items were geocoded within roughly 1,900km of the correct location (1,869km for run 2 and 1,962km for run 3). On the other hand, for the textual descriptors and their combinations (runs 1, 4, and 5), 25% of the items are very close to their correct locations (less than 3km).

4.   CONCLUSIONS
Our best results were observed for the methods based only on textual description. For them, we could geocode around 20% of the testing set (test3) within a 1km radius. Considering visual descriptors, the main challenge this year was the large scale dataset, which poses time and space constraints on the descriptors to be used. Our rank aggregation strategy, for the test set, was only effective for combining textual descriptors. Combining textual and visual descriptors did not improve the results. As future work, we would like to evaluate a more elaborate geocoding approach, similar to the scheme used to create our validation set, for example.

Acknowledgments
We thank FAPESP (2011/11171-5, 2009/10554-8), CNPq (306580/2012-8, 484254/2012-0), CAPES, FAPEMIG, Samsung, ACM SIGIR, and the MediaEval organizers for their support.

5.   REFERENCES
[1] C. Hauff, B. Thomee, and M. Trevisiol. Working Notes for the Placing Task at MediaEval. In MediaEval 2013 Workshop, volume 1043, October 18-19 2013.
[2] P. Kelm, S. Schmiedeke, and T. Sikora. Multimodal geo-tagging in social media websites using hierarchical spatial segmentation. In International Workshop on Location-Based Social Networks, pages 32–39, 2012.
[3] L. T. Li, J. Almeida, D. C. G. Pedronette, O. A. B. Penatti, and R. da S. Torres. A multimodal approach for video geocoding. In Working Notes Proc. MediaEval Workshop, volume 927, 2012.
[4] L. T. Li, D. C. G. Pedronette, J. Almeida, O. A. Penatti, R. T. Calumby, and R. d. S. Torres. A rank aggregation framework for video multimodal geocoding. Mult. Tools and App., pages 1–37, 2013.
[5] O. A. B. Penatti, E. Valle, and R. da S. Torres. Comparative study of global color and texture descriptors for web image retrieval. J. Vis. Comm. and Image Repr., 23(2):359–380, 2012.
[6] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In CVPR, pages 3482–3489, 2012.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain
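As a worked illustration of the WAS definition in Section 3.1, the per-image score and the overall WAS can be computed as in the sketch below. It assumes that distances are given in kilometers and uses the R_max value stated in the text; the function names `score` and `was` are chosen here for illustration, and how d(i) itself is computed from coordinates is outside this sketch.

```python
import math

# R_max as stated in Section 3.1 (km).
R_MAX_KM = 20027.5


def score(distance_km, r_max=R_MAX_KM):
    """Per-image score: 1 - log(1 + d(i)) / log(1 + R_max)."""
    return 1.0 - math.log(1.0 + distance_km) / math.log(1.0 + r_max)


def was(distances_km, r_max=R_MAX_KM):
    """WAS of a method: mean per-image score over the n test images."""
    if not distances_km:
        raise ValueError("at least one distance is required")
    return sum(score(d, r_max) for d in distances_km) / len(distances_km)


if __name__ == "__main__":
    # Hypothetical distances (km) between predicted and true locations.
    print(was([0.0, 1.5, 250.0, 6000.0]))
```

Because of the logarithm, an error of 0 km scores 1, an error of R_max scores 0, and the score drops fastest over the first few kilometers, which is how the metric gives more weight to precise estimates.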