<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal image geocoding: the 2013 RECOD's approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lin Tzy Li</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jurandy Almeida</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Otávio A. B. Penatti</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo T. Calumby</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel C. G. Pedronette</string-name>
          <email>daniel@rc.unesp.br</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcos A. Gonçalves</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo da S. Torres</string-name>
          <email>rtorresg@ic.unicamp.br</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science, Federal University of Minas Gerais (UFMG)</institution>
          ,
          <addr-line>Belo Horizonte, MG</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Exact Sciences, University of Feira de Santana (UEFS)</institution>
          ,
          <addr-line>Feira de Santana, BA</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Dept. of Stat., Applied Math. and Computing, Universidade Estadual Paulista (UNESP)</institution>
          ,
          <addr-line>Rio Claro, SP</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institute of Science and Technology, Federal University of Sao Paulo (UNIFESP)</institution>
          ,
          <addr-line>Sao Jose dos Campos, SP</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>RECOD Lab, Institute of Computing, University of Campinas (UNICAMP)</institution>
          ,
          <addr-line>Campinas, SP</addr-line>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Telecommunications Res. &amp; Dev. Center, CPqD Foundation</institution>
          ,
          <addr-line>Campinas, SP</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This work describes the approach used by the RECOD team in the MediaEval Placing Task of 2013, in which we were required to develop an automatic scheme to assign geographical locations to images. Our approach is multimodal, considering textual and visual descriptors, which are combined by a rank aggregation strategy. We estimate the location of test images based on the coordinates of top-ranked images in the list of combined results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Geocoding multimedia material has gained considerable
attention in recent years, given the importance of providing
richer services to users, such as placing information on maps.
Image geocoding is the objective of the Placing Task in 2013,
i.e., participants are required to assign geographical locations
to images. Details about the Placing Task, its dataset, and the
evaluation protocol can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In this paper, we present our multimodal approach, which
combines different textual and visual descriptors by means of
a rank aggregation strategy previously introduced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED APPROACH</title>
      <p>We handled the task of automatically assigning a
geographical location to images using nearest neighbor searches
on aggregated ranked lists, which combine textual and visual
features. The strengths of our approach are its simplicity
and its power to combine multiple description modalities.</p>
      <p>For evaluation purposes in the training phase, we
selected a validation set of 5,000 images from the development
set of around 8.5 million images. First, each photo from the
development set was assigned to a fixed cell of 1-by-1 degree
based on its ground-truth latitude and longitude. Then, the
resulting grid was summarized by the number of photos
(density) in each cell relative to the dataset size. Finally, the
5,000 evaluation images were randomly sampled from the
cells, taking each cell's density into account.</p>
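The cell assignment and density-proportional sampling described above can be sketched as follows (a simplified illustration; the names and the seed are ours, not from the original implementation, and per-cell quotas are rounded, so they only approximately sum to the target size):

```python
import random
from collections import defaultdict

def sample_validation_set(photos, n_samples=5000, seed=42):
    """photos: list of (photo_id, lat, lon) tuples with ground-truth
    coordinates. Assigns each photo to a fixed 1-by-1 degree cell and
    samples from each cell in proportion to its density."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for pid, lat, lon in photos:
        # floor to the enclosing 1-degree grid cell
        cells[(int(lat // 1), int(lon // 1))].append(pid)
    total = len(photos)
    sample = []
    for members in cells.values():
        quota = round(n_samples * len(members) / total)  # density share
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```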
    </sec>
    <sec id="sec-3">
      <title>2.1 Features</title>
      <sec id="sec-3-1">
        <title>Textual</title>
        <p>From textual metadata, we used only the photo tags to
compute similarities between the images. The tags were
stemmed and stopwords were removed. The text similarity
functions used were BM25 and TF-IDF, as implemented by
the Lucene API.</p>
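Our system relies on the Lucene implementations of these functions; purely as an illustration of the idea, a toy TF-IDF similarity over tag lists (stemming and stopword removal omitted) could look like:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of tag lists. Returns one TF-IDF weight dict per doc,
    with idf = log(n / document frequency)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```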
      </sec>
      <sec id="sec-3-2">
        <title>Visual</title>
        <p>
          Given the large dataset, we had to select the descriptors
to be used carefully. Initially, we evaluated some of the
descriptors provided with the dataset, such as the color and
edge directivity descriptor (CEDD), the scalable color
descriptor (SCD), and the Gabor filter. Using the validation set,
we noticed that the best results were achieved by CEDD.
Although SCD showed the best results in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], it did not
perform well for our geocoding approach on our validation set.
        </p>
        <p>
          In addition to CEDD, we used BIC (border/interior
pixel classification). This descriptor was chosen due to its
good results in large-scale experiments [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. For this, we
downloaded the whole photo dataset, resized the images to
at most 100 thousand pixels, as suggested in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for
large-scale experiments, and extracted the 128-dimensional
BIC feature vector of each image. The Manhattan distance
(L1) was used for both BIC and CEDD.
        </p>
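For reference, the L1 comparison of two descriptor vectors is simply:

```python
def l1_distance(u, v):
    """Manhattan (L1) distance between two equal-length feature
    vectors, as used here to compare both BIC and CEDD descriptors."""
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(u, v))
```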
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Rank aggregation</title>
      <p>
        As last year, we used a rank aggregation strategy to
combine different descriptors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This year, due to the size
of the development set, we limited the ranked lists to the
top 1,000 most similar photos for each test image.
      <p>
        We used an aggregation function similar to the sima function
(with numerator m instead of 2) proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When the
intersection of the top-1000 lists computed by different features
is small, the size of the final aggregated list tends to
(m × 1000), where m is the number of features combined. We
select the top-1000 images with the highest aggregated score
as the output of the rank aggregation step.
      </p>
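A minimal sketch of this step, under our reading that each image's aggregated score is m divided by the sum of its (1-based) ranks across the m lists (the exact sima variant is defined in [3]); images absent from a list receive a penalty rank:

```python
def aggregate(ranked_lists, top_k=1000):
    """ranked_lists: m lists of image ids, each ordered from most to
    least similar for one feature. Returns the top_k images by the
    aggregated score m / sum-of-ranks (higher score = better)."""
    m = len(ranked_lists)
    penalty = top_k + 1  # rank assigned to images absent from a list
    rank_maps = [
        {img: r for r, img in enumerate(lst, start=1)}
        for lst in ranked_lists
    ]
    candidates = {img for lst in ranked_lists for img in lst}
    scores = {
        img: m / sum(rm.get(img, penalty) for rm in rank_maps)
        for img in candidates
    }
    return sorted(candidates, key=lambda img: -scores[img])[:top_k]
```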
    </sec>
    <sec id="sec-5">
      <title>2.3 Geocoding</title>
      <p>For geocoding the test images, we used a nearest
neighbor approach. We used the development set (~8.5
million images) as geo-profiles, and each test image was
compared to the whole development set. For these comparisons,
we used each type of feature independently
(textual or visual). For a given test image, the ranked list
of each feature is produced. All the lists are then combined
by our rank aggregation strategy, and the final ranked list is
generated. The lat/long of the first image (the most similar) in
this final list is assigned to the test image.</p>
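Putting the steps together, the 1-NN geocoding can be sketched as below; for simplicity, a plain sum-of-ranks aggregation stands in for our actual aggregation function, and all names are illustrative:

```python
def geocode(per_feature_ranked_lists, coords, top_k=1000):
    """per_feature_ranked_lists: one ranked list of image ids per
    feature (textual or visual). coords: image id -> (lat, lon) of the
    development-set images. Returns the lat/long of the most similar
    image after combining the lists (1-NN assignment)."""
    penalty = top_k + 1
    rank_maps = [
        {img: r for r, img in enumerate(lst, start=1)}
        for lst in per_feature_ranked_lists
    ]
    candidates = {img for lst in per_feature_ranked_lists for img in lst}
    # simplified aggregation: lower total rank = more similar overall
    best = min(candidates,
               key=lambda img: sum(rm.get(img, penalty) for rm in rank_maps))
    return coords[best]
```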
    </sec>
    <sec id="sec-6">
      <title>3. OUR SUBMISSIONS &amp; RESULTS</title>
      <sec id="sec-6-1">
        <title>3.1 Submitted runs</title>
        <p>Our submissions for this year are:
run1: combines 2 textual descriptors: BM25 + TF-IDF;
run2: combines 2 visual descriptors: BIC + CEDD;
run3: one visual descriptor: BIC;
run4: combines 2 textual and 2 visual descriptors: BM25
+ TF-IDF + BIC + CEDD;
run5: combines 4 textual descriptors: BM25 + TF-IDF over
the original tags and their English translations.</p>
        <p>Runs 1 and 5 used only textual features. Thus, for test
images without tags, there was no way to apply our
similarity ranked-list approach. As post-processing, we randomly
selected an item from the development set and transferred its
latitude and longitude to the test image.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.2 Results</title>
      <p>
        Besides the organizers' standard evaluation metric, we
also applied the WAS score we proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This
evaluation metric gives an overview of a method's performance
as a score in the interval [0, 1], 0 being very bad and 1
indicating a perfect estimate, with higher weight assigned
to more precise results. The WAS takes into account every
single result of the whole test set to summarize the level of
precision of an evaluated method as a whole.
      </p>
      <p>Let d(i) be the geographic distance between the predicted
and the ground-truth location of image i. The proposed
score for the result of a given test image i is defined as:
score(i) = 1 - log(1 + d(i)) / log(1 + Rmax), where Rmax is the
maximum distance between any two points on the Earth's
surface (half of the Earth's circumference at the Equator,
i.e., 20,037.5 km).</p>
      <p>Let D be a test dataset with n images whose locations
need to be predicted. The overall score for the predictions of
a method m is defined as: WAS(m) = (1/n) Σ score(i), for
i = 1, ..., n, i.e., the mean of the per-image scores.</p>
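Under the definitions above, WAS reduces to the mean of the per-image scores, e.g.:

```python
import math

R_MAX = 20037.5  # km: half of the Earth's circumference at the Equator

def was(distances_km):
    """Weighted Average Score: the mean over the test set of
    score(i) = 1 - log(1 + d(i)) / log(1 + R_MAX)."""
    scores = [1 - math.log(1 + d) / math.log(1 + R_MAX)
              for d in distances_km]
    return sum(scores) / len(scores)
```

A perfect prediction (d = 0) scores 1, and a prediction at the maximum possible distance scores 0.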
      <p>As we can observe in Table 2, the test runs based solely on
textual information yielded the best results (runs 1, 4, and
5), while those based only on visual descriptors presented
low accuracy. A possible reason is the semantic gap, as
there may be many different places with similar visual
appearance, especially in a dataset as large as the one used for
training. Another potential issue was the large number of
ties in the first positions of the ranked lists of the visual
descriptors; given our 1-NN geocoding approach, this probably
degraded our results. However, we can see that by combining
BIC+CEDD (run 2) we improved on the results of BIC alone
(run 3). The combination of textual and visual descriptors
(run 4) was slightly worse than the textual descriptors in
isolation. One possible reason is the large difference between
the textual and visual results. (For run 5, non-English tags
were translated to English using the Google Translate service
and combined with the original tags.)</p>
      <p>Observe that, for the test set (Table 2), our results were
quite different from those on our validation set (Table 1),
mainly for the visual features. While on the test3 set BIC
achieved less than 1% within the 1 km radius, on the
validation set it reached 15.32%. Because of this, on the
validation set, the fusion (run 4) results improved over run 1.
The huge difference between validation and test results might
be due to a property of the test set not considered when
building the validation set: the users who contributed the
photos in the training set are different from those who
contributed the photos in the test set.</p>
      <p>Regarding the distribution of test results, for the visual
descriptors (runs 2 and 3), the 1st quartile shows that 25%
of the items were geocoded at most 1,900 km from the correct
location. On the other hand, for the textual descriptors and
their combinations (runs 1, 4, and 5), 25% of the items are
very close to their correct locations (less than 3 km).</p>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS</title>
      <p>Our best results were observed for the methods based only
on textual descriptions. With them, we could geocode around
20% of the testing set (test3) within a 1 km radius.
Considering visual descriptors, the main challenge this year
was the large-scale dataset, which imposes time and space
constraints on the descriptors to be used. On the test set, our
rank aggregation strategy was only effective for combining
textual descriptors; combining textual and visual descriptors
did not improve the results. As future work, we would like to
evaluate a more elaborate geocoding approach, similar, for
example, to the scheme used to create our validation set.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We thank the support of FAPESP (2011/11171-5,
2009/10554-8), CNPq (306580/2012-8, 484254/2012-0),
CAPES, FAPEMIG, Samsung, ACM SIGIR, and MediaEval
organizers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Trevisiol</surname>
          </string-name>
          .
          <article-title>Working Notes for the Placing Task at MediaEval</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          , volume
          <volume>1043</volume>
          , October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kelm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmiedeke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>Multimodal geo-tagging in social media websites using hierarchical spatial segmentation</article-title>
          .
          <source>In International Workshop on Location-Based Social Networks</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>39</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C. G.</given-names>
            <surname>Pedronette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. A. B.</given-names>
            <surname>Penatti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. da S.</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <article-title>A multimodal approach for video geocoding</article-title>
          .
          <source>In Working Notes Proc. MediaEval Workshop</source>
          , volume
          <volume>927</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C. G.</given-names>
            <surname>Pedronette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. A.</given-names>
            <surname>Penatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Calumby</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. d. S.</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <article-title>A rank aggregation framework for video multimodal geocoding</article-title>
          .
          <source>Mult. Tools and App.</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O. A. B.</given-names>
            <surname>Penatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Valle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. da S.</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <article-title>Comparative study of global color and texture descriptors for web image retrieval</article-title>
          .
          <source>J. Vis. Comm. and Image Repr.</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>359</fpage>
          -
          <lpage>380</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harchaoui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Towards good practice in large-scale learning for image classification</article-title>
          .
          <source>In CVPR</source>
          , pages
          <fpage>3482</fpage>
          -
          <lpage>3489</lpage>
          ,
          <year>2012</year>
          .
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>