<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimedia Geocoding: The RECOD 2014 Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lin Tzy Li</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Otávio A. B. Penatti</string-name>
          <email>o.penatti@samsung.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jurandy Almeida</string-name>
          <email>jurandy.almeida@unifesp.br</email>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovani Chiachia</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo T. Calumby</string-name>
          <email>rtcalumby@ecomp.uefs.br</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro R. Mendes Júnior</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel C. G. Pedronette</string-name>
          <email>daniel@rc.unesp.br</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo da S. Torres</string-name>
          <email>rtorresg@ic.unicamp.br</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Advanced Technologies, SAMSUNG Research Institute</institution>
          ,
          <addr-line>Campinas, SP</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Brazil</institution>
          ,
          <addr-line>13506-900</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Exact Sciences, University of Feira de Santana (UEFS)</institution>
          ,
          <addr-line>Feira de Santana, BA</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Dept. of Stat., Applied Math. and Computing, Universidade Estadual Paulista (UNESP)</institution>
          ,
          <addr-line>Rio Claro, SP</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institute of Science and Technology, Federal University of São Paulo (UNIFESP)</institution>
          ,
          <addr-line>São José dos Campos, SP</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>RECOD Lab, Institute of Computing, University of Campinas (UNICAMP)</institution>
          ,
          <addr-line>Campinas, SP</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This work describes the approach proposed by the RECOD team for the Placing Task of MediaEval 2014. This task requires the definition of automatic schemes to assign geographical locations to images and videos. Our approach is based on the use of as much evidence as possible (textual, visual, and/or audio descriptors) to geocode a given image/video. We estimate the location of test items by clustering the geographic coordinates of top-ranked items in one or more ranked lists defined in terms of different criteria.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Geocoding multimedia material has gained increasing
attention in recent years given its importance for providing
richer services for users, such as placing information on maps or
supporting geographic searches. The Placing Task at
MediaEval 2014 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] challenges participants to assign geographical
locations to images and videos automatically.
      </p>
      <p>In this paper, we present our approach, which combines
different textual, audio, and/or visual descriptors uniformly by
applying a clustering scheme to merge the information defined
by several ranked lists.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED APPROACH</title>
      <p>The approach is composed of five steps: (i)
image/video feature extraction, (ii) generation of ranked lists,
(iii) re-ranking, (iv) clustering by lat/long of the top-ranked
items (considering one or multiple ranked lists), and (v)
assigning to the test item the lat/long of the sample with the
highest density value.</p>
      <p>For evaluation purposes in the training phase, we created
a validation set by sampling 5,000 images and 1,000 videos from
the development/training set. This set was created as
follows. First, each item in the development set was assigned
to a fixed cell of 1-by-1 degree based on its latitude and
longitude. Then, the resulting grid was summarized by the
number of photos (density) in each cell. Next, we randomly
picked images/videos from each cell in proportion to their
distribution over the original dataset. To keep the
validation step similar in its characteristics to the real
development and testing sets, all items from users who had an
image/video selected for the validation set were removed from
the new training set, creating a subset of the original full
development set with 4,485,331 images and 14,115 videos.
Therefore, to evaluate our strategies before conducting the
final runs, we used the validation set together with the partial
training set created as described above.</p>
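      <p>The sampling procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: the function names, the (item_id, user_id, lat, lon) tuple layout, and the rounding of per-cell quotas are hypothetical choices made for the example.</p>
      <preformat>
```python
import math
import random
from collections import defaultdict


def cell_of(lat, lon):
    """Map a coordinate to its fixed 1-by-1 degree cell (south-west corner)."""
    return (math.floor(lat), math.floor(lon))


def sample_validation(items, n_samples, seed=0):
    """Split items into (validation, training).

    items: list of (item_id, user_id, lat, lon) tuples (assumed layout).
    Each cell contributes a number of items proportional to its density,
    and every item of a user with a sampled item is dropped from training.
    """
    rng = random.Random(seed)

    # Step 1: assign every item to its grid cell.
    cells = defaultdict(list)
    for item in items:
        cells[cell_of(item[2], item[3])].append(item)

    # Steps 2-3: sample from each cell in proportion to its item count.
    total = len(items)
    validation = []
    for members in cells.values():
        quota = round(n_samples * len(members) / total)
        validation.extend(rng.sample(members, min(quota, len(members))))

    # Step 4: remove all items of users that appear in the validation set.
    held_out_users = {item[1] for item in validation}
    training = [item for item in items if item[1] not in held_out_users]
    return validation, training
```
      </preformat>
      <p>Because quotas are rounded per cell, the number of sampled items can differ slightly from the requested size, and sparse cells whose quota rounds to zero contribute no items in this sketch.</p>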
    </sec>
    <sec id="sec-3">
      <title>2.1 Features</title>
      <p>
        Textual. From the textual metadata, the title, description, and
tags of photos/videos were concatenated into one field to
compute text similarities between the test and training items.
The text was stemmed and stopwords were removed. The
text similarity functions used were BM25 and TF-IDF, as
implemented by Lucene<sup>1</sup>. The best results for textual
similarity computation used a training set composed of both
image and video metadata, regardless of the kind of test query.
Audio/Visual. Videos and images were handled
differently. For images, we used the provided CEDD, Gabor, and
FCTH features and extracted additional ones: OverFeat<sup>2</sup> and
BIC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Before extracting additional features from images,
we resized them to at most 100k pixels. For videos, we used
the provided features, GIST (static feature) and MFCC
(audio feature), and additionally extracted the HMP motion feature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
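      <p>As an illustration of the text similarity step, a bare-bones BM25 scorer over already tokenized (stemmed, stopword-free) text is sketched below. It assumes the commonly used parameter values k1 = 1.2 and b = 0.75 and a smoothed IDF; these values, like the function name bm25_scores, are assumptions made for the example, and Lucene's actual implementation differs in its details.</p>
      <preformat>
```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each training document against a test query with BM25.

    query_tokens: list of terms from the test item's title/description/tags.
    docs_tokens: list of term lists, one per training item.
    Returns one score per document, in the same order as docs_tokens.
    """
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized through b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```
      </preformat>
      <p>Sorting the training items by decreasing score yields the textual ranked list for a test item.</p>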
    </sec>
    <sec id="sec-4">
      <title>2.2 Re-ranking, clustering &amp; geocoding</title>
      <p>
        We first used the full development set as geo-profiles, and
each test item was compared to the whole development set
for each feature independently. For a given test item, a
ranked list was produced for each feature. Given the ranked
lists, we explored two strategies:
1. Re-ranking items using the RL-Sim algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It
relies on contextual information encoded in the similarity
between ranked lists. This method exploits the fact that if
two images are similar, their ranked lists should be similar
as well. Therefore, a contextual distance measure is defined
based on the similarity of ranked lists. As the top-n positions
hold more relevant items, we focus on them to define the final
list considering m input ranked lists (m features).
      </p>
      <p>
        We were able to apply the re-ranking algorithm (using
the top n = 15 items of the original ranked lists) only to
the video dataset, due to its small size and to the number
of required inputs for the algorithm.
        <fn id="fn1"><label>1</label><p>http://lucene.apache.org/core/ (as of 10/2014).</p></fn>
        <fn id="fn2"><label>2</label><p>https://github.com/sermanet/OverFeat (as of 09/2014).</p></fn>
2. Clustering lat/long points derived from the top-n items
of ranked lists. Input lists of the clustering method were
defined for a single feature (i.e., one list only), for the
result of the re-ranking of m features, or from a set of m
independent lists associated with m different features. We
used Optimum-Path Forest (OPF) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for clustering the
input list(s) related to a given test sample. OPF created a
graph as follows: for each item s, a node was defined; each
node s was then linked to its k nearest neighbors (k = 3 was
used in all cases). Then, each item/node in the graph
received a density value according to the formula proposed
by Rocha et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The lat/long of the test item was
inherited from the graph's sample/node with the highest density
value. When using m ranked lists generated for m different
descriptors, we combined the top mn items for each ranked
list to create the graph.
      </p>
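      <p>The final geocoding step, picking the highest-density node of the k-nearest-neighbor graph, can be illustrated with the sketch below. It assumes haversine distances between lat/long points and a Gaussian-kernel density averaged over each node's k nearest neighbors, with the kernel width set to one third of the largest neighbor distance. This is a simplified stand-in for the density of Rocha et al. [<xref ref-type="bibr" rid="ref6">6</xref>], not a full OPF implementation, and all function names are hypothetical.</p>
      <preformat>
```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def densest_point(points, k=3):
    """Return the (lat, lon) candidate with the highest k-NN density.

    points: lat/long pairs of the top-ranked items from one or more lists.
    """
    if not points:
        raise ValueError("need at least one candidate point")
    if len(points) == 1:
        return points[0]
    k = min(k, len(points) - 1)

    # Distances from each node to its k nearest neighbors (the graph arcs).
    knn = []
    for i, p in enumerate(points):
        dists = sorted(haversine_km(p, q) for j, q in enumerate(points) if j != i)
        knn.append(dists[:k])

    # Kernel width derived from the longest arc; fall back to 1.0 when
    # all points coincide (assumed convention for this sketch).
    d_max = max(d for ds in knn for d in ds)
    sigma = d_max / 3 or 1.0
    norm = math.sqrt(2 * math.pi * sigma ** 2)

    def density(ds):
        return sum(math.exp(-d * d / (2 * sigma ** 2)) for d in ds) / (norm * len(ds))

    best = max(range(len(points)), key=lambda i: density(knn[i]))
    return points[best]
```
      </preformat>
      <p>With k = 3, isolated candidates far from the others receive low density values, so the estimate tends to fall inside the tightest group of top-ranked items.</p>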
      <p>None of our submissions used extra crawled material or
gazetteers. Based on the configurations that gave our best results on the
evaluation set, our submissions were set up as shown in Table 1.</p>
      <p>For test items that had no lat/long estimate (because of
missing/empty features), we randomly selected an item from
the development set and assigned its latitude and longitude to
the test item. For the text-only runs (Runs 1 &amp; 4),
these cases represented 1.07% of the test items, while for the
visual-only runs (2 &amp; 5) they were 0.02% and 0.03%, respectively.
For the multimodal run (3), there were only 2 such cases. We
also noted that 0.58% of the test images showed Flickr's
message warning that the item was unavailable.</p>
      <p>As we can observe in Table 2, the test run combining
textual and visual information (Run 3) yields the best results
for the smaller precision radii (10 m, 100 m, and 1 km), while
using only textual descriptors via OPF clustering (Run 1)
produces better results from the 10 km precision level on.</p>
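      <p>The precision levels compared above correspond to the fraction of test items whose estimated location falls within a given radius of the ground truth. A minimal sketch follows, assuming the per-item distance errors (in km) have already been computed; the function name and the default radii (taken from the levels named in the text) are illustrative.</p>
      <preformat>
```python
def precision_at_radii(errors_km, radii_km=(0.01, 0.1, 1.0, 10.0)):
    """Fraction of test items located within each radius of the ground truth.

    errors_km: distance (km) between estimated and true location, per item.
    radii_km: precision levels; the defaults are 10 m, 100 m, 1 km, and 10 km.
    """
    if not errors_km:
        raise ValueError("need at least one test item")
    n = len(errors_km)
    return {r: sum(1 for e in errors_km if r >= e) / n for r in radii_km}
```
      </preformat>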
      <p>For the non-textual runs (Run 2 and Run 5), at precision levels
up to 1 km the results using only one visual feature (Run 5)
are slightly higher (by 0.01, 0.00, and 0.03 percentage points)
than those combining different features (Run 2). The opposite is
true when we observe results from 10 km on. It seems that
there was some disagreement between the two combined
visual features that were accommodated by the geocoding
method applied, which affected the precision of the results.</p>
      <p>During the validation stage of the OPF clustering, we
noticed that when textual features are used, the number of
top-n items to be clustered should be lower than when using
only visual features: the textual results
degraded when more points were considered in the clustering
process. For example, the Run 4 (textual) result was derived
from clustering the top-5 points, while Run 5 (visual) was based
on the lat/long of the top-100 items.</p>
      <p>Comparing the results of using re-ranking to combine the visual
features of videos (Run 2) with those of using just the HMP feature (Run 5),
the test results showed that up to the 1 km precision level the fusion
by re-ranking (Run 2) improved the results over using just
one feature (Run 5), but for larger radii it is the other way
around, as shown in Table 3. Considering that we aim to
geocode items as precisely as possible, re-ranking and
clustering strategies have shown promising results.</p>
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSIONS</title>
      <p>In this work, we explored re-ranking and clustering
approaches to geocode multimedia items based on the
similarity of ranked lists. We observed that geocoding results were
influenced by the number of top-n items of a ranked list used
to cluster or re-rank. It seems that textual features require
fewer top-ranked items than visual descriptors.</p>
      <p>
        As future work, we plan to explore further configurations
and approaches using different clustering and re-ranking
strategies. We also plan to combine the strategies used this
year with rank aggregation methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank FAPESP (#2013/08645-0 and #2013/11359-0),
CNPq (306580/2012-8 and 484254/2012-0), CAPES,
Samsung, and the Placing Task organizers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Leite</surname>
          </string-name>
          , and R. da Silva Torres.
          <article-title>Comparison of video sequences with histograms of motion patterns</article-title>
          .
          <source>In ICIP</source>
          , pages
          <fpage>3673</fpage>
          –
          <lpage>3676</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Borth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gottlieb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pearce</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Poland</surname>
          </string-name>
          .
          <article-title>The placing task: A large-scale geo-estimation challenge for social-media videos and images</article-title>
          .
          <source>In ACM GeoMM</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. de O.</given-names>
            <surname>Stehling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Nascimento</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Falcão</surname>
          </string-name>
          .
          <article-title>A compact and efficient image retrieval approach based on border/interior pixel classification</article-title>
          .
          <source>In CIKM</source>
          , pages
          <fpage>102</fpage>
          –
          <lpage>109</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C. G.</given-names>
            <surname>Pedronette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. A. B.</given-names>
            <surname>Penatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Calumby</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. da Silva</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <article-title>A rank aggregation framework for video multimodal geocoding</article-title>
          .
          <source>Mult. Tools and App.</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>37</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. C. G.</given-names>
            <surname>Pedronette</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. da Silva</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <article-title>Image re-ranking and rank aggregation based on similarity of ranked lists</article-title>
          .
          <source>PR</source>
          ,
          <volume>46</volume>
          (
          <issue>8</issue>
          ):
          <fpage>2350</fpage>
          –
          <lpage>2360</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A. M.</given-names>
            <surname>Cappabianco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Falcão</surname>
          </string-name>
          .
          <article-title>Data clustering as an optimum-path forest problem with applications in image analysis</article-title>
          .
          <source>Int J Imag Syst Tech</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>50</fpage>
          –
          <lpage>68</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>