<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgos Kordopatis-Zilos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Popescu</string-name>
          <email>adrian.popescu@cea.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute</institution>
          ,
          <addr-line>CERTH</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>We describe the participation of the CERTH/CEA-LIST team in the MediaEval 2016 Placing Task. We submitted five runs to the estimation-based sub-task: one based only on text, employing a Language Model-based approach with several refinements; one based on visual content, using geospatial clustering over the most visually similar images; and three based on a hybrid scheme exploiting both visual and textual cues from the multimedia items, trained on datasets of different size and origin. The best results were obtained by a hybrid approach trained with external training data and using two publicly available gazetteers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of the task is to estimate the location of 1,497,464
photos and 29,934 videos using a set of 5M geotagged
items and their metadata for training [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. All submitted
runs are built upon the scheme of our last year's
participation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], integrating several refinements. For the text-based
runs, we focused on improving the pre-processing of
metadata of the training set items and refining the feature
selection method. For the visual-based runs, we built a more
generic deep neural network model for enhanced visual
image representation. For the hybrid scheme, we devised a
score for selecting between the text and visual estimations
based on the prediction confidence. To further improve
performance, we built a model using all geotagged items of the
YFCC dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (items uploaded by users in the test set
are not included), and we leveraged structured information
from open geographical resources such as Geonames (http://www.geonames.org/) and
OpenStreetMap (https://www.openstreetmap.org/).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Text-based location estimation</title>
      <p>In the first step, the tags and titles of the training set items
were pre-processed. We applied URL decoding (necessary because text in
different languages was URL-encoded in the released dataset), lowercase
transformation, tokenization and accent removal to
generate a set of terms for every item. Multi-word tags were
further split into their individual components, which were
also included in the item's term set. Finally, symbols and
punctuation in the terms were removed, and terms
consisting only of digits or of fewer than three characters were discarded.</p>
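      <p>The pre-processing steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and it assumes that spaces inside tags and titles are encoded as "+" and that accents can be removed via Unicode decomposition.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata
from urllib.parse import unquote_plus


def normalise(raw):
    """URL-decode, lowercase and strip accents from a raw metadata string."""
    text = unquote_plus(raw).lower()  # assumption: '+' encodes a space
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_term(token):
    """Remove symbols and punctuation, keeping only letters and digits."""
    return re.sub(r"[\W_]+", "", token)


def item_terms(tags, title):
    """Build the term set of one item from its raw tags and title."""
    terms = set()
    for tag in tags:
        tokens = [clean_term(t) for t in normalise(tag).split()]
        tokens = [t for t in tokens if t]
        terms.add(" ".join(tokens))  # the (possibly multi-word) tag itself
        terms.update(tokens)         # its individual components
    terms.update(clean_term(t) for t in normalise(title).split())
    # discard purely numeric terms and terms shorter than three characters
    return {t for t in terms
            if len(t) >= 3 and not t.replace(" ", "").isdigit()}
```
]]></preformat>
      <p>For example, the tags "new+york" and "caf%C3%A9" with the title "Sunset+at+the+pier%21" yield the terms "new york", "new", "york", "cafe", "sunset", "the" and "pier".</p>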
      <p>
        The core of our approach is a probabilistic Language Model
(LM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] built from the terms of the training set items. The
earth surface was divided into (nearly) rectangular cells of
size 0.01° × 0.01° latitude/longitude, and the term-cell
probabilities were computed based on the user count of each term
in each cell. The most likely cell (mlc) of a query is derived
from the summation of the respective term-cell probabilities.
The estimated location of the query items with no textual
information is the centre of the cell with the most users.
      </p>
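      <p>A minimal sketch of such a language model is given below. It assumes that the term-cell probability is the number of distinct users who used the term in the cell divided by the total of these user counts over all cells; the function names and the tuple layout of the training items are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import defaultdict


def cell_of(lat, lon, size=0.01):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))


def build_language_model(items, size=0.01):
    """items: iterable of (user_id, lat, lon, terms).

    Returns {term: {cell: probability}} based on per-cell user counts.
    """
    users = defaultdict(lambda: defaultdict(set))
    for user_id, lat, lon, terms in items:
        cell = cell_of(lat, lon, size)
        for term in terms:
            users[term][cell].add(user_id)
    model = {}
    for term, cells in users.items():
        total = sum(len(u) for u in cells.values())
        model[term] = {cell: len(u) / total for cell, u in cells.items()}
    return model


def most_likely_cell(model, query_terms):
    """Sum the term-cell probabilities of the query terms; return the best cell."""
    scores = defaultdict(float)
    for term in query_terms:
        for cell, p in model.get(term, {}).items():
            scores[cell] += p
    if not scores:
        return None  # no usable text: the caller falls back to a default cell
    return max(scores, key=scores.get)
```
]]></preformat>
      <p>Counting users rather than items keeps a single prolific uploader from dominating the probability of a term in a cell.</p>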
      <p>
        For feature selection, we used a refined version of the
locality metric [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: in our last participation, we computed locality
based on the neighbor users that used the same term in the
same cell. To this end, we utilized a coarse grid (0.1° × 0.1°)
for the calculation, based on which the neighbor users were
assigned to a unique cell, as depicted in Figure 1(a). In
that setting, it was possible that a pair of users were not
assigned to the same cell even if the geodesic distance of their
items was small. To tackle this issue, we now use a grid
of 0.01° × 0.01° and assign users to
multiple cells: instead of assigning a user to a unique cell,
we assign a user to an entire neighborhood of cells, as illustrated
in Figure 1(b). The area highlighted in orange corresponds
to the cells to which both users were assigned. The terms with
non-negative locality score form the selected term set T.
      </p>
      <p>
        The contribution of each term was then weighted based
on its locality and spatial entropy scores. Spatial entropy is
a Gaussian weight function based on the term-cell entropy
of the term [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The two measures are combined to generate
a weight value for every term in T.
      </p>
      <p>
        To ensure more robust performance at fine granularity, we
built an additional LM using a finer grid (0.001° × 0.001°).
Having computed the mlc for both coarse and fine
granularities, we selected the most appropriate estimation: this is
the mlc of the finer grid if it falls within the borders of the
mlc of the coarse grid; otherwise it is the mlc of the coarse one. Finally,
we employed similarity search as in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to derive the location
estimates from the <italic>k<sub>t</sub></italic> = 5 most textually similar images
inside the selected mlc, computing textual similarity as
the Jaccard similarity between the corresponding term sets.
An error case analysis of the text method is presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
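      <p>The choice between the two granularities and the final similarity search can be sketched as follows. The sketch assumes that cells are identified by integer indices (so a fine cell maps to its enclosing coarse cell by integer division by 10) and that the estimate is the centroid of the k most similar items; both are assumptions of this illustration, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_cell(coarse_cell, fine_cell, ratio=10):
    """Prefer the fine-grid cell when it lies inside the coarse-grid cell."""
    if fine_cell is not None:
        parent = (fine_cell[0] // ratio, fine_cell[1] // ratio)
        if parent == coarse_cell:
            return fine_cell, "fine"
    return coarse_cell, "coarse"


def estimate_location(query_terms, cell_items, k=5):
    """cell_items: list of (terms, lat, lon) for training items in the chosen cell.

    Returns the centroid of the k items most similar to the query
    (the centroid is an assumption of this sketch), or None if the cell is empty.
    """
    ranked = sorted(cell_items,
                    key=lambda item: jaccard(query_terms, item[0]),
                    reverse=True)[:k]
    if not ranked:
        return None
    lat = sum(item[1] for item in ranked) / len(ranked)
    lon = sum(item[2] for item in ranked) / len(ranked)
    return (lat, lon)
```
]]></preformat>
      <p>Averaging latitude and longitude directly is only reasonable here because all candidates lie inside one small cell.</p>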
    </sec>
    <sec id="sec-4">
      <title>2.2 Visual-based location estimation</title>
      <p>
        The method is a refined version of the one
used in last year's participation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The main objectives
were (1) to ensure that the visual features are generic
and transferable from a training set independent of YFCC
to the subset of the collection used for the task, and (2) to
provide a compact representation of the features in order to
scale up the visual search process. To meet the first
objective, the VGG architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was fine-tuned with over 5000
diversified man-made and natural POIs, represented by over
7 million images. These were downloaded from Flickr using
queries with (1) the POI name and a radius of 5 km around
its coordinates and (2) the POI name and the associated city
name. Following the conclusions of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] regarding the
uselessness of manual annotation for POI representation, there was
no manual validation of the training set. To meet the
second objective, we used the same procedure as last year and
compressed the initial features (VGG fc7, 4096 dimensions)
to 128 dimensions using PCA. The PCA matrix was learned
on a subset of 250,000 images of the training set.
      </p>
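      <p>The compression step can be illustrated with a small NumPy sketch. It learns the projection with a plain SVD on a sample of feature vectors and is meant only to show the idea; the function names are hypothetical, and a sample of 250,000 vectors with 4096 dimensions would in practice call for a more memory-conscious PCA routine.</p>
      <preformat><![CDATA[
```python
import numpy as np


def fit_pca(features, n_components=128):
    """Learn a PCA projection from a (n_samples, n_dims) feature matrix."""
    mean = features.mean(axis=0)
    # rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components].T


def compress(features, mean, projection):
    """Project features (e.g. 4096-d fc7 vectors) onto the learned components."""
    return (features - mean) @ projection
```
]]></preformat>
      <p>The same mean and projection matrix must be applied to both the training and the query images so that their compressed vectors remain comparable.</p>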
      <p>
        Having calculated the visual similarities between the query image and the
training images, we retrieved the top
<italic>k<sub>v</sub></italic> most visually similar images (in our runs we set <italic>k<sub>v</sub></italic> = 20)
and applied a simple spatial clustering scheme based on their
geographical distance. We defined a confidence metric for
our visual approach based on the size of the largest cluster:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[\mathrm{conf}_v(i) = \max\left(\frac{n(i) - n_t}{k_v - n_t},\ 0\right)]]></tex-math>
        </disp-formula>
where <italic>n</italic>(<italic>i</italic>) is the number of neighbors in the largest
cluster for query image <italic>i</italic>, and <italic>n<sub>t</sub></italic> is the configuration parameter that
determines the "strictness" of the confidence score. The
confidence score takes values in the range [0, 1]. We empirically
set <italic>n<sub>t</sub></italic> = 5. Our visual approach is not designed for video
analysis, thus all videos were placed in the centre of London,
which is the densest geotagged region in the world.
      </p>
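      <p>The confidence score of Equation 1 can be sketched together with one possible spatial clustering of the retrieved neighbors. The text only states that a simple distance-based clustering was used, so the greedy scheme, the 1 km radius and the haversine distance below are assumptions of this illustration, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def largest_cluster(points, radius_km=1.0):
    """Greedy clustering of a non-empty list of (lat, lon) points.

    A point joins the first cluster whose seed lies within radius_km;
    otherwise it starts a new cluster (assumed scheme, not from the paper).
    """
    clusters = []
    for p in points:
        for cluster in clusters:
            if haversine_km(cluster[0], p) <= radius_km:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return max(clusters, key=len)


def visual_confidence(n_largest, k_v=20, n_t=5):
    """Equation 1: max((n(i) - n_t) / (k_v - n_t), 0)."""
    return max((n_largest - n_t) / (k_v - n_t), 0.0)
```
]]></preformat>
      <p>With the reported settings (20 neighbors, threshold 5), the score is 0 whenever the largest cluster holds five or fewer neighbors and reaches 1 only when all 20 neighbors fall into one cluster.</p>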
    </sec>
    <sec id="sec-5">
      <title>2.3 Hybrid location estimation</title>
      <p>The hybrid approach comprises a set of rules that
determine the source of estimation between the text and visual
approaches. First, for query images for which no estimation
could be produced by the text-based approach, the location
was estimated based on the visual approach. Otherwise, if
the visual estimation fell inside the borders of the mlc
calculated by the text-based approach, the visual
estimation was selected. If not, the estimation was determined by
comparing the confidence scores of the two approaches.</p>
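      <p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[G_h(i) = \begin{cases} G_v(i) & \text{if } \mathrm{conf}_t(i) \le \mathrm{conf}_v(i) \\ G_t(i) & \text{otherwise} \end{cases}]]></tex-math>
        </disp-formula>
where <italic>G<sub>h</sub></italic>, <italic>G<sub>t</sub></italic> and <italic>G<sub>v</sub></italic> are the estimated locations for query
item <italic>i</italic> of the hybrid, textual and visual approach,
respectively, conf<sub>t</sub> is the confidence score of the text-based
estimation and is defined in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and conf<sub>v</sub> is the confidence score
of the visual-based estimation (Equation 1).
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Geolocation precision (%) at five ranges and median distance error. Only the RUN-4 and RUN-5 rows of one panel and the panel titles "(a) Images" and "(b) Videos" are recoverable.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run</th>
              <th>P@10m</th>
              <th>P@100m</th>
              <th>P@1km</th>
              <th>P@10km</th>
              <th>P@100km</th>
              <th>Median error</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>RUN-4</td>
              <td>0.69</td>
              <td>7.89</td>
              <td>25.53</td>
              <td>43.89</td>
              <td>51.2</td>
              <td>68</td>
            </tr>
            <tr>
              <td>RUN-5</td>
              <td>0.71</td>
              <td>8.19</td>
              <td>26.16</td>
              <td>43.62</td>
              <td>50.44</td>
              <td>85</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>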
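      <p>The hybrid rules can be summarised in a short sketch. It assumes that the visual estimate is chosen when the text confidence is less than or equal to the visual confidence (the comparison operator is a reading of the selection rule, not a confirmed detail), and the function and parameter names are hypothetical.</p>
      <preformat><![CDATA[
```python
def hybrid_estimate(text_est, visual_est, conf_t, conf_v, visual_in_text_cell):
    """Choose between the text-based and visual-based location estimates.

    text_est / visual_est: (lat, lon) tuples; text_est is None when the
    text-based approach produced no estimate.
    visual_in_text_cell: True if visual_est lies inside the text-based mlc.
    """
    if text_est is None:
        return visual_est          # rule 1: no text estimate available
    if visual_in_text_cell:
        return visual_est          # rule 2: visual estimate refines the mlc
    # rule 3: compare confidences (assumed: ties go to the visual estimate)
    return visual_est if conf_t <= conf_v else text_est
```
]]></preformat>
      <p>Because videos receive no visual analysis, this rule can only change the outcome for images.</p>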
    </sec>
    <sec id="sec-6">
      <title>3. RUNS AND RESULTS</title>
      <p>The submitted runs include one text-based (RUN-1), one
visual-based (RUN-2) and three hybrid runs (RUN-3, RUN-4,
RUN-5). For the first three runs, the system was trained
on the set released by the organizers. In RUN-4 and
RUN-5, the training set consisted of all YFCC items excluding
those contributed by users appearing in the test set. We also
report the results of an external run (RUN-E), based on
the visual approach but using the full geotagged subset of
YFCC. The results for RUN-E show that adding more training
data significantly improves visual geolocation, especially for
short ranges (10m and 100m), where this run outperforms
even the best hybrid run.</p>
      <p>To explore the impact of external data sources, in RUN-5,
we further leveraged structured data from Geonames and
OpenStreetMap. In particular, we used the geotagged
entries of the two sources as additional training items for
building the text-based LM: from Geonames we used a list of city
names along with their alternative names, while from
OpenStreetMap a list of nodes (points of interest) provided they
were associated with an address. Since training items need
to be associated with a contributor, we considered Geonames
and OpenStreetMap as the two contributing users.</p>
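      <p>One way to turn gazetteer entries into training items is sketched below. The entry layout (a list of names plus a coordinate) and the function name are hypothetical; the only element taken from the description above is that every entry of a source is attributed to a single pseudo-user.</p>
      <preformat><![CDATA[
```python
def gazetteer_items(entries, source):
    """Convert gazetteer entries into training items.

    entries: iterable of (names, lat, lon), where names holds the primary
    and alternative names of a place (assumed layout).
    source: pseudo-user id, e.g. "geonames" or "openstreetmap".
    Returns a list of (user_id, lat, lon, terms) tuples.
    """
    items = []
    for names, lat, lon in entries:
        terms = {name.strip().lower() for name in names if name.strip()}
        if terms:
            items.append((source, lat, lon, terms))
    return items
```
]]></preformat>
      <p>Since term-cell probabilities are computed from user counts, attributing a whole gazetteer to one user means that it adds at most one user to any term-cell pair, so it supplements rather than overwhelms the user-generated data.</p>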
      <p>According to Table 1, the best performance at fine
granularities (≤1km) was attained by RUN-5 for both images and
videos. RUN-4 reported the best results in terms of median
distance error and precision at coarse granularities (&gt;1km).
Comparing the two runs, one may conclude that
leveraging structured geographic information improves geolocation
precision in short ranges (reaching 8.27% and 28.54% in
P@100m and P@1km respectively), with a minor increase
in median error. Moreover, the combination of visual and
textual features (RUN-3) improved the overall performance of
the system in the case of images, but had no effect on video
geotagging (since no visual information was used from videos).</p>
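      <p>The reported measures, precision within a range and median distance error, can be computed as in the following sketch. It uses the haversine great-circle distance as a stand-in; the official evaluation may use a different geodesic formula, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def precision_at(errors_km, radius_km):
    """Percentage of items whose error is within radius_km."""
    hits = sum(1 for e in errors_km if e <= radius_km)
    return 100.0 * hits / len(errors_km)


def evaluate(estimates, truths):
    """Return P@10m ... P@100km (in %) and the median error (in km)."""
    errors = [haversine_km(e, t) for e, t in zip(estimates, truths)]
    ranges = {"P@10m": 0.01, "P@100m": 0.1, "P@1km": 1.0,
              "P@10km": 10.0, "P@100km": 100.0}
    report = {name: precision_at(errors, r) for name, r in ranges.items()}
    report["median_km"] = median(errors)
    return report
```
]]></preformat>
      <p>The median is informative alongside the precision values because a few very distant misplacements (for example, items defaulted to a single fixed location) would dominate a mean error.</p>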
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is supported by the REVEAL and USEMP
projects, partially funded by the European Commission
under contract numbers 610928 and 611596 respectively.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          .
          <article-title>The placing task at mediaeval 2016</article-title>
          . In MediaEval 2016 Placing Task,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kordopatis-Zilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Geotagging social media content with a refined language modelling approach</article-title>
          .
          <source>In PAISI 2015</source>
          , pages
          <fpage>21</fpage>
          –
          <lpage>40</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kordopatis-Zilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>In-depth exploration of geotagging performance using sampling strategies on yfcc100m</article-title>
          .
          <source>In Proceedings of the MMCommons</source>
          <year>2016</year>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kordopatis-Zilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          . CERTH/CEA LIST at MediaEval placing task
          <year>2015</year>
          . In MediaEval 2015 Placing Task,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>Cea list's participation at mediaeval 2013 placing task</article-title>
          .
          <source>In MediaEval 2013 Placing Task</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          , E. Gadeski, and
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Borgne</surname>
          </string-name>
          .
          <article-title>Scalable domain adaptation of convolutional neural networks</article-title>
          .
          <source>CoRR, abs/1512.02013</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          et al.
          <article-title>The new data and new challenges in multimedia research</article-title>
          . CoRR, abs/1503.01817
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Finding locations of Flickr resources using language models and similarity search</article-title>
          .
          <source>ICMR '11</source>
          , pages
          <fpage>48:1</fpage>
          –
          <lpage>48:8</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>