<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgos Kordopatis-Zilos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Popescu</string-name>
          <email>adrian.popescu@cea.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute</institution>
          ,
          <addr-line>CERTH</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We describe the participation of the CERTH/CEA LIST team in the Placing Task of MediaEval 2015. We submitted five runs in total to the Locale-based placing sub-task, providing the estimated locations for the test set released by the organisers. Of the five runs, two are based solely on textual information, using feature selection and weighting methods over an existing language model-based approach. One is based on visual content, using geo-spatial clustering over the most visually similar images, and two are hybrid, using both visual and textual cues from the images. The best results (median error 22km, 27.5% at 1km) were obtained when visual and textual features were combined and external data was used for training.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of the task is to produce location estimates for
a set of 931,573 photos and 18,316 videos using a set of
4.7M geotagged items and their metadata for training [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
For the tag-based runs, we built upon the scheme of our
2014 participation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a number of recent extensions on
it [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], focusing on improved feature selection and feature
weighting. For the visual-based location estimation, we use
a geospatial clustering scheme of the most visually similar
images for every query image. A hybrid scheme combines
the textual and visual approaches. To
further improve the model, we also built it using all
geotagged metadata from the YFCC dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], after removing
all images from the users contained in the test set.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Tag-based location estimation</title>
      <p>
        In last year's approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (baseline), the
earth's surface is divided into (nearly) rectangular cells with a side length of
0.01° of latitude/longitude (approximately 1 km² near the
equator). We construct a Language Model (LM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], i.e. a
tag-cell probability map, by processing the tags and titles
of the training set images. The tag-cell probabilities are
computed based on the user count of each tag in each cell.
Then, the Most Likely Cell (MLC) of a query (test) image is
derived from the summation of the respective tag-cell
probabilities. The contribution of each tag is weighted based on
its spatial entropy through a Gaussian weight function [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
which is referred to as the Spatial Entropy (SE) function.
      </p>
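      <p>The language model described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the input layout, and the normalisation of user counts over all cells of a tag are assumptions made for the example.</p>
      <preformat>
```python
import math
from collections import defaultdict

CELL_SIDE = 0.01  # degrees; the coarse grid described in the text


def cell_of(lat, lon, side=CELL_SIDE):
    """Map a coordinate to the integer index pair of its grid cell."""
    return (math.floor(lat / side), math.floor(lon / side))


def build_language_model(items):
    """Build a tag-cell probability map.

    items: iterable of (user_id, lat, lon, tags).
    Assumption: the probability of a cell given a tag is the number of
    distinct users of the tag in that cell, divided by the sum of these
    counts over all cells.
    """
    users = defaultdict(lambda: defaultdict(set))
    for user_id, lat, lon, tags in items:
        cell = cell_of(lat, lon)
        for tag in set(tags):
            users[tag][cell].add(user_id)
    model = {}
    for tag, cells in users.items():
        total = sum(len(u) for u in cells.values())
        model[tag] = {cell: len(u) / total for cell, u in cells.items()}
    return model


def most_likely_cell(model, tags, weights=None):
    """Sum the (optionally weighted) tag-cell probabilities; return the best cell."""
    scores = defaultdict(float)
    for tag in set(tags):
        w = 1.0 if weights is None else weights.get(tag, 1.0)
        for cell, p in model.get(tag, {}).items():
            scores[cell] += w * p
    if not scores:
        return None  # none of the query tags is known to the model
    return max(scores, key=scores.get)
```
      </preformat>
      <p>The optional weights argument is where a per-tag weight, such as the Spatial Entropy weight, would enter the summation.</p>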
      <p>
        To ensure more reliable prediction at finer granularities,
we built an additional LM using a finer grid (cell side length
of 0.001°). Having computed the MLCs for both the coarse
and fine granularity, we apply an Internal Grid technique [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
as a means to produce more accurate, yet equally reliable
location estimates. This is achieved by first selecting the
most appropriate granularity (the finer grid cell if considered
reliable, otherwise the coarser grid cell), and then producing
the location estimate based on the center-of-gravity of the
k most textually similar images inside the selected MLC
(k = 5), by employing Similarity Search as in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
textual similarity is computed using the Jaccard similarity
of the corresponding sets of tags.
      </p>
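      <p>The final step, Jaccard-based similarity search followed by a center-of-gravity estimate, might look like the sketch below. The function names and the candidate layout are hypothetical, and the center of gravity is taken here as the unweighted mean of the coordinates, which is an assumption that is only reasonable within a small cell.</p>
      <preformat>
```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity of two tag collections, treated as sets."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 0.0


def estimate_location(query_tags, candidates, k=5):
    """Estimate a location from the k most textually similar images.

    candidates: list of (lat, lon, tags) for the training images that
    fall inside the selected Most Likely Cell.
    """
    ranked = sorted(candidates,
                    key=lambda c: jaccard(query_tags, c[2]),
                    reverse=True)
    top = ranked[:k]
    if not top:
        return None
    # Assumption: unweighted mean of the coordinates.
    lat = sum(c[0] for c in top) / len(top)
    lon = sum(c[1] for c in top) / len(top)
    return lat, lon
```
      </preformat>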
      <sec id="sec-3-1">
        <title>2.1.1 Feature Selection</title>
        <p>To increase the robustness of the model and reduce its size,
feature selection was performed based on two measures: the
accuracy and the locality of the tags.</p>
        <p>
          Accuracy is computed using the cross-validation scheme
proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The training set is partitioned into p folds
(here, p = 10). Subsequently, one partition at a time is
withheld, and the remaining p − 1 partitions are used to build the
LM. Having built the LM, the location of every item of the
withheld partition is estimated. The accuracy of a tag is
computed based on Equation 1.
        </p>
        <p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>tgeo(t) = \frac{N_r}{N_t}</tex-math>
          </disp-formula>
        </p>
        <p>where tgeo(t) is the accuracy score of tag t, N<sub>r</sub> is the
total number of correctly geotagged items tagged with t and
N<sub>t</sub> is the total number of items tagged with t. The tags with
non-zero accuracy score form a tag set denoted as T<sub>a</sub>.</p>
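        <p>A minimal sketch of the accuracy score follows, assuming the held-out items of all folds have already been geotagged and marked as correct or not. The criterion for a correct estimate is not restated here, so it is passed in as a flag; the function names are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def tag_accuracy(items):
    """Compute the ratio of correctly geotagged items per tag.

    items: iterable of (tags, is_correct) pairs collected from the
    withheld partitions of the cross-validation.
    """
    total, correct = Counter(), Counter()
    for tags, is_correct in items:
        for tag in set(tags):
            total[tag] += 1
            if is_correct:
                correct[tag] += 1
    return {tag: correct[tag] / total[tag] for tag in total}


def accurate_tags(scores):
    """Keep the tags whose accuracy score is non-zero."""
    return {tag for tag, score in scores.items() if score > 0}
```
        </preformat>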
        <p>Locality captures the spatial awareness of tags. For every
individual tag, the locality score is calculated based on the
tag frequency and the neighboring users that have used it in the
various cells. Every time a user uses a given tag, that user
is assigned to the respective location cell. As a result, each
cell has a set of users that have been assigned to it. All users
assigned to the same cell are considered neighbors (for that
particular cell). The locality score is then computed
by Equation 2.</p>
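        <p>
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>loc(t) = N_t \cdot \frac{\sum_{c \in C} \sum_{u \in U_{t,c}} |\{u' \mid u' \in U_{t,c},\ u' \neq u\}|}{N_t^2}</tex-math>
          </disp-formula>
        </p>
        <p>where loc(t) is the locality score of tag t, N<sub>t</sub> is the total number of
occurrences of t, C denotes all cells and U<sub>t,c</sub> denotes the set
of users that used tag t inside cell c. Since all users in U<sub>t,c</sub>
are neighbors, each of them has |U<sub>t,c</sub>| − 1 neighbors and Equation 2 can be simplified to:</p>
        <p>
          <disp-formula>
            <tex-math>loc(t) = \frac{\sum_{c \in C} |U_{t,c}| \, (|U_{t,c}| - 1)}{N_t}</tex-math>
          </disp-formula>
        </p>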
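        <p>The locality score can be sketched as below, using the observation that every user in a cell has one neighbor fewer than the number of users in that cell. The input layout (one record per tag use) and the function names are assumptions made for illustration.</p>
        <preformat>
```python
from collections import defaultdict


def group_users(assignments):
    """Collect, per tag, the set of users in each cell and the use count.

    assignments: iterable of (tag, cell, user_id), one record per tag use.
    """
    users = defaultdict(lambda: defaultdict(set))
    counts = defaultdict(int)
    for tag, cell, user_id in assignments:
        users[tag][cell].add(user_id)
        counts[tag] += 1
    return users, counts


def locality(assignments):
    """Locality per tag: total neighbor count over all cells, divided by N_t."""
    users, counts = group_users(assignments)
    scores = {}
    for tag, cells in users.items():
        # Each of the n users in a cell has n - 1 neighbors.
        neighbor_pairs = sum(len(u) * (len(u) - 1) for u in cells.values())
        scores[tag] = neighbor_pairs / counts[tag]
    return scores
```
        </preformat>
        <p>A tag used by many users inside few cells scores high, while a tag whose users are each alone in their cell scores zero.</p>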
        <p>Since the locality metric is sensitive to tag frequency, we
consider it inappropriate for weighting tags directly.
Instead, having computed the locality scores for every
tag in T, we sort the tags by score and calculate
their weights from their position in the resulting distribution:</p>
        <p>
          <disp-formula id="eq3">
            <label>(3)</label>
            <tex-math>w_l = \frac{|T| - (j - 1)}{|T|}</tex-math>
          </disp-formula>
        </p>
        <p>where w<sub>l</sub> is the weight of the tag t at the j-th
position in the distribution and |T| is the total number of tags
contained in T. This weighting approach returns values in
the range (0, 1]. To combine the two weighting functions,
we normalize the values of the Spatial Entropy weighting
function, denoted w<sub>se</sub>, and use Equation 4 to compute
the final weights.</p>
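        <p>
          <disp-formula id="eq4">
            <label>(4)</label>
            <tex-math>w = \omega \cdot w_{se} + (1 - \omega) \cdot w_l</tex-math>
          </disp-formula>
        </p>
        <p>The value of ω was set to 0.2 through empirical assessment
on a sample of 10K images.</p>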
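        <p>The rank-based locality weight and its combination with the Spatial Entropy weight can be illustrated as follows. The sketch assumes that tags are sorted by descending locality, so that the most local tag receives weight 1, and that the Spatial Entropy weights are already normalized; the function names are hypothetical.</p>
        <preformat>
```python
def rank_weights(locality_scores):
    """Turn locality scores into rank-based weights in the range (0, 1].

    Assumption: descending order, so the tag at position j = 1 has the
    highest locality. Ties keep their input order.
    """
    ranked = sorted(locality_scores, key=locality_scores.get, reverse=True)
    n = len(ranked)
    # enumerate is 0-based, so index equals j - 1 in the weight formula.
    return {tag: (n - index) / n for index, tag in enumerate(ranked)}


def combine_weights(w_se, w_l, omega=0.2):
    """Linear combination of normalized Spatial Entropy and locality weights."""
    return {tag: omega * w_se[tag] + (1 - omega) * w_l[tag]
            for tag in w_l if tag in w_se}
```
        </preformat>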
      </sec>
      <sec id="sec-3-2">
        <title>2.1.3 Confidence</title>
        <p>To evaluate the confidence of the estimation of each query
image, we use the confidence measure of Equation 5.</p>
        <p>
          <disp-formula id="eq5">
            <label>(5)</label>
            <tex-math>conf(i) = \frac{\sum_{c \in C} \{\, p(c|i) \mid dist(c, mlc) &lt; l \,\}}{\sum_{c \in C} p(c|i)}</tex-math>
          </disp-formula>
        </p>
        <p>where conf(i) is the confidence for query image i, p(c|i) is the
cell probability of cell c for image i, dist(c<sub>1</sub>, c<sub>2</sub>) is the distance
between the centers of cells c<sub>1</sub> and c<sub>2</sub>, mlc stands for the
Most Likely Cell and l is a distance threshold.</p>
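        <p>A possible reading of the confidence measure is sketched below: the share of the total cell probability that lies within a given distance of the Most Likely Cell. The haversine distance, the threshold value and all names are assumptions made for the example.</p>
        <preformat>
```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def confidence(cell_probs, centers, mlc, threshold_km):
    """Probability mass near the Most Likely Cell over the total mass.

    cell_probs: {cell: probability of the cell for the query image}
    centers:    {cell: (lat, lon) of the cell center}
    """
    total = sum(cell_probs.values())
    if total == 0:
        return 0.0
    near = sum(p for cell, p in cell_probs.items()
               if threshold_km > haversine_km(centers[cell], centers[mlc]))
    return near / total
```
        </preformat>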
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Visual-based location estimation</title>
      <p>
        We compute visual-based location estimates with CNN
features adapted for the tourist domain using approximately
1000 Points Of Interest (POIs) for training, with
approximately 1200 images per POI, that were fed directly to Caffe
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These features were computed by fine-tuning the VGG
model proposed at ILSVRC 2014 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The outputs of the fc7
layer (4096 dimensions) were compressed to 128 dimensions using a PCA
matrix learned from a subset of 250,000 images of the CNN
training set and used to compute image similarities. CNN
features were selected after a favorable comparison against
compact VLAD features of similar size [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and against SURF
features of significantly larger size [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Having calculated
these similarities, we retrieve the top k most visually similar
images and use their locations to produce the estimate. In
the visual-only run (RUN-2), k = 20 and we apply a simple
incremental spatial clustering scheme: if the j-th
image (out of the k most similar) is within 1km of the
closest one of the previous j − 1 images, it is assigned to that image's
cluster; otherwise it forms its own cluster. In the end, the
largest cluster (or the first in case of equal size) is selected
and its centroid is used as the location estimate.
      </p>
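      <p>The incremental spatial clustering step could be sketched as follows. The haversine distance, the mean-of-coordinates centroid, and the reading of "first" as the earliest created cluster are assumptions; the function names are hypothetical.</p>
      <preformat>
```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_estimate(locations, radius_km=1.0):
    """Estimate a location from the k most similar images.

    locations: (lat, lon) pairs ordered from most to least similar.
    """
    if not locations:
        return None
    labels = []    # cluster index of each image processed so far
    clusters = []  # clusters in creation order, each a list of points
    for j, point in enumerate(locations):
        best, best_d = None, None
        for i in range(j):  # find the closest previously seen image
            d = haversine_km(point, locations[i])
            if best_d is None or best_d > d:
                best, best_d = i, d
        if best is not None and radius_km >= best_d:
            labels.append(labels[best])
            clusters[labels[best]].append(point)
        else:
            labels.append(len(clusters))
            clusters.append([point])
    # max() returns the first maximal element, i.e. the earliest cluster on ties.
    largest = max(clusters, key=len)
    lat = sum(p[0] for p in largest) / len(largest)
    lon = sum(p[1] for p in largest) / len(largest)
    return lat, lon
```
      </preformat>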
      <p>Table 1: results per run for the measures acc(1m), acc(10m), acc(100m), acc(1km), acc(10km), acc(100km), acc(1000km) and median error (km).</p>
      <p><bold>2.3 Hybrid location estimation</bold></p>
      <p>
        For the hybrid approach, we build an LM using the scheme
described in Section 2.1. To achieve further improvement at
finer granularities with the Similarity Search
approach, the similarity between two images is derived from the
combination of the visual and textual similarities. To this
end, we normalize the visual similarities to the range [0, 1].
The final similarity for a pair of images is computed as the
arithmetic mean of the two similarities. We then retrieve
the top k = 5 most similar images, within the borders
specified by the Internal Grid technique [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and we use their
center-of-gravity as the final location estimate.
      </p>
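      <p>The combined similarity can be illustrated with the short sketch below. The normalization method is not specified in the text, so min-max scaling over the candidate set is assumed here, and the function names are hypothetical.</p>
      <preformat>
```python
def min_max(values):
    """Scale values to the range [0, 1]; assumption: min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_similarities(textual, visual):
    """Arithmetic mean of textual and normalized visual similarities.

    textual, visual: equally long lists, one entry per candidate image.
    """
    normalized = min_max(visual)
    return [(t + v) / 2 for t, v in zip(textual, normalized)]


def top_k(candidates, similarities, k=5):
    """Return the k candidates with the highest combined similarity."""
    order = sorted(range(len(candidates)),
                   key=lambda i: similarities[i],
                   reverse=True)
    return [candidates[i] for i in order[:k]]
```
      </preformat>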
      <p>For test images for which no estimate can be produced
based on the LM, or for which the confidence is lower than 0.02 (which
together amount to approximately 10% of the test set), we
use the visual approach to produce the estimate.</p>
    </sec>
    <sec id="sec-5">
      <title>3. RUNS AND RESULTS</title>
      <p>We prepared two tag-based (RUN-1, RUN-4), one visual
(RUN-2) and two hybrid runs (RUN-3, RUN-5). Runs 1-3 used
the training set released by the organisers; in Runs 4-5, the
entire YFCC dataset was used, excluding all images from
users that appeared in the test set. All runs contained
estimates for the full test set (949,889 items).</p>
      <p>According to Table 1, the best performance in terms of
both median error and accuracy in all ranges was attained
by RUN-5. Comparing the corresponding runs with different
training sets, one may conclude that the use of an extended
training set (that does not contain user-specific information)
had considerable impact on the accuracy results across all
ranges. Furthermore, the combination of visual
and textual features in RUN-5 further improved the overall
performance (reaching 7.83% accuracy for the &lt;100m range) and
minimized the median error (22km). The visual-only run
(RUN-2) obtained remarkable results (reaching 5.19% accuracy
for the &lt;1km range).</p>
      <p>In the future, we plan to look deeper into different
weighting schemes in an attempt to achieve further improvements.
Moreover, we plan to develop more sophisticated clustering
models for the visual-only runs.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the REVEAL and USEMP
projects, partially funded by the European Commission
under contract numbers 610928 and 611596 respectively.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          .
          <article-title>The placing task at mediaeval 2015</article-title>
          . In MediaEval 2015 Placing Task,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>The 2014 ICSI/TU Delft location estimation system</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Catalunya, Spain, October 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kordopatis-Zilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Orfanidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Socialsensor at mediaeval placing task 2014</article-title>
          . In MediaEval 2014 Placing Task,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kordopatis-Zilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Geotagging social media content with a refined language modelling approach</article-title>
          .
          <source>In Intelligence and Security Informatics</source>
          , pages
          <fpage>21</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>Cea list's participation at mediaeval 2013 placing task</article-title>
          .
          <source>In MediaEval 2013 Placing Task</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
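          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .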
          <article-title>A comprehensive study over vlad and product quantization in large-scale image retrieval</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Poland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Borth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>The new data and new challenges in multimedia research</article-title>
          . CoRR, abs/1503.01817
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Finding locations of Flickr resources using language models and similarity search</article-title>
          .
          <source>ICMR '11</source>
          , pages
          <fpage>48:1</fpage>
          -
          <lpage>48:8</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>