<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgos Kordopatis-Zilos</string-name>
          <email>gkordopa@auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleftherios Spyromitros-Xioufis</string-name>
          <email>espyromi@iti.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas L. Symeonidis</string-name>
          <email>asymeon@iti.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <email>ikom@iti.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electrical Engineering Dept., Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We describe the participation of the CERTH team in the Placing task of MediaEval 2013. We submitted five runs on the full test set: two based on tag information, two on visual content, and one using both tag and visual information. Our best performance (median error of 650 km) was achieved with tag features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of the task is to produce location estimates for
a set of 262,000 images using a set of over 8.5 million
geotagged images and their metadata for training. One may
find more details regarding the challenge and the dataset in
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For the tag-based runs, we built upon the scheme of
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], making use of a two-level LDA scheme [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to filter out
non-geographic terms. For the visual-based runs, we relied
on a simple Nearest Neighbour scheme using SURF+VLAD
features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and an efficient indexing scheme for very fast
retrieval [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our hybrid run combined a tag and a visual
run using a simple fall-back scheme. All models were built
solely on the training data provided by the organizers (i.e.
no external gazetteers or Internet data were used).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACHES</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Placing images using tags</title>
      <p>The tag-based method relies on an offline analysis, in
which a complex geographical-tag model is built from the
tags and locations of the approximately 8.5 million images
of the training set. The implemented approach comprises
three steps.</p>
      <p>A. Filtering: In this step, we aim to remove noisy and
irrelevant tags from the training data. We therefore
remove machine-tags from all images and then discard
those training images that are left with no tags. We end up
with 7,266,903 images.</p>
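      <p>As an illustration only, the filtering step could be sketched in Python as follows. The machine-tag pattern (namespace:predicate=value), the function names and the toy data are assumptions of this sketch and are not taken from the submitted system.</p>
      <preformat><![CDATA[
```python
import re

# Assumed machine-tag syntax: namespace:predicate=value (e.g. "geo:lat=40.63").
MACHINE_TAG = re.compile(r"^[a-z][a-z0-9_]*:[a-z][a-z0-9_]*=.*$", re.IGNORECASE)


def strip_machine_tags(tags):
    """Return the tags that do not look like machine tags."""
    return [t for t in tags if not MACHINE_TAG.match(t)]


def filter_training_set(images):
    """Drop machine tags from every image, then drop images left without tags.

    `images` maps a (hypothetical) image id to its list of tags.
    """
    cleaned = {img_id: strip_machine_tags(tags) for img_id, tags in images.items()}
    return {img_id: tags for img_id, tags in cleaned.items() if tags}


# Toy training set: img2 has only a machine tag, img3 has no tags at all.
training = {
    "img1": ["thessaloniki", "geo:lat=40.63", "whitetower"],
    "img2": ["geo:lon=22.94"],
    "img3": [],
}
filtered = filter_training_set(training)
```
]]></preformat>
      <p>In this toy example only img1 survives, with its two non-machine tags.</p>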
      <p>B. Spatial clustering and local LDA: Here, we first
cluster the training set images based on their location, using
k-means on their latitude-longitude values. We opted for
K = 5000 clusters (areas), so that on average each area would
contain approximately 1450 images. For each such area, we
then apply LDA to derive a local topic distribution, using
100 topics and 20 terms per topic. We denote an area j as
<inline-formula><tex-math>A_j = \{u_j, \{w_j\}, \{\tau_{jk}\}\}</tex-math></inline-formula>, where <inline-formula><tex-math>u_j</tex-math></inline-formula> is the area id, <inline-formula><tex-math>w_j</tex-math></inline-formula> is the
set of images belonging to the area, and <inline-formula><tex-math>\tau_{jk}</tex-math></inline-formula> is the k-th topic
(set of terms) of the local distribution. LDA was run with the JGibbLDA
implementation, available at http://jgibblda.sourceforge.net/.</p>
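      <p>The spatial clustering can be illustrated with a minimal pure-Python k-means (Lloyd's algorithm) on latitude-longitude pairs. This is a sketch under stated assumptions: the function name, the fixed iteration count, the random initialisation and the six toy points are hypothetical, and the local LDA step is not shown.</p>
      <preformat><![CDATA[
```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on (lat, lon) pairs with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, (lat, lon) in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: (lat - centroids[c][0]) ** 2 + (lon - centroids[c][1]) ** 2,
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment


# Toy data: three points near Thessaloniki, three near Athens.
points = [
    (40.63, 22.94), (40.64, 22.95), (40.62, 22.93),
    (37.98, 23.72), (37.97, 23.73), (37.99, 23.71),
]
centroids, assignment = kmeans(points, k=2)
```
]]></preformat>
      <p>With K = 5000 and the 7,266,903 filtered images, the average area size is 7,266,903 / 5000, i.e. roughly 1450 images, which matches the figure quoted above.</p>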
      <p>C. Creating bag-of-excluded-words (BoEW): In this
step, we attempt to create a set of non-geographic tags, i.e.
tags that should not be taken into account for geotagging.
To this end, we first apply LDA on the whole dataset (global
LDA) using 500 topics and 50 terms per topic. For each
of the resulting topics, we compute its frequency per area<sup>2</sup>,
thus ending up with a topic-area distribution (histogram).
This essentially corresponds to the spatial distribution of the
topic. We then compute the entropy of this distribution and flag the
topics whose entropy exceeds a threshold of 180 (empirically selected).
The terms of these topics form the bag-of-excluded-words.
Examples of excluded words include landscape,
35mm, kodak, boats, christmas, sunset, and tree, while
terms that were mistakenly considered non-geographic
include europe, usa, atlanticocean and newmexico, most
probably due to their large geographic span.</p>
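      <p>The entropy-based selection of excluded words can be sketched as below. The sketch uses the Shannon entropy of the normalised histogram with natural logarithms, which is an assumption: the exact entropy definition behind the threshold of 180 is not specified here, so the threshold in the example is on a different, purely illustrative scale. All names and the toy topics are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def entropy(histogram):
    """Shannon entropy (natural log) of a topic's per-area frequency histogram."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in histogram if c > 0)


def bag_of_excluded_words(topic_terms, topic_area_counts, threshold):
    """Collect the terms of every topic whose spatial entropy exceeds `threshold`."""
    excluded = set()
    for topic_id, counts in topic_area_counts.items():
        if entropy(counts) > threshold:
            excluded.update(topic_terms[topic_id])
    return excluded


# Toy example with four areas: topic "generic" is spread evenly over all
# areas, topic "local" is concentrated in a single area.
topic_terms = {"generic": {"sunset", "tree"}, "local": {"whitetower"}}
topic_area_counts = {"generic": [10, 10, 10, 10], "local": [40, 0, 0, 0]}
boew = bag_of_excluded_words(topic_terms, topic_area_counts, threshold=1.0)
```
]]></preformat>
      <p>A topic spread uniformly over many areas has high entropy and is treated as non-geographic, which is also why broad terms such as europe or usa can end up excluded.</p>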
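      <p>Having created the geographical-tag model, we then
proceed with the online location estimation step for each test
image <inline-formula><tex-math>T_i</tex-math></inline-formula>. We first remove the tags of the image that are either
machine-tags or belong to the BoEW, ending up with the set
<inline-formula><tex-math>t_i</tex-math></inline-formula> of clean tags. We then compute the Jaccard similarity
between this set of tags and the set of tags for each topic of
each local LDA, where <inline-formula><tex-math>\tau_{jk}</tex-math></inline-formula> denotes the k-th topic of area j:
<disp-formula><label>(1)</label><tex-math>s_{ijk} = \frac{|t_i \cap \tau_{jk}|}{|t_i \cup \tau_{jk}|}</tex-math></disp-formula></p>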
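      <p>Equation (1) translates directly into a few lines of Python. The function name and the example tag sets are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two tag collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Clean tags of a test image versus the terms of one local topic:
# 2 shared terms out of 5 distinct terms overall.
clean_tags = {"thessaloniki", "whitetower", "sea"}
topic = {"thessaloniki", "whitetower", "aristotelous", "port"}
s = jaccard(clean_tags, topic)
```
]]></preformat>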
      <p>In a first variant of the approach (run 1), the image is
assigned to the area with the highest Jaccard similarity with
any local topic, <inline-formula><tex-math>\mathrm{area}_i = \arg\max_{j,k} s_{ijk}</tex-math></inline-formula>. In the second
variant of the approach (run 3), we first compute the mean
Jaccard similarity <inline-formula><tex-math>\bar{s}_{ij}</tex-math></inline-formula> for each area over all topics with at least
one common tag with the test image, and then select the
area with the highest mean similarity, <inline-formula><tex-math>\mathrm{area}_i = \arg\max_{j} \bar{s}_{ij}</tex-math></inline-formula>.</p>
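      <p>The two area-selection variants can behave differently on the same input, as the following sketch shows. The data layout (a dictionary from area id to a list of topic term sets), the function names and the toy topics are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_area_max(tags, local_topics):
    """Run-1 style: the area that owns the single best-matching topic."""
    return max(
        local_topics,
        key=lambda area: max(jaccard(tags, t) for t in local_topics[area]),
    )


def select_area_mean(tags, local_topics):
    """Run-3 style: highest mean similarity over topics sharing a tag with the image."""
    def mean_similarity(area):
        sims = [jaccard(tags, t) for t in local_topics[area] if set(tags) & set(t)]
        return sum(sims) / len(sims) if sims else 0.0
    return max(local_topics, key=mean_similarity)


tags = {"a", "b"}
local_topics = {
    # One perfect topic (1.0) and one weak topic (1/6): max 1.0, mean about 0.58.
    "X": [{"a", "b"}, {"a", "x", "y", "z", "w"}],
    # A single good topic (2/3): max and mean both about 0.67.
    "Y": [{"a", "b", "c"}],
}
area_max = select_area_max(tags, local_topics)
area_mean = select_area_mean(tags, local_topics)
```
]]></preformat>
      <p>Here the maximum rule picks area X because of its one perfect topic, while the mean rule prefers area Y, whose matching topics are more consistently similar.</p>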
      <p>
        Having assigned the test image to an area, we then adopt
the location estimation technique of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: we first determine
the k most similar training images (using Jaccard similarity
on the corresponding sets of tags) and use their
center-of-gravity (weighted by the similarity values) as the location
estimate for the test image. For test images with no clean
tags, we set their location equal to the centroid of the largest
area (a kind of maximum likelihood estimation).
      </p>
      <p><sup>2</sup> This is computed by counting the images contained in the area
that activate the topic. These are determined by applying a
threshold (= 0.9 1500) on the image-topic distribution.</p>
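      <p>The similarity-weighted centre of gravity over the k most similar training images can be sketched as follows. The value of k, the plain arithmetic averaging of latitude and longitude (which ignores the curvature of the Earth and longitude wrap-around), the function names and the toy data are assumptions of this sketch.</p>
      <preformat><![CDATA[
```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_location(test_tags, training_images, k=3):
    """Weighted centre of gravity of the k training images most similar in tags.

    `training_images` is a list of (tags, (lat, lon)) pairs from the chosen area.
    Returns None when no retrieved image shares a tag with the test image.
    """
    scored = sorted(
        ((jaccard(test_tags, tags), loc) for tags, loc in training_images),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None
    lat = sum(sim * loc[0] for sim, loc in scored) / total
    lon = sum(sim * loc[1] for sim, loc in scored) / total
    return lat, lon


area_images = [
    ({"a", "b"}, (40.0, 22.0)),  # similarity 1.0
    ({"a"}, (42.0, 24.0)),       # similarity 0.5
    ({"z"}, (0.0, 0.0)),         # similarity 0.0, not among the top 2
]
estimate = estimate_location({"a", "b"}, area_images, k=2)
```
]]></preformat>
      <p>The estimate is pulled towards the more similar image: (1.0 * 40 + 0.5 * 42) / 1.5 for the latitude and (1.0 * 22 + 0.5 * 24) / 1.5 for the longitude.</p>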
    </sec>
    <sec id="sec-4">
      <title>2.2 Placing images using visual features</title>
      <p>
        In the offline analysis step, we extract optimized SURF+VLAD
features from each image in the training set (over
8.5 million images) and index the features using the
IVFADC scheme proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The SURF+VLAD vectors
are based on multiple vocabulary aggregation (four visual
vocabularies with k = 128 centroids each) and joint
dimensionality reduction (to only 48 dimensions for efficiency)
with PCA and whitening [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The vectors were then
quantized using a coarse quantizer with 1024 centroids, and
Product Quantization was applied to the residual vectors using an
8 × 10 scheme [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which led to a signature of 112 bits
for each image. We made the implementation of the above
feature extraction and indexing methods publicly available<sup>3</sup>.
      </p>
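      <p>To make the indexing idea concrete, the following toy sketch encodes a vector with a coarse quantizer followed by product quantization of the residual, in the spirit of IVFADC. It is not the released implementation: the codebooks are tiny hand-written lists rather than learned centroids, and the dimensions bear no relation to the 48-dimensional vectors and 1024 coarse centroids used in our runs.</p>
      <preformat><![CDATA[
```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to `vec` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def encode(vec, coarse, sub_codebooks):
    """Return (coarse cell id, list of sub-quantizer indices for the residual)."""
    cell = nearest(vec, coarse)
    residual = [v - c for v, c in zip(vec, coarse[cell])]
    m = len(sub_codebooks)   # number of sub-vectors
    d = len(vec) // m        # dimensions per sub-vector
    codes = [nearest(residual[j * d:(j + 1) * d], sub_codebooks[j]) for j in range(m)]
    return cell, codes


# Hypothetical toy codebooks: 2 coarse centroids, 2 sub-quantizers of 2 entries.
coarse = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 5.0, 5.0]]
sub_codebooks = [
    [[0.0, 0.0], [0.1, -0.1]],
    [[0.0, 0.2], [1.0, 1.0]],
]
code = encode([1.1, 0.9, 5.0, 5.2], coarse, sub_codebooks)
```
]]></preformat>
      <p>Only the cell id and the small integer indices are stored per image, which is what keeps the per-image signature compact.</p>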
      <p>For the online location estimation step, we retrieve the top
k visually most similar images and use those for the estimate.
In the first variant (run 2), k = 1 and we simply consider the
location of the most similar image as the location of the test
image. In the second variant (run 4), k = 20 and we apply a
simple incremental spatial clustering scheme, in which if the
j-th image (out of the k most similar) is within 1 km of
the closest one of the previous j − 1 images, it is assigned to
its cluster, otherwise it forms its own cluster. In the end, the
largest cluster (or the first in case of equal size) is selected
and its centroid is used as a location estimate.</p>
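      <p>The incremental spatial clustering of the second variant can be sketched as below. The use of the haversine distance with an Earth radius of 6371 km, the arithmetic centroid, the function names and the toy coordinates are assumptions; the sketch also assumes a non-empty list of retrieved images.</p>
      <preformat><![CDATA[
```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_estimate(neighbours, radius_km=1.0):
    """Incrementally cluster retrieved locations (most similar first).

    Each image joins the cluster of its closest predecessor if that
    predecessor is within `radius_km`; otherwise it starts a new cluster.
    Returns the centroid of the largest cluster (the first one on ties).
    """
    clusters = []  # list of lists of points
    seen = []      # (point, cluster index) for all previous images
    for p in neighbours:
        idx = None
        if seen:
            closest, closest_idx = min(seen, key=lambda s: haversine_km(p, s[0]))
            if haversine_km(p, closest) <= radius_km:
                idx = closest_idx
        if idx is None:
            clusters.append([])
            idx = len(clusters) - 1
        clusters[idx].append(p)
        seen.append((p, idx))
    best = max(clusters, key=len)  # max() keeps the first maximal cluster
    n = len(best)
    return sum(q[0] for q in best) / n, sum(q[1] for q in best) / n


# The most similar image is isolated; the next three lie close together.
neighbours = [
    (40.6400, 22.9400),
    (48.8584, 2.2945), (48.8590, 2.2950), (48.8580, 2.2940),
]
estimate = cluster_estimate(neighbours)
```
]]></preformat>
      <p>In this example the single nearest neighbour would place the image in a different city, whereas the clustering variant follows the group of three mutually close images.</p>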
    </sec>
    <sec id="sec-5">
      <title>3. RUNS AND RESULTS</title>
      <p>As described above, we prepared two tag-based runs, which
we refer to as tmax (run 1) and tmean (run 3), and two
visual runs, which we refer to as vnn (run 2) and vclust
(run 4). A fifth run, referred to as hyb, was prepared using
a hybrid strategy: if the test image had at least one clean
tag associated with it, the tmax approach was selected;
otherwise, vnn was used. All runs were prepared for the full
test set of 262,000 images. The tag-based runs took
approximately 23 hours to complete (316 ms per image), while the
visual ones took only 13 hours (179 ms per image). These
times were recorded on a commodity quad-core machine (2.40 GHz)
with 8 GB RAM and a 1 TB 7200 rpm hard drive.</p>
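      <p>The fall-back logic of the hybrid run amounts to a single branch, sketched below. The two estimator functions are hypothetical stand-ins that return fixed values; they only serve to show which branch is taken.</p>
      <preformat><![CDATA[
```python
def hybrid_estimate(clean_tags, image, tag_estimator, visual_estimator):
    """Use the tag-based estimator when clean tags exist, else the visual one."""
    if clean_tags:
        return tag_estimator(clean_tags)
    return visual_estimator(image)


# Hypothetical stand-ins for the tmax and vnn estimators.
def fake_tmax(tags):
    return ("tag", (40.63, 22.94))


def fake_vnn(image):
    return ("visual", (48.86, 2.29))


with_tags = hybrid_estimate({"thessaloniki"}, "img-1", fake_tmax, fake_vnn)
without_tags = hybrid_estimate(set(), "img-2", fake_tmax, fake_vnn)
```
]]></preformat>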
      <p>Table 1 summarises the obtained results for the full test
set. The best performance in terms of median error was
attained by tmax, closely followed by hyb. In terms of
accuracy, the hyb run performs marginally better than tmax in
the low ranges (&lt;1km up to &lt;1000km), which is expected
since it has a fall-back scheme for the images with no tags.
A noteworthy result is the very low performance of the visual
runs. Potential reasons for this poor performance include
the very “cheap” feature extraction and indexing settings
(chosen for efficiency reasons) and the nature of the training and
test sets, i.e. the training set did not contain images that
were sufficiently visually similar to those of the test set.</p>
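      <p>For reference, the reported measures can be computed from per-image errors as in the following sketch. The function names and the five example errors are made up for illustration and are not results of any run.</p>
      <preformat><![CDATA[
```python
import statistics


def accuracy_within(errors_km, radius_km):
    """Fraction of test images whose location error is below `radius_km`."""
    return sum(1 for e in errors_km if e < radius_km) / len(errors_km)


def summarise(errors_km, radii=(1, 10, 100, 500, 1000)):
    """Accuracy at each radius plus the median error, keyed by measure name."""
    report = {f"acc({r}km)": accuracy_within(errors_km, r) for r in radii}
    report["median error"] = statistics.median(errors_km)
    return report


# Made-up errors (in km) for five test images.
errors = [0.5, 8.0, 90.0, 650.0, 4000.0]
report = summarise(errors)
```
]]></preformat>
      <p>The median is insensitive to a few very large errors, which is why it is reported alongside the accuracy at fixed radii.</p>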
      <p>We also computed the selected performance measures for
the test subsets proposed by the organizers. Surprisingly, we
could not identify a trend with respect to the test set size.
For instance, on the small test set (5,300 images), the best median
error (by hyb) was 1423 km, while for the test set of 53,000
images, the best median error (by tmax) was just 521 km.
The accuracy measure was more stable, e.g. in the case of
tmax ranging from 9.62% to 10.95% (for &lt;1 km) and from
30.34% to 38.01% (for &lt;10 km). Similar fluctuations were
noted for all other runs, which indicates that each test set
has an inherent degree of placeability.</p>
      <p><sup>3</sup> https://github.com/socialsensor/multimedia-indexing</p>
      <p>Table 1: Measures reported for each run: acc(1km), acc(10km), acc(100km), acc(500km), acc(1000km), median error.</p>
    </sec>
    <sec id="sec-6">
      <title>4. FUTURE WORK</title>
      <p>In the future, we plan to conduct a more thorough
analysis on the different sources of error for the proposed scheme,
and extend it to also include additional metadata of the
input images, as well as external resources. Regarding the
error analysis, we will look into the impact of (a) the
number of topics and terms per topic both for the local and the
global LDAs, (b) the selection of the entropy threshold, (c)
the number K of geographical areas, (d) the type and
quality of visual features, and (e) the use of better visual matching
methods (e.g. geometric verification applied on the list of
top k most similar images). Regarding the use of additional
metadata, we plan to incorporate the author of an image as
an indicator of the image location. Finally, we will consider
incorporating resources such as gazetteers, as well as
additional geotagged image data collected from the Web with
the goal of increasing the visual coverage of the training set.
</p>
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the SocialSensor FP7 project,
partially funded by the EC under contract number 287975.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Trevisiol</surname>
          </string-name>
          .
          <source>Working Notes for the Placing Task at MediaEval 2013</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Chum</surname>
          </string-name>
          .
          <article-title>Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
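          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          , and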
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Product quantization for nearest neighbor search</article-title>
          .
          <source>IEEE Trans. PAMI</source>
          ,
          <volume>33</volume>
          (
          <issue>1</issue>
          ):
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
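          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .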
          <article-title>An empirical study on the combination of SURF features with VLAD vectors for image search</article-title>
          .
          <source>In WIAMIS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Finding locations of Flickr resources using language models and similarity search</article-title>
          .
          <source>ICMR '11</source>
          , pages
          <fpage>48:1</fpage>
          -
          <lpage>48:8</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>