<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Imcube @ MediaEval 2015 Placing Task: A Hierarchical Approach for Geo-referencing Large-Scale Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pascal Kelm</string-name>
          <email>kelm@imcube.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Schmiedeke</string-name>
          <email>schmiedeke@imcube.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lutz Goldmann</string-name>
          <email>goldmann@imcube.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Imcube Labs GmbH Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper describes Imcube's geo-referencing approach, experiments, and results at the MediaEval 2015 Placing Task benchmark. The task requires participants to develop techniques that automatically annotate Flickr photos and videos with their geolocation (latitude and longitude) in two individual sub-tasks. A hierarchical approach combining textual, visual and optional routing information is employed. The results show that for 24% of the images (locale-based sub-task) and for 96% of the images (mobile-based sub-task) the error of the estimated location is below city level (10 km).</p>
      </abstract>
      <kwd-group>
        <kwd>Mobile-based Sub-task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The MediaEval Placing Task 2015 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] requires participants to
use systems that automatically estimate the location of Flickr
photos and videos using any or all of metadata, visual/audio content,
and user information.
      </p>
      <p>This year the task introduces two new sub-tasks: The
locale-based sub-task addresses the prediction of missing locations for
individual images in an entity-centred way by choosing a location
from a given ground-truth hierarchy. The mobile-based sub-task
addresses predicting missing locations within a sequence of photos
shot by a travelling photographer.</p>
      <p>
        Similar to the Placing Task 2014, the training set (4,672,382
photos &amp; 22,767 videos) and test sets (931,573 photos &amp; 18,316 videos)
were sampled from the YFCC100M [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] dataset. One important
difference from past editions is that this year the distances between
the predicted and the ground truth geographic coordinates are
evaluated using Karney’s formula [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is based on the assumption
that the shape of the Earth is an oblate spheroid.
      </p>
      <p>In this paper, we present an approach that combines different
textual and visual descriptors by applying a hierarchical scheme to
merge information obtained from several ranked lists.</p>
    </sec>
    <sec id="sec-2">
      <title>SYSTEM DESCRIPTION</title>
      <p>This section describes the different methods created to solve the
challenges of the locale-based and mobile-based sub-tasks.</p>
      <p>
        The proposed approach is composed of different steps: (i)
hierarchical clustering of the provided training set by latitude and
longitude, (ii) visual and textual feature extraction, (iii) generation
of ranked lists, (iv) re-ranking and (v) estimation of the location for
each test item.
The hierarchy provided contains 221,458 leaf nodes (locations)
that are spread across 253 countries. Below the second level
(Country/State) we segment the states into 360 × 180 regions
according to the meridians and parallels. We also apply a smaller grid
of segments with half the spatial dimensions to increase the
accuracy and to minimize the computational cost. Each geo-referenced
training image is assigned to its corresponding grid cell at the
lowest level [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For each layer of the hierarchy a ranking model is used
to iteratively assign a test image to the most likely spatial segment.
      </p>
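<p>Step (i), the assignment of geo-referenced training images to grid cells, can be sketched as follows. This is a minimal illustration rather than the actual implementation: the indexing scheme and function names are assumptions, chosen so that a 1° cell size yields the 360 × 180 regions mentioned above and 0.5° yields the finer half-size grid.</p>

```python
import math

def grid_cell(lat, lon, cell_deg):
    """Map a geographic coordinate to a grid cell aligned with
    meridians and parallels; cell_deg = 1.0 gives 360 x 180 cells,
    cell_deg = 0.5 the finer grid with half the spatial dimensions."""
    col = int(math.floor((lon + 180.0) / cell_deg))
    row = int(math.floor((lat + 90.0) / cell_deg))
    # clamp pole / antimeridian edge cases into the last cell
    col = min(col, int(360 / cell_deg) - 1)
    row = min(row, int(180 / cell_deg) - 1)
    return row, col

def assign_training_images(images, cell_deg=0.5):
    """Bucket geo-referenced training images into lowest-level cells."""
    cells = {}
    for img_id, lat, lon in images:
        cells.setdefault(grid_cell(lat, lon, cell_deg), []).append(img_id)
    return cells

# toy example: two nearby photos fall into the same half-degree cell
cells = assign_training_images([("a", 52.52, 13.40), ("b", 52.53, 13.41)])
```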
      <p>
        Due to the large size of the dataset and the limited processing
time, we did not apply a hierarchical language model approach with
multiple modalities [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] but adopted a textual re-ranking model. The
vocabulary of the spatial locations includes stemmed words (Porter
stemmer, http://tartarus.org/martin/PorterStemmer/) from
the tags, titles and descriptions. The text similarity function used
is BM25 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as implemented by Lucene (http://lucene.apache.org/core/). The
best results for textual similarity computations were achieved with a
training set composed of both image and video metadata, regardless of
the kind of test query.
      </p>
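<p>The BM25 ranking function [4] used for the textual similarity can be sketched as follows. This is a self-contained illustration of the formula rather than Lucene's implementation; the toy corpus stands in for the stemmed tag, title and description vocabularies of the spatial segments.</p>

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (bag of stemmed words) against a query
    with the BM25 ranking function of Robertson et al. [4]."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # non-negative IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + len(doc_terms) / avgdl))
    return score

# rank three toy spatial segments for the query "berlin"
corpus = [["berlin", "wall"], ["paris", "tower"], ["berlin", "zoo"]]
scores = [bm25_score(["berlin"], doc, corpus) for doc in corpus]
```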
      <p>The visual similarity relies on a wide spectrum of visual
features to describe the color and texture characteristics of the video
key frames and photos. These image descriptions are pooled for
each leaf node at the different hierarchy levels using the mean and
median value of each descriptor. A kd-tree that contains all
appropriate segments is built for each descriptor in each leaf node. This
procedure speeds up the subsequent search because only a portion of
the data needs to be considered during nearest-neighbour search.</p>
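<p>The per-descriptor nearest-neighbour search over pooled segment descriptors can be sketched with a kd-tree as follows. The use of SciPy, the descriptor dimensionality and the segment names are illustrative assumptions; the paper does not specify a library.</p>

```python
import numpy as np
from scipy.spatial import cKDTree

# pooled descriptors: one mean (or median) vector per spatial segment
rng = np.random.default_rng(0)
segment_ids = ["cell_a", "cell_b", "cell_c", "cell_d"]
pooled = rng.standard_normal((4, 64))     # e.g. a 64-dim colour/texture descriptor

tree = cKDTree(pooled)                    # one tree per descriptor and node

def nearest_segments(query_descriptor, k=2):
    """Return the k segments whose pooled descriptor is closest to the query."""
    dist, idx = tree.query(query_descriptor, k=k)
    return [(segment_ids[i], float(d)) for i, d in zip(idx, dist)]

best = nearest_segments(pooled[1] + 0.01)  # a query very close to cell_b
```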
      <p>Starting at the top of the hierarchy the nodes of the current level
are ranked according to their distance to the test image. The overall
distance is obtained by fusing the textual and the visual distances
using weighted summation. These weights differ between the two fusion
experiments, as described in the results section. Then the node with the
lowest distance becomes the most likely location at the given level
of granularity. By iteratively traversing the hierarchy the method
determines the leaf node that has the highest similarity to the test
image and returns the corresponding geolocation.</p>
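<p>The weighted fusion and the iterative top-down traversal can be sketched as follows. The dictionary-based hierarchy, the distance callbacks and the per-level weight table are illustrative assumptions; setting the visual weight to zero on coarse levels corresponds to the Fusion2 configuration described in the results section.</p>

```python
def fused_distance(d_text, d_visual, w_text, w_visual):
    """Weighted summation of the textual and visual distances."""
    return w_text * d_text + w_visual * d_visual

def locate(item, root, text_dist, visual_dist, weights):
    """Walk down the hierarchy, keeping the lowest-distance node per level;
    the geolocation of the winning leaf is returned."""
    node = root
    while node["children"]:
        node = min(
            node["children"],
            key=lambda c: fused_distance(
                text_dist(item, c), visual_dist(item, c),
                *weights.get(c["level"], (1.0, 0.0)),  # text-only by default
            ),
        )
    return node["latlon"]

# toy hierarchy with a single level of two leaf segments
root = {
    "children": [
        {"level": 1, "children": [], "name": "berlin", "latlon": (52.52, 13.40)},
        {"level": 1, "children": [], "name": "paris", "latlon": (48.86, 2.35)},
    ]
}
guess = locate(
    "berlin", root,
    text_dist=lambda item, c: 0.0 if item == c["name"] else 1.0,
    visual_dist=lambda item, c: 0.5,
    weights={1: (0.5, 0.5)},  # Fusion1-style equal weights on this level
)
```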
      <p>For this task, we pursue an approach similar to that described in
section 2.1, but without the hierarchical layer model and with
additional routing information.</p>
      <p>We use OpenStreetMap (http://www.openstreetmap.org/) to find the
shortest route between two photos that have associated geographic
coordinates. Tracks with a distance smaller than 2 km are routed by
pedestrian navigation; longer tracks are routed by car navigation. The
results of the routing run are predicted, linearly in travel time, to be
a location on these tracks. For test images that do not have both
chronological neighbours, the neighbouring route segment is extrapolated
while considering their distance in time. The other runs additionally
use textual and visual features to determine the most similar image
along the track.</p>
      <p>[Table residue (routing run): #Items 2, 81, 1593, 6501, 9171, 9659, 9674 and 9675, i.e. 0.02 %, 0.84 %, 16.47 %, 67.19 %, 94.79 %, 99.83 %, 99.99 % and 100.00 % of the test items; the corresponding error margins are not recoverable from the source.]</p>
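<p>The linear prediction in travel time can be sketched as follows, under the simplifying assumption that the routed track is a polyline traversed at constant speed with roughly equally spaced vertices; function and parameter names are illustrative.</p>

```python
def interpolate_on_route(route, t_a, t_b, t_photo):
    """Place an un-tagged photo on a route linearly in travel time.

    route    -- list of (lat, lon) points from the routing service
    t_a, t_b -- capture times of the geo-tagged neighbours (seconds)
    t_photo  -- capture time of the photo to localise, t_a <= t_photo <= t_b
    """
    frac = (t_photo - t_a) / (t_b - t_a)   # fraction of travel time elapsed
    pos = frac * (len(route) - 1)          # position along the polyline
    i = min(int(pos), len(route) - 2)
    f = pos - i
    lat = route[i][0] + f * (route[i + 1][0] - route[i][0])
    lon = route[i][1] + f * (route[i + 1][1] - route[i][1])
    return lat, lon

# halfway in time -> halfway along a straight two-point track
mid = interpolate_on_route([(0.0, 0.0), (0.0, 1.0)], t_a=0, t_b=100, t_photo=50)
```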
      <p>
        The visual similarity is determined as described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Densely
sampled local features (pairwise averaged DCT coefficients) are
represented as a histogram quantised by vector quantisation (a
clusterless bag-of-visual-words approach) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Histogram intersection of the BoW representations is used as the
similarity metric between the training images and the image to be
geo-tagged. The two visual runs differ in the assignment of coordinates:
the visual run assigns the coordinates of the visually most similar
image from the training data, while the weighted visual run calculates
the coordinates as the centroid of all training images weighted by their
visual similarity.
      </p>
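<p>The histogram-intersection similarity and the two coordinate-assignment strategies can be sketched as follows. The toy histograms and the planar (non-spherical) centroid are simplifying assumptions; the actual BoW histograms are built from the DCT-based features of [5].</p>

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two L1-normalised bag-of-visual-words histograms."""
    return float(np.minimum(h1, h2).sum())

def visual_runs(test_hist, train_hists, train_coords):
    """Coordinates for both visual runs: nearest image, and the
    similarity-weighted centroid (a planar simplification)."""
    sims = np.array([histogram_intersection(test_hist, h) for h in train_hists])
    nearest = train_coords[int(sims.argmax())]               # plain visual run
    weights = sims / sims.sum()
    centroid = (np.asarray(train_coords, float) * weights[:, None]).sum(axis=0)
    return nearest, tuple(centroid)                          # weighted visual run

# toy example: the test histogram matches the first training image exactly
nearest, centroid = visual_runs(
    np.array([1.0, 0.0]),
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    [(52.52, 13.40), (48.86, 2.35)],
)
```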
      <p>The textual run uses the same textual similarity as the
locale-based sub-task, but the training images are restricted to be located within a
corridor of 0.001 degree along the estimated routes.</p>
    </sec>
    <sec id="sec-3">
      <title>RESULTS: Locale-based Sub-task</title>
      <p>Table 1 shows the accuracies at selected error margins for the
different textual and visual runs. Based on the experience from
previous years we expected the textual run to perform better than the
visual run due to the visual ambiguity at coarser levels. The
results clearly show that the visual-only approach has low accuracy
across all error margins when compared to the textual-only approach.
For combining the textual and visual information we designed two
fusion experiments. The first fusion model (Fusion1) combines the
estimations of the textual and visual models equally on each hierarchy
level. The second fusion model (Fusion2) only combines these estimations
on the three finest hierarchy levels; on the coarser hierarchy levels,
only the estimation of the textual model is used. This combination
yields more accurate results, since visual features are not able to
resolve ambiguities at large scale (i.e., most cityscapes look similar).
The results show that fusing visual and textual information only on the
finer levels (Fusion2) improves the performance for error margins
between 10 m and 100 km.</p>
    </sec>
    <sec id="sec-4">
      <title>Mobile-based Sub-task</title>
      <p>The results of the locale-based sub-task show that the best
performance is achieved with a multimodal fusion approach that uses
textual information on coarser levels and the combination of visual
and textual information on finer ones. The results of the
mobile-based sub-task show that using visual and textual information
besides the routing information improves the location estimation. The
low correlation between the localization errors of the different
approaches suggests that more advanced fusion approaches will lead to
better results. Another interesting direction for improving the accuracy
of the visual approach in both sub-tasks is to use local features to
distinguish landmarks and points of interest.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Laere</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          .
          <article-title>The Placing Task at MediaEval 2015</article-title>
          .
          <source>MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Karney</surname>
          </string-name>
          .
          <article-title>Algorithms for geodesics</article-title>
          .
          <source>Journal of Geodesy</source>
          ,
          <volume>87</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kelm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmiedeke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Ekambaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramchandran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation</article-title>
          .
          <source>Proceedings of the 2nd ACM International Workshop on Geotagging and Its Applications in Multimedia, GeoMM '13</source>
          , pages
          <fpage>7</fpage>
          -
          <lpage>12</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In Proceedings of TREC-3 '95</source>
          , pages
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmiedeke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kelm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>DCT-based features for categorisation of social media in compressed domain</article-title>
          .
          <source>2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP)</source>
          , pages
          <fpage>295</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Poland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Borth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>The new data and new challenges in multimedia research</article-title>
          .
          <source>CoRR</source>
          , abs/1503.01817,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>