<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extended Abstract on: Minimal Structure from Motion Representation for Image Geocoding</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright (c) by the paper's authors. Copying permitted for private and academic purposes. In: A. Comber, B. Bucher, S. Ivanovic (eds.): Proceedings of the 3rd AGILE PhD School, Champs-sur-Marne, France</institution>
          ,
          <addr-line>15-17 September 2015, published at</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jorge Gustavo Rocha Algoritmi Research Centre, University of Minho</institution>
          ,
          <addr-line>4710-057 Braga</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Nuno Amorim Algoritmi Research Centre, University of Minho</institution>
          ,
          <addr-line>4710-057 Braga</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this extended abstract we present our early work on structure from motion data compression for image geocoding. We discuss the advantages of image geocoding over standard trilateration solutions and identify desired characteristics of an image geocoding system: accuracy, speed and scalability. We hypothesize that scalability impacts both speed and accuracy and should be further researched. Hence, in this thesis we ask what the minimal representation of structure from motion is that still allows the location and orientation of new photographs to be computed efficiently.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The recent explosion of images on social media has led Computer Vision
researchers to take an increased interest in image processing algorithms. Among
these, structure from motion (SfM) has gained particular relevance due to its
applications. Given a set of photographs of the same scene, the algorithm
recovers the 3D structure (point clouds) by treating the motion from photograph
to photograph as stereo vision. Provided with the correct focal lengths and
distortion parameters, it can build point clouds with impressive precision.
Moreover, new photographs can be added at any time, making it possible to keep
models up to date in an ever-changing world.</p>
      <p>The applications of SfM range from simple visualization to more complex
tasks such as 3D modeling and geographical location recognition. In the latter,
photographs with unknown location are compared against a geocoded database to
compute their GPS coordinates.</p>
    </sec>
    <sec id="sec-2">
      <title>2 State of the Art</title>
      <p>
        Early work on image geocoding started with methods based on a database of
geocoded photographs [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Image features are extracted from new photographs,
and feature matching is performed to retrieve similar photographs from the database.
Two-view geometry is then applied to compute the pose of the query photographs.
      </p>
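      <p>As an illustration of this two-view approach, the following sketch fits a fundamental matrix with the eight-point algorithm inside a RANSAC loop, in Python with NumPy. The function names and the synthetic test scene are our own illustrative choices, not taken from any of the cited systems.</p>
      <preformat><![CDATA[
```python
import numpy as np

def eight_point(p1, p2):
    """Fundamental matrix F from >= 8 correspondences (x2' F x1 = 0)."""
    x1, y1 = p1[:, 0], p1[:, 1]
    x2, y2 = p2[:, 0], p2[:, 1]
    A = np.stack([x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2,
                  x1, y1, np.ones_like(x1)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)          # enforce the rank-2 constraint
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def sampson_error(F, p1, p2):
    """First-order geometric error of each correspondence w.r.t. F."""
    h1 = np.hstack([p1, np.ones((len(p1), 1))])
    h2 = np.hstack([p2, np.ones((len(p2), 1))])
    Fx1, Ftx2 = h1 @ F.T, h2 @ F
    num = np.sum(h2 * Fx1, axis=1) ** 2
    return num / (Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2
                  + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2)

def ransac_fundamental(p1, p2, iters=500, thresh=1.0, seed=0):
    """Fit F on random 8-point samples, keep the model with most inliers."""
    rng = np.random.default_rng(seed)
    best_F, best_in = None, np.zeros(len(p1), bool)
    for _ in range(iters):
        idx = rng.choice(len(p1), 8, replace=False)
        F = eight_point(p1[idx], p2[idx])
        inliers = sampson_error(F, p1, p2) < thresh
        if inliers.sum() > best_in.sum():
            best_F, best_in = F, inliers
    return best_F, best_in

# Sanity check on a synthetic two-view scene (60 noise-free points).
rng = np.random.default_rng(1)
X = np.hstack([rng.uniform(-1, 1, (60, 3)) + [0, 0, 5], np.ones((60, 1))])
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), [[-1.0], [0.2], [0.0]]])
p1 = X @ P1.T; p1 = p1[:, :2] / p1[:, 2:]
p2 = X @ P2.T; p2 = p2[:, :2] / p2[:, 2:]
F, inliers = ransac_fundamental(p1, p2)
```
]]></preformat>
      <p>The repeated random sampling is precisely why this step dominates the cost of database-of-photographs geocoding: each query may require many such loops against many candidate database images.</p>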
      <p>
        Since two-view geometry is computationally expensive, as it usually relies
on RANSAC-based methods to compute the relative pose of images, interest grew in
using structure from motion to support the geocoding process. Image features are
still extracted from new photographs, but they are compared directly to 3D point
clouds. Rather than performing two-view geometry, a projection matrix is computed
that maps 2D points (query photograph) to 3D points (database point clouds).
Techniques such as nearest-neighbor feature matching [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and data structures such
as vocabulary trees [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] are often used in state-of-the-art work to greatly reduce the
number of matches that must be performed. Faster computation can be achieved by
resorting to parallel processing on CPUs and GPUs, as shown in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which
achieved real-time image geocoding with a GPU implementation of a vocabulary
tree whenever new photographs successfully matched the first document retrieved
from their vocabulary tree queries.
      </p>
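      <p>The 2D-to-3D step can be sketched as follows, using a brute-force nearest-neighbor ratio test and the Direct Linear Transform to recover the projection matrix. This is a simplified illustration with our own helper names, not the optimized matchers or pose solvers of the cited systems.</p>
      <preformat><![CDATA[
```python
import numpy as np

def match_descriptors(query, model, ratio=0.8):
    """Brute-force nearest neighbour with Lowe's ratio test.
    Returns indices of accepted (query feature, model point) matches."""
    d = np.linalg.norm(query[:, None, :] - model[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :2]            # two closest model points
    best = d[np.arange(len(query)), nn[:, 0]]
    second = d[np.arange(len(query)), nn[:, 1]]
    keep = best < ratio * second                 # keep unambiguous matches only
    return np.flatnonzero(keep), nn[keep, 0]

def dlt_projection(X3d, x2d):
    """Direct Linear Transform: 3x4 projection matrix P from >= 6
    2D-3D correspondences, solving x ~ P X by SVD."""
    rows = []
    for Xw, (u, v) in zip(X3d, x2d):
        Xh = np.append(Xw, 1.0)
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)

# Sanity check: recover a known camera from exact 2D-3D correspondences.
rng = np.random.default_rng(0)
X3d = rng.uniform(-1, 1, (20, 3)) + [0, 0, 6]
P_true = np.array([[400.0, 0, 320, 10], [0, 400, 240, -5], [0, 0, 1, 0]])
xh = np.hstack([X3d, np.ones((20, 1))]) @ P_true.T
x2d = xh[:, :2] / xh[:, 2:]
P = dlt_projection(X3d, x2d)
reproj = np.hstack([X3d, np.ones((20, 1))]) @ P.T
max_err = np.abs(reproj[:, :2] / reproj[:, 2:] - x2d).max()
```
]]></preformat>
      <p>The quadratic cost of the distance matrix in <monospace>match_descriptors</monospace> is what vocabulary trees and prioritized matching are designed to avoid at scale.</p>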
    </sec>
    <sec id="sec-3">
      <title>3 Motivation</title>
      <p>The advantages of image geocoding are clear when compared to other geocoding
systems. First of all, image geocoding does not rely on a trilateration process,
which means it can compute coordinates in indoor environments as long as there is
a WiFi connection over which to issue the geocoding request. Also, image geocoding
recovers the heading of the queried photographs, allowing the direction in which
each photograph was taken to be computed in a single query. This reinforces the
utility of image geocoding in supporting guidance systems. Lastly, the only (and
ideal) device required for image geocoding is a smartphone, since it contains both
a camera and a WiFi connection. As smartphones are now omnipresent in our society,
the cost of deploying an image geocoding solution is greatly reduced.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Problem to be Solved</title>
      <p>For image geocoding to replace standard trilateration solutions, three
characteristics are desired: accuracy, speed and scalability. Starting with
accuracy, a good pose estimate is attained when the database contains data related
to the queried photograph. Feature matching is used to ascertain which data to use
when computing the pose. Assuming the best-case scenario, in which the focal
length and distortion parameters are known, impressive precision can be achieved
with a single photograph.</p>
      <p>Speed is defined by how fast we can find the correct database data to geocode
the queried photograph. Image processing requires expensive matrix operations, as
every image pixel is relevant when computing visual features. Additionally, high
resolution images deliver better image features but also increase the
computational cost of extracting and matching them. However, the constant
evolution of hardware and the parallelization of image processing algorithms are
progressively breaking the barrier of real-time processing.</p>
      <p>Being scalable means that neither speed nor accuracy is hindered by the
growth of the geocoded database, which is not quite the case today. Assuming that
each SfM model contains millions of points, and that each point is related to at
least two image descriptors and associated 2D data, a massive amount of visual
data is required to support an image geocoding system. Consequently, querying the
geocoded database gets slower. Moreover, more information means a higher number
of similar features, which may confuse the image geocoding system into geocoding
photographs miles away from their correct location.</p>
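      <p>To make this scale concrete, consider the following back-of-the-envelope estimate; the model size and the SIFT-style 128-byte descriptors are illustrative assumptions, not measurements of any particular system.</p>
      <preformat><![CDATA[
```python
points = 5_000_000               # 3D points in a city-scale SfM model (assumed)
descriptors_per_point = 2        # each point is observed in at least two images
descriptor_bytes = 128           # one SIFT descriptor: 128 one-byte values (assumed)

coords_bytes = points * 3 * 4    # float32 XYZ coordinates per point
desc_bytes = points * descriptors_per_point * descriptor_bytes
total_gb = (coords_bytes + desc_bytes) / 1e9
print(f"~{total_gb:.2f} GB before any 2D data or index structures")  # ~1.34 GB
```
]]></preformat>
      <p>Even under these conservative assumptions, the descriptors alone dwarf the geometry, which is why descriptor-side compression is the natural target.</p>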
    </sec>
    <sec id="sec-5">
      <title>5 Research Question and Future Work</title>
      <p>Facing the scalability problem described in the previous section, in this
thesis we ask what the minimal scene representation of structure from motion is
that still allows a good geocoding rate.</p>
      <p>Our main objective will be to study and benchmark different state-of-the-art
SfM-based geocoding systems, and to enhance existing SfM compression strategies or
develop alternative compression methods. We want compression rates that maintain
geocoding speed and rate while allowing the geocoding system to scale to wider
areas. Also, rather than pursuing a perfect 100% geocoding rate, we are only
interested in avoiding degradation of this rate due to aggressive compression.</p>
      <p>
        We are aware of state-of-the-art research concerning SfM compression
[
        <xref ref-type="bibr" rid="ref4 ref6 ref8">4, 6, 8</xref>
        ], but rather than targeting the compression at a single geocoding engine, we
will generalize it to currently available engines. Since all image geocoding
methods work under the same assumptions (image features and 3D pose estimation),
we believe that this generalization is achievable.
      </p>
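      <p>One family of compression strategies we intend to benchmark keeps only a subset of points chosen so that each database image retains enough visible points, in the spirit of the K-cover heuristics in the literature above. The sketch below is a simplified illustration under our own data-layout assumptions, not the exact algorithm of any cited work.</p>
      <preformat><![CDATA[
```python
def greedy_k_cover(visibility, k, budget):
    """Greedy K-cover point selection: keep at most `budget` points so that
    each database image retains up to `k` of its visible points.
    `visibility[i]` is the set of image ids observing 3D point i."""
    need = {img: k for obs in visibility for img in obs}
    chosen, pool = [], set(range(len(visibility)))
    while pool and len(chosen) < budget:
        # Pick the point that satisfies the most still-unmet coverage.
        gain = {i: sum(need[img] > 0 for img in visibility[i]) for i in pool}
        best = max(pool, key=gain.get)
        if gain[best] == 0:
            break                # every image is already covered k times
        for img in visibility[best]:
            if need[img] > 0:
                need[img] -= 1
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy model: 4 points seen by 3 images; k = 1 coverage keeps 2 of the 4 points.
kept = greedy_k_cover([{0, 1}, {0}, {1}, {2}], k=1, budget=10)
```
]]></preformat>
      <p>Benchmarking will have to check how aggressively the budget can be cut before the geocoding rate, rather than just the model size, starts to suffer.</p>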
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Robertson</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cipolla</surname>
            <given-names>R</given-names>
          </string-name>
          (
          <year>2004</year>
          )
          <article-title>An Image-Based System for Urban Navigation</article-title>
          .
          <source>Proceedings of the British Machine Vision Conference (BMVC)</source>
          , doi: 10.5244/C.18.84
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Werner</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kessel</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marouane</surname>
            <given-names>C</given-names>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>Indoor positioning using smartphone camera</article-title>
          ,
          <source>International Conference on Indoor Positioning and Indoor Navigation (IPIN)</source>
          , doi: 10.1109/IPIN.2011.6071954
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wang</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            <given-names>W</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>iNavigation: an image based indoor navigation system</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          , doi: 10.1007/s11042-013-1656-9
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Li</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snavely</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huttenlocher</surname>
            <given-names>DP</given-names>
          </string-name>
          (
          <year>2010</year>
          )
          <article-title>Location Recognition using Prioritized Feature Matching</article-title>
          .
          <source>ECCV'10 Proceedings of the 11th European Conference on Computer Vision, doi: 10.1007/978-3-642-15552-9_57</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Schindler</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szeliski</surname>
            <given-names>R</given-names>
          </string-name>
          (
          <year>2007</year>
          )
          <article-title>City-Scale Location Recognition</article-title>
          .
          <source>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , doi: 10.1109/CVPR.2007.383150
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Irschara</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zach</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frahm</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bischof</surname>
            <given-names>H</given-names>
          </string-name>
          (
          <year>2009</year>
          )
          <article-title>From structure-from-motion point clouds to fast location recognition</article-title>
          .
          <source>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , doi: 10.1109/CVPR.2009.5206587
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Huitl</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroth</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hilsenbeck</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schweiger</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinbach</surname>
            <given-names>E</given-names>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>TUMindoor: An extensive image and point cloud dataset for visual indoor localization and mapping</article-title>
          .
          <source>19th IEEE International Conference on Image Processing (ICIP)</source>
          , doi: 10.1109/ICIP.2012.6467224
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cao</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snavely</surname>
            <given-names>N</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Minimal Scene Descriptions from Structure from Motion Models</article-title>
          .
          <source>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , doi: 10.1109/CVPR.2014.66
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>