<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Pattern Recognition to Place Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sven Eberhardt</string-name>
          <email>sven2@uni-bremen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Kluth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Zetzsche</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kerstin Schill</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cognitive Neuroinformatics, University of Bremen</institution>
          ,
          <addr-line>Enrique-Schmidt-Straße 5, 28359 Bremen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>44</lpage>
      <abstract>
        <p>What are the ingredients required for vision-based place recognition? Pattern recognition models for localization must fulfill invariance requirements different from those of object recognition. We propose a method to evaluate the suitability of existing image processing techniques by testing their outputs against these invariances. The method is applied to several holistic models and one local model. We generalize our findings and identify the model properties of locality, spatial configuration and generalization as key factors for applicability to localization tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>visual</kwd>
        <kwd>model</kwd>
        <kwd>pattern recognition</kwd>
        <kwd>localization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Although the concept of place is essential to the way humans represent and interact with spatial environments, many of its determinants are not yet completely understood. One important question is what kind of information and what computations can be used to determine a specific place. Among the different types of input suitable for this purpose, pictorial information has a particularly high potential. In biological terms, the investigation of place cells, for example, indicates the importance of visual cues for the robust localization of rodents [<xref ref-type="bibr" rid="ref1">1</xref>].
      </p>
      <p>
        However, the exact processing mechanisms that can enable successful vision-based localization are still unclear. In particular, it has to be understood how the classical determinants of pattern recognition systems, invariance and generalization properties, relate to the problem of localization. Invariance properties seem to play a crucial role since, for example, the activation of a place cell is primarily determined by the animal's location, whereas it is independent of the orientation and of other conditions like illumination. These are typical invariance properties. It may thus be assumed that the classic invariance principles attributed to human vision, and the corresponding computer vision approaches, can also be applied to the problem of localization (or place recognition). In this paper, we will argue that this is not necessarily the case, and that successful localization requires specific properties that can be in direct opposition to those underlying other basic visual capabilities, such as object recognition. For this, we will first introduce a basic framework that enables the description and differentiation of image processing techniques with respect to their applicability to localization as compared to, e.g., object recognition. We will then discuss how some established image processing techniques can be described in terms of the suggested framework. This will then motivate an investigation of the suitability of some of these techniques for the specific problem of localization, or place recognition. In particular, we will investigate whether one of the most successful models of visual object recognition, the HMAX model [<xref ref-type="bibr" rid="ref2">2</xref>], can also be used for the task of vision-based localization.
      </p>
      <p><bold>1.1 Invariance in Place Recognition</bold></p>
      <p>One of the difficulties in place recognition from visual input is that even minor changes in the observer's orientation or location, as well as unrelated changes such as variations in illumination, can cause vast changes of the retinal input. Successful models for place identification should provide output that is invariant to such small changes in the observer's view. Although this is a requirement which is shared with object recognition models, there are some fundamental differences in which kind of invariance is desired.</p>
      <p>While changes in scale, position and occlusion of elements in a scene are often irrelevant in the context of object recognition, they correspond to movement of the observer and should elicit changes in the output of place recognition models. On the other hand, spatial shifting of a scene as a whole corresponds to rotation of the observer. Similarly, rotations within the viewing plane correspond to tilting of the viewer's head. A place detector that mimics the behavior of place cells should be invariant to such rotations.</p>
      <p>[Fig. 1: The conceptual space of pattern recognition models. (a) The dimensions span template, instance, class and distribution, from landmark to scene. (b) Placement of the tested models along these dimensions: pixels, textons, 1 wavelets/S1, 2 C1, 3 C2, 4 C3.]</p>
      <p>
        Given these fundamental differences, can models for object recognition be used for place identification at all? The large number of existing pattern recognition algorithms makes testing this hypothesis a tedious task. We therefore suggest categorizing algorithms into a conceptual space with three dimensions [<xref ref-type="bibr" rid="ref3">3</xref>] and seek to find a systematic correspondence between the placement of models within these dimensions and their applicability to place identification.
      </p>
      <p>
        The first dimension is locality. A local approach processes image data from selected image regions, whereas a global approach always takes the whole image into account. Naturally, local approaches need a detection mechanism to determine regions of interest (ROIs). Such mechanisms may rely on low-level image data such as curvature [<xref ref-type="bibr" rid="ref4">4</xref>], local brightness extrema [<xref ref-type="bibr" rid="ref5">5</xref>] or generalized features [<xref ref-type="bibr" rid="ref6">6</xref>]. Ideally, the detection mechanism picks out informative image regions containing objects or landmarks useful for solving the given task.
      </p>
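      <p>As an illustration of such a low-level ROI detector, local brightness extrema can be found with a simple non-maximum-suppression filter. The sketch below uses NumPy/SciPy; the function name, window size and threshold are our own illustrative choices, not taken from the cited detectors.</p>
      <preformat>
```python
import numpy as np
from scipy.ndimage import maximum_filter

def brightness_extrema_rois(image, size=15, threshold=0.6):
    """Return (row, col) coordinates of local brightness maxima.

    A pixel is a candidate ROI centre if it equals the maximum of its
    size x size neighbourhood and exceeds an absolute threshold.
    """
    local_max = maximum_filter(image, size=size)
    is_peak = np.logical_and(np.equal(image, local_max),
                             np.greater(image, threshold))
    return np.argwhere(is_peak)

# Tiny synthetic example: one bright blob on a dark background.
img = np.zeros((32, 32))
img[10, 20] = 1.0
print(brightness_extrema_rois(img))  # -> [[10 20]]
```
      </preformat>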
      <p>
        The second dimension measures the invariance to changes in spatial configuration. Algorithms that are sensitive to spatial layout match templates of stored objects against the input image, but fail to generalize if object components are rearranged or scrambled. On the other hand, the largest invariance to spatial layout is provided by models relying on image statistics [<xref ref-type="bibr" rid="ref7">7</xref>] or on bags of features [<xref ref-type="bibr" rid="ref8">8</xref>]. The class of HMAX models [<xref ref-type="bibr" rid="ref2">2</xref>] follows an intermediate approach in which invariance to feature locations is increased step by step in a multi-layer hierarchy.
      </p>
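      <p>The step-by-step increase of spatial invariance in such hierarchies can be illustrated with local max pooling, the kind of operation used in the C layers of HMAX. This is a schematic sketch, not the published implementation: small shifts of a feature inside a pooling window leave the pooled output unchanged.</p>
      <preformat>
```python
import numpy as np

def max_pool(responses, pool=4):
    """Pool filter responses over non-overlapping pool x pool windows,
    keeping only the maximum: each pooling stage adds local position
    invariance, since shifts inside a window do not change the output."""
    h, w = responses.shape
    h2, w2 = h - h % pool, w - w % pool
    blocks = responses[:h2, :w2].reshape(h2 // pool, pool, w2 // pool, pool)
    return blocks.max(axis=(1, 3))

# A feature at (5, 5) and the same feature shifted by one pixel
# land in the same pooling window, so the pooled maps are identical.
a = np.zeros((16, 16)); a[5, 5] = 1.0
b = np.zeros((16, 16)); b[5, 6] = 1.0
print(np.array_equal(max_pool(a), max_pool(b)))  # -> True
```
      </preformat>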
      <p>
        The third dimension describes how well a model generalizes among several instances of a class. Most local descriptor-based algorithms such as [<xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>] only store patterns specific to the particular instance and view of an object, so multiple patterns are required to describe a class. Usually, category-level generalization can be achieved by clustering specific descriptors into broader categories [<xref ref-type="bibr" rid="ref9">9</xref>].
      </p>
      <p>
        These dimensions describe key attributes that determine whether a model is suitable for place identification. The first dimension, locality, is certainly useful for determining place. If each detected feature is attributed to a position, the relation of these positions provides valuable information for determining the position of an observer [<xref ref-type="bibr" rid="ref10">10</xref>]. For the second dimension, spatial configuration, the requirements are not so clear. On the one hand, changes in spatial configuration result from changes in the position of an observer, and invariance to such changes is not desired. On the other hand, invariance to small changes in configuration increases the robustness of feature detection, and could improve detection when scenes are presented under slightly different conditions. The third dimension, generalization, is probably required to some extent to map different views of the same place onto the same class. Too much generalization is not desirable, because it might project locations that merely look similar onto the same place.
      </p>
      <p>
        In the following study, we investigated the invariance properties of models that vary within the second and third dimension. In particular, we varied the two parameters of location and orientation. We judged algorithms by how invariant they remained to changes in orientation compared to the variation induced by changes in location. We tested two holistic models, wavelet-like histograms [<xref ref-type="bibr" rid="ref7">7</xref>] and texture descriptors called `textons' [<xref ref-type="bibr" rid="ref11">11</xref>]. In comparison, we chose the HMAX model as a hierarchical model, of which we analyzed each model step separately. Finally, performance on raw pixel values was checked as a baseline.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>We developed a test setup to evaluate the applicability of pattern recognition methods for place identification. We recorded input images x^L_alpha at n_L = 10 different locations L, and at n_alpha = 181 different observer rotation angles alpha spanning 180 degrees of rotation. If a model S is applied to two input images, the dissimilarity of the output vectors can be written as their Euclidean distance d_S:</p>
      <p>d_S(x^{L1}_{alpha1}, x^{L2}_{alpha2}) := || S(x^{L1}_{alpha1}) - S(x^{L2}_{alpha2}) ||_2   (1)</p>
      <p>We measure the invariance to rotation Ĩ^S_rot(phi) of a model by averaging, over all locations, the dissimilarity between the view rotated by phi and the reference view:</p>
      <p>Ĩ^S_rot(phi) := (1/n_L) Σ_L d_S(x^L_phi, x^L_0)   (2)</p>
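      <p>The distance (1) and the rotation-invariance measure (2) can be written down directly. The sketch below assumes the model outputs are stored in a hypothetical NumPy array out[L, alpha] of feature vectors; the data layout and names are illustrative, not the authors' code.</p>
      <preformat>
```python
import numpy as np

def d_S(s1, s2):
    """Eq. (1): Euclidean distance between two model output vectors."""
    return np.linalg.norm(s1 - s2)

def rot_invariance(out, phi):
    """Eq. (2): dissimilarity between the view rotated by phi (index
    offset) and the reference view, averaged over all locations."""
    n_L = out.shape[0]
    return sum(d_S(out[L, phi], out[L, 0]) for L in range(n_L)) / n_L

# Toy data: 10 locations, 181 rotation angles, 32-dimensional outputs.
rng = np.random.default_rng(0)
out = rng.normal(size=(10, 181, 32))
print(rot_invariance(out, 0))  # identical views -> 0.0
```
      </preformat>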
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>A low value of Ĩ^S_rot(phi) means that the output of the model is highly invariant to the given rotation phi. In order to measure usability for place identification, we need to put this value in relation to the variation of the model outputs achieved by changing the place. We define a relative orientation invariance measure I^S_rot(phi) as:</p>
      <p>I^S_rot(phi) := [ (1/n_L) Σ_L d_S(x^L_phi, x^L_0) ] / [ (1/n_L) Σ_{L'} min_{L'' ≠ L'} d_S(x^{L'}_0, x^{L''}_0) ]   (3)</p>
      <p>Values of I^S_rot larger than 1 for a given angle mean that the model produces more dissimilar outputs under rotation by that angle than it would by switching the place. We therefore define the maximum angle of invariance I^S_max as the largest angle for which the relative measure stays below 1:</p>
      <p>I^S_max := max { |phi| : I^S_rot(phi) &lt; 1 }   (4)</p>
      <p>Large values of I^S_max stand for good invariance to rotation compared to changes in place, which marks the model as suitable for place recognition.</p>
      <p>We applied this method to the raw input pixels, as well as to the outputs of the texton algorithm, the wavelet descriptors and the HMAX model at various stages. For the HMAX model, we were particularly interested in how the rotational invariance properties vary across layers. We extracted values at the Gabor filter layer (S1), at the first and second local invariance layers (C1, C2), and at the final, global invariance layer (C3). At each layer, a maximum of 500 features was extracted. For non-global layers, a random subsampling over features and locations was done. The same features at the same locations were extracted for all images.</p>
      <p>We find that, in accordance with our predictions, pattern recognition models display vastly different performances when investigated for their applicability to place identification. Relative orientation invariance measures I^S_rot(phi) for the holistic models (textons and wavelets) as well as for raw pixels are shown in fig. 2a. The maximum angle of invariance for texton outputs (I^Texton_max = 7°) is actually lower than for raw pixels (I^Pix_max = 44°), which shows that, compared to raw pixels, these models are more invariant to changes in location than to changes in rotation. The invariance for the wavelet transformation is only slightly lower than for pixels (I^Wavelets_max = 38°).</p>
      <p>For HMAX, the performance of each layer is shown in fig. 2b. Again, I^S_max sinks below the raw-pixel performance, down to 24° for the successive layers C1 and C2, and further down to 14° for the final layer C3. This decay of performance in the higher stages of the model shows that invariance to place increases faster than invariance to orientation.</p>
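      <p>The relative measure (3) and the maximum angle of invariance (4) follow the same pattern. As before, this is a sketch over a hypothetical NumPy array out[L, alpha] of model output vectors, not the authors' code; angles are treated as array indices for simplicity.</p>
      <preformat>
```python
import numpy as np

def rel_rot_invariance(out, phi):
    """Eq. (3): rotation dissimilarity divided by the mean distance
    to the closest other place."""
    n_L = out.shape[0]
    num = np.mean([np.linalg.norm(out[L, phi] - out[L, 0])
                   for L in range(n_L)])
    den = np.mean([min(np.linalg.norm(out[L, 0] - out[M, 0])
                       for M in range(n_L) if M != L)
                   for L in range(n_L)])
    return num / den

def max_angle_of_invariance(out):
    """Eq. (4): the largest angle whose relative invariance stays below 1."""
    angles = range(out.shape[1])
    ok = [phi for phi in angles
          if np.less(rel_rot_invariance(out, phi), 1.0)]
    return max(ok) if ok else 0

# Toy data: 5 locations, 20 angles, 16-dimensional outputs.
rng = np.random.default_rng(1)
out = rng.normal(size=(5, 20, 16))
print(max_angle_of_invariance(out))
```
      </preformat>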
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>We have investigated the question of how a place can be characterized in terms of visual properties. In particular, we have investigated which invariances are required to uniquely determine a place and how these are related to the invariance properties commonly attributed to visual processing. We have evaluated different models, asking how well they are able to generalize across all possible views of a place while still being selective enough to guarantee a unique localization.</p>
      <p>We have shown that the invariance requirements for place recognition are not necessarily met by models popular for object recognition, such as texton outputs or HMAX. Further, we found that higher layers in the hierarchy of the model, which correspond to more complex features and higher levels of invariance to spatial configuration, lead to a reduced level of invariance to rotation. This yields the hypothesis that invariance to spatial layout, i.e. the second dimension of our conceptual space in fig. 1a, is a detrimental ingredient for invariant place recognition in general. However, since we have explored only a small part of the space of approaches, a more comprehensive study remains to be done.</p>
      <p>How much generalization is needed to perform localization? Being able to generalize across different views of the same location is certainly helpful. However, if generalization leads to higher invariance across different locations, as happens in the higher stages of the HMAX model in our case, reliable place identification performance decreases.</p>
      <p>
        Interestingly, [<xref ref-type="bibr" rid="ref12">12</xref>] proposes a hierarchical model architecture for place cells that is very similar to that of the HMAX model. In this model, cells are repeated across locations and pooled over increasingly large receptive fields in the higher stages. The main difference to HMAX is that the features are explicitly trained to be invariant to rotations using slow feature analysis. This shows that the invariance properties wired into a model greatly affect its suitability for localization, as long as the learning stage is tuned to generalize across views, but not across places.
      </p>
      <p>These results suggest that a universal vision system for both object recognition and localization is unfeasible. While some of the processing mechanisms may be shared between architectures for the two tasks, specific mechanisms are required to uniquely determine a place.</p>
      <p>Acknowledgement. This work was supported by the DFG, SFB/TR8 Spatial Cognition, project A5-[ActionSpace].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>O'Keefe</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Dostrovsky</surname>, <given-names>J.</given-names></string-name>:
          <article-title>The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat</article-title>
          .
          <source>Brain research 34(1)</source>
          (
          <year>1971</year>
          )
          <fpage>171</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Serre</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A feedforward architecture accounts for rapid categorization</article-title>
          .
          <source>PNAS</source>
          <volume>104</volume>
          (
          <issue>15</issue>
          ) (
          <year>2007</year>
          )
          <fpage>6424</fpage>–<lpage>6429</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reineking</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zetzsche</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schill</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>From visual perception to place</article-title>
          .
          <source>Cognitive Processing</source>
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>351</fpage>–<lpage>354</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zetzsche</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name><surname>Barth</surname>, <given-names>E.</given-names></string-name>:
          <article-title>Fundamental limits of linear filters in the visual processing of two-dimensional signals</article-title>
          .
          <source>Vision Research</source>
          <volume>30</volume>
          (
          <year>1990</year>
          )
          <fpage>1111</fpage>–<lpage>1117</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>IJCV</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ) (
          <year>2004</year>
          )
          <fpage>91</fpage>–<lpage>110</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kweon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Biologically motivated perceptual feature: Generalized robust invariant feature</article-title>
          . In Narayanan, P.,
          <string-name>
            <surname>Nayar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name><surname>Shum</surname>, <given-names>H.Y.</given-names></string-name>, eds.:
          <source>Computer Vision ACCV 2006. Volume 3852 of Lecture Notes in Computer Science</source>
          . Springer Berlin / Heidelberg (
          <year>2006</year>
          )
          <fpage>305</fpage>–<lpage>314</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modeling the shape of the scene: A holistic representation of the spatial envelope</article-title>
          .
          <source>IJCV</source>
          <volume>42</volume>
          (
          <issue>3</issue>
          ) (
          <year>2001</year>
          )
          <fpage>145</fpage>–<lpage>175</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name><surname>Darrell</surname>, <given-names>T.</given-names></string-name>:
          <article-title>Efficient image matching with distributions of local invariant features</article-title>
          .
          <source>In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2005</year>
          .
          <article-title>CVPR 2005</article-title>
          . Volume
          <volume>2</volume>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Griffin</surname>, <given-names>G.</given-names></string-name>,
          <string-name>
            <surname>Holub</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Caltech-256 object category dataset</article-title>
          .
          <source>Technical Report 7694</source>
          , California Institute of Technology (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Se</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Vision-based global localization and mapping for mobile robots</article-title>
          .
          <source>IEEE Transactions on Robotics</source>
          <volume>21</volume>
          (
          <issue>3</issue>
          ) (
          <year>2005</year>
          )
          <fpage>364</fpage>–<lpage>375</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Renninger</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name><surname>Malik</surname>, <given-names>J.</given-names></string-name>:
          <article-title>When is scene identification just texture recognition?</article-title>
          <source>Vision Research</source>
          <volume>44</volume>
          (
          <issue>19</issue>
          ) (
          <year>2004</year>
          )
          <fpage>2301</fpage>–<lpage>2311</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Franzius</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sprekeler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiskott</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Slowness and sparseness lead to place, head-direction, and spatial-view cells</article-title>
          .
          <source>PLoS Comput Biol</source>
          <volume>3</volume>
          (
          <issue>8</issue>
          ) (
          <year>2007</year>
          ) e166
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>