<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Low-level global features for vision-based localization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sven Eberhardt</string-name>
          <email>sven2@uni-bremen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Zetzsche</string-name>
          <email>zetzsche@informatik.uni-bremen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cognitive Neuroinformatics, Universität Bremen</institution>
          ,
          <addr-line>Bibliothekstraße 1, 28359 Bremen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Vision-based self-localization is the ability to derive one's own location from visual input only without knowledge of a previous position or idiothetic information. It is often assumed that the visual mechanisms and invariance properties used for object recognition will also be helpful for localization. Here we show that this is neither logically reasonable nor empirically supported. We argue that the desirable invariance and generalization properties differ substantially between the two tasks. Application of several biologically inspired algorithms to various test sets reveals that simple, globally pooled features outperform the complex vision models used for object recognition, if tested on localization. Such basic global image statistics should thus be considered as valuable priors for self-localization, both in vision research and robot applications.</p>
      </abstract>
      <kwd-group>
        <kwd>localization</kwd>
        <kwd>visual features</kwd>
        <kwd>spatial cognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The ability to make reliable assumptions about their own position in the
world is of critical importance for biological as well as for man-made systems
such as mobile robots. A number of sensors can be used and combined to achieve
this feat (see e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). Among these, vision is of particular importance. Although
idiothetic information such as acceleration, velocity and orientation
measurements can be used for dead reckoning, visual realignment can be essential to
avoid the accumulation of errors in path integration. Furthermore, allothetic
information in the form of visual input can be used for direct localization. For example,
many place cells in the hippocampus can be driven by visual input alone [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
But how exactly can vision support localization?
      </p>
      <p>
        The default hypothesis would be that this is achieved by just the same
established principles of visual processing used for other spatial tasks like pattern
discrimination or object recognition. The corresponding standard view of the
visual system assumes that the main task of the system is invariant object
recognition, and that this is achieved by a feed-forward system of feature extraction
in the form of a hierarchy of neural layers with increasing levels of abstraction and
of spatial granularity [
        <xref ref-type="bibr" rid="ref16 ref5 ref6">5, 6, 16</xref>
        ]. This standard model is supported by numerous
behavioral experiments and electroencephalography recordings, in particular by
experiments showing that human discrimination between categories in object
and scene classification is achieved as early as 150 ms after stimulus onset (for
an overview see [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]).
      </p>
      <p>
        In this paper, we look at vision-based self-localization from static allothetic
input alone and formulate it as a classification problem: a classifier is trained on a set
of example images per location, with the location as the label, and the task is to attribute
a new image to one of the learned locations. Performance
is evaluated as the percentage of correctly classified images, i.e. we disregard any metric
information about the distance between different locations and simply treat each location
as a class and all views from a location as instances of that class. From this
perspective, the localization task is comparable to object classification problems
such as the one posed by the Caltech-101 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] dataset.
      </p>
      <p>
        Generally, the features on which a classifier operates should be invariant to
changes within a class but selective to changes between classes. Models designed
for object recognition provide varying degrees of translation and scale
invariance [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For example, the HMax features used for an animal detection task
performed by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] are designed to provide translation and scale invariance at
local and global levels because animals may occur at different positions, sizes
and 3D rotations in images.
      </p>
      <p>However, whether object recognition and vision-based localization are really
similar problems and can thus be solved with the same architecture has, to our
knowledge, never been investigated systematically. In this paper, we ask whether
visual features that are optimal for one task may be unsuited for the other and
vice versa. To answer this question, we test how well feature outputs of a number
of biologically inspired low-level vision models are able to discriminate among
large numbers of locations and compare the results with benchmark performance
on several object and scene recognition datasets.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Methods</title>
      <p>
        Streetview dataset. We use a novel dataset sampled from
Google Street View [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Street View has become popular as an outdoor dataset
of natural scenes for self-localization, 3D map reconstruction, text recognition
and image segmentation. Key advantages of this dataset are the
sheer amount of data available from many countries of the world and its preprocessing
in a standardized manner without bias toward object centering [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Caveats include
a bias to roads and populated areas as well as relatively poor image quality with
distorted edges and Google watermarks.
      </p>
      <p>204 locations are selected by picking random points in the sampling region
until a road for which Street View data is available is found within a range of 50 m. For
each location, a full 360° yaw rotation is sampled in intervals of 10°, for a total of 36 pictures
per location. The field of view is 90° and pictures are stored as grayscale
images of 512×512 pixels. The Streetview dataset is sampled from random
locations in France (SV-Country). To test localization on several distance scales,
we generate two additional datasets from different sampling regions. For SV-City,
we sample locations in Berlin city center only. SV-World consists of imagery from
all countries where Street View was available.</p>
      <sec id="sec-2-1">
        <title>Streetview</title>
      </sec>
      <sec id="sec-2-2">
        <title>Caltech-101</title>
      </sec>
      <sec id="sec-2-3">
        <title>Scene-15</title>
      </sec>
      <sec id="sec-2-4">
        <title>Animal</title>
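        <p>The view sampling scheme of the Streetview dataset can be sketched as follows. This is only an illustration of the arithmetic (yaw steps and resulting image count); the function name is hypothetical and no Street View API is queried.</p>
        <preformat>
```python
def view_headings(step_deg=10):
    """Yaw angles (degrees) covering one full rotation in equal steps."""
    return list(range(0, 360, step_deg))

headings = view_headings()              # 0, 10, ..., 350
n_locations = 204                       # number of locations stated in the text
n_images = n_locations * len(headings)  # images in one Streetview dataset
```
        </preformat>
        <p>With 10° steps this gives 36 views per location, so one dataset of 204 locations contains 7344 images.</p>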
        <p>
          Benchmark datasets. To compare localization with object recognition and
scene classification tasks, we also use several established categorization databases.
The first dataset is Caltech-101 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], a collection of 101
object categories containing between 31 and 800 images each. Categories are
diverse and include specific animals, musical instruments, food categories, vehicles
and more. Image contents vary between isolated objects, comic depictions and
scenes containing the object in use. The dataset has been used as a benchmark
for object recognition by a number of algorithms in the past, including an
implementation of HMax and Spatial Pyramids. Caltech-101 is sometimes criticized
because low-level algorithms can perform relatively well on some categories due
to their very similar sample images [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. However, the large number of categories
alleviates this.
        </p>
        <p>
          For a scene classification test, we use Scene-15 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which is a dataset
comprised of photos of 15 different indoor and outdoor scene categories such as
kitchen, forest and highway. Each photo shows an open scene without any
objects close to the camera. Scene-15 has been mostly used to benchmark holistic
feature extraction models such as Gist and Spatial Pyramids.
        </p>
        <p>
          Finally, we include the Animal detection dataset from Serre et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], a
two-class object recognition dataset showing mostly non-urban
outdoor scenes both with and without animals.
        </p>
        <p>Models. We focus on low-level, biologically inspired models that produce a
fixed-size feature vector for each input image. For all models, we use the implementation
code supplied by the authors if available.</p>
        <p>
          Textons by Malik et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] are computed by applying a set of Gabor filters to an image, resulting in a
response vector for each pixel. The response vectors are clustered into 128 textons
and each pixel is assigned to the cluster with the smallest squared distance to its response
vector. The resulting output vector is a histogram of these texton assignments
over the whole image. Textons have been used for image segmentation
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] as well as scene classification [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>
          Gist is also termed the Spatial Envelope of a scene by Oliva et al. It consists
of the first few principal components of spectral components on a very coarse
grid (8x8) as well as on the whole image. Gist has shown strong categorization
performance on the Scene-15 dataset [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Spatial Pyramids, as described by Lazebnik et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], compute histograms of
low-level features over image regions of different sizes and concatenate them into
one large feature vector. The features used here are densely sampled SIFT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
descriptors. For better comparability with the other models, we omit the custom
histogram-matching support vector machine (SVM) kernel used by Lazebnik et al. in
favor of a linear kernel and regression. We test the full pyramid up to level 2
(SPyr2) as well as the outputs of the global histogram (SPyr0) only.
        </p>
        <p>
          HMax is a biologically motivated multi-layer feed-forward model designed to
mirror functionality found in the ventral stream of the primate visual cortex by Hubel
and Wiesel [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It is based on the Neocognitron [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and consists of alternating
layers of simple and complex cells. Simple cell layers match a dictionary of visual
patterns at all image locations and several scales, so units achieve selectivity to
certain patterns. Complex cell layers combine the outputs of simple cells over
a window of locations and scales to achieve location and scale invariance. In
this way, units of low layers have localized receptive fields and simple patterns,
while units of higher layers respond to more complex patterns and are more
translation and scale invariant. We use the CNS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] implementation of HMax
with parameter settings as chosen by Serre et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The full feature vector of
an image processed by HMax consists of randomly selected subsets of outputs
of the C1, C2, C2b and C3 layers. In order to determine the effects of increasing
invariance and matching to complex features, we also test performance when
using only the outputs of the C1, C2 and C3 layers, respectively.
        </p>
        <p>
          To test whether the task
can be solved with trivial, low-level features, the classifier is also run on a luminance
histogram and on a random subset of 2000 pixels from the images.
        </p>
        <p>
          Classification is done on a normalized feature set which has been reduced to
128 features per image by principal component analysis. On these features, we
perform regression with a linear kernel and leave-one-out cross validation to
determine the regression parameter using the GURLS package [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] for
MATLAB. Multi-class classification is performed by the one-versus-all rule. An equal
number of training samples is taken at random from each class and all
remaining elements are used for testing. Each run is repeated ten times with different
test splits to yield the reported performance average and a standard deviation.
Performance is defined as the percent correct averaged over all classes.
        </p>
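        <p>The classification procedure can be sketched in NumPy as follows. This is a simplified stand-in for the GURLS pipeline, not a reproduction of it: the function names, the synthetic data, the grid of regularization values and the use of the closed-form leave-one-out residual for linear regularized least squares are all assumptions made for illustration.</p>
        <preformat>
```python
import numpy as np

def pca_reduce(X, n_components):
    """Standardize features, then project onto the leading principal components."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T

def fit_rls_one_vs_all(X, labels, n_classes, lambdas):
    """Linear regularized least squares on +1/-1 targets, one column per class.

    The regularization strength is chosen by the closed-form leave-one-out error.
    """
    n, d = X.shape
    Y = -np.ones((n, n_classes))
    Y[np.arange(n), labels] = 1.0
    best = None
    for lam in lambdas:
        A = np.linalg.inv(X.T @ X + lam * n * np.eye(d))
        W = A @ X.T @ Y
        leverage = np.einsum("ij,jk,ik->i", X, A, X)   # diagonal of the hat matrix
        loo_residual = (Y - X @ W) / (1.0 - leverage)[:, None]
        err = np.mean(loo_residual ** 2)
        if best is None or best[0] > err:
            best = (err, W)
    return best[1]

def predict(X, W):
    """One-versus-all rule: pick the class with the largest output."""
    return np.argmax(X @ W, axis=1)

# Synthetic stand-in data: 5 classes, 40 samples each, 300 raw feature dimensions.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 5, 40, 300
centres = rng.normal(scale=3.0, size=(n_classes, dim))
labels = np.repeat(np.arange(n_classes), per_class)
features = centres[labels] + rng.normal(size=(labels.size, dim))

reduced = pca_reduce(features, 128)
# Equal number of random training samples per class; the rest is used for testing.
train = np.concatenate([rng.permutation(np.flatnonzero(labels == c))[:20]
                        for c in range(n_classes)])
test = np.setdiff1d(np.arange(labels.size), train)
W = fit_rls_one_vs_all(reduced[train], labels[train], n_classes,
                       lambdas=[1e-3, 1e-2, 1e-1, 1.0])
per_class_acc = [np.mean(predict(reduced[test][labels[test] == c], W) == c)
                 for c in range(n_classes)]
accuracy = 100.0 * np.mean(per_class_acc)   # percent correct averaged over classes
```
        </preformat>
        <p>Repeating the split with different random seeds and averaging the resulting accuracies would give a mean and standard deviation in the manner described above.</p>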
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>All algorithms achieve between 28 and 76 percent correct on
our dataset (Figure 2a). Performance ranges are similar to those found in the
benchmark sets, which shows that our dataset is of comparable difficulty.
Despite this similarity in difficulty, we find that Texton
features rank highest on all tasks of the Streetview dataset, while they
rank lowest on all other datasets. In particular, we do not observe this effect on
the Scene-15 dataset, which hints that the requirements for scene classification
are quite different from those of a true self-localization task. The strong performance of
Textons is especially surprising, because they are the simplest
features in comparison with the outputs of HMax, Spatial Pyramids and Gist, and
they also have the fewest feature dimensions.</p>
      <p>[Figure 2: bar plots of performance (0 to 80 percent correct) for the Gist, HMax, SPyr2 and Texton features; the dataset axis lists Streetview, Animal, Caltech-101 and Scene-15, and one panel is labelled SV-World.]</p>
      <p>Spatial pyramids rank second on the performance scale. However, a test on
the base level pyramid features (SPyr0 on Figure 2b) reveals that the
performance at level zero of the pyramid exceeds that of the full pyramid at level two.
Since the base level is just a histogram over densely sampled SIFT descriptors,
classification actually happens based on a global histogram similar to that of the
Textons. This means that any information about spatial arrangement of features
is actually detrimental to self-localization performance.</p>
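      <p>The difference between the level-0 and level-2 representations can be made concrete with a small sketch. It assumes that each densely sampled descriptor has already been quantized to a visual-word index; the vocabulary size of 200, the random code map and the omission of per-level weights are simplifications for illustration and not the configuration used in the experiments.</p>
      <preformat>
```python
import numpy as np

def pyramid_histogram(codes, n_words, max_level):
    """Concatenate visual-word histograms over 1x1, 2x2, 4x4, ... grids.

    codes: (H, W) integer map of visual-word indices, one per sampled descriptor.
    """
    h, w = codes.shape
    parts = []
    for level in range(max_level + 1):
        cells = 2 ** level
        rows = np.array_split(np.arange(h), cells)
        cols = np.array_split(np.arange(w), cells)
        for r in rows:
            for c in cols:
                cell = codes[np.ix_(r, c)].ravel()
                parts.append(np.bincount(cell, minlength=n_words))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
codes = rng.integers(0, 200, size=(32, 32))   # stand-in for quantized dense SIFT
spyr0 = pyramid_histogram(codes, 200, max_level=0)   # global histogram only
spyr2 = pyramid_histogram(codes, 200, max_level=2)   # levels 0, 1 and 2
mirrored0 = pyramid_histogram(codes[:, ::-1], 200, max_level=0)
```
      </preformat>
      <p>The level-0 vector is unchanged when the image is mirrored, since it ignores where a visual word occurs, whereas the higher pyramid levels encode coarse position and therefore change with the spatial arrangement.</p>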
      <p>These results might suggest that the task is too easy, in the sense that low-level
features are sufficient to achieve high performance. However, tests on global
luminance histograms as well as on random image pixels show low performance near
chance level (Figure 2c). In that sense, our self-localization dataset is harder than
the benchmark datasets, for which 8-16% of all test samples could be classified
based on raw pixel data alone.</p>
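      <p>The two trivial baselines can be sketched as below. The number of histogram bins is not stated in the text, so the value of 64 is an assumption, as are the function names and the random stand-in image.</p>
      <preformat>
```python
import numpy as np

def luminance_histogram(image, n_bins=64):
    """Histogram of grey values in [0, 1], normalized to sum to one."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()

def random_pixel_features(image, indices):
    """Grey values at a fixed set of flat pixel indices."""
    return image.ravel()[indices]

rng = np.random.default_rng(0)
image = rng.random((512, 512))          # stand-in for one grayscale image
# The pixel subset is drawn once and reused for every image.
indices = rng.choice(image.size, size=2000, replace=False)
lum = luminance_histogram(image)
pix = random_pixel_features(image, indices)
```
      </preformat>
      <p>Both feature vectors would then be passed through the same classification procedure as the model outputs.</p>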
      <p>Our findings generalize along different image sampling scales at SV-City,
SV-Country and SV-World level (Figure 2d). Performance is higher at larger
sampling scales, because locations are more different on a world scale than on a
city scale. However, the performance order among different models remains the
same.
</p>
    </sec>
    <sec id="sec-4">
      <title>4 Discussion</title>
      <p>
        The results show quite clearly that model performance is highly task-dependent
and that no universal features are optimal for every vision-based task.
The main reason for this finding is that there are key differences in the invariance
properties required for self-localization compared to those inherent to object or
scene classification [
        <xref ref-type="bibr" rid="ref2 ref20">2, 20</xref>
        ].
      </p>
      <p>While object recognition needs to be tolerant to changes in scale and rotation,
self-localization does not (see Figure 3). Similarly, object recognition needs to
be invariant to some feature rearrangements that occur when the object is seen
from different angles. For self-localization, invariance to such rearrangements
may be unwanted because if you see an object from a different angle, you are
likely standing at a different position.</p>
      <p>1 Photos: © Stephen &amp; Claire Farnsworth via flickr, license CC-BY-NC. Map: Google
Maps © Google Inc.</p>
      <p>
        Concerning these invariances, HMax has both local translation and scale
invariance built into the model. Thus it is not surprising that Streetview
classification performance on these features is relatively poor. The differing invariance
requirements also explain why neither Gist nor the pyramid structure of the
spatial pyramid model could show strong performance on the dataset although
both algorithms have been established for scene classification tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Both
models include features that are not completely location invariant but encode
the position in the image on a very coarse scale.
      </p>
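      <p>How pooling over positions produces local translation invariance can be illustrated with a one-dimensional toy example. This is only a sketch of the simple-cell and complex-cell principle, not the HMax implementation; the template, signal length and pooling window are arbitrary choices.</p>
      <preformat>
```python
import numpy as np

def simple_layer(signal, template):
    """S-like stage: match one template at every position (valid cross-correlation)."""
    return np.correlate(signal, template, mode="valid")

def complex_layer(responses, pool):
    """C-like stage: maximum over non-overlapping windows of `pool` positions."""
    usable = (len(responses) // pool) * pool
    return responses[:usable].reshape(-1, pool).max(axis=1)

template = np.array([1.0, 2.0, 1.0])
signal = np.zeros(19)
signal[2:5] = template                 # pattern starting at position 2
shifted = np.roll(signal, 2)           # same pattern, two positions to the right

s_a = simple_layer(signal, template)
s_b = simple_layer(shifted, template)
c_a = complex_layer(s_a, 8)
c_b = complex_layer(s_b, 8)
```
      </preformat>
      <p>The simple-layer responses differ between the two inputs, while the pooled responses are identical as long as the shift stays within one pooling window. The pooled output can no longer tell the two inputs apart, which is helpful for recognizing an object but discards information that may distinguish two viewpoints.</p>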
      <p>Classifying scenes in datasets like Scene-15 might actually be closer to a task
like sorting photos, where photographers have a certain bias in how types of
scenes are best portrayed and reflect that in the spatial arrangement of image
features. Scene classification algorithms like Gist can pick up on that common
structure and use it for classification. However, when images of locations are
recorded at random, unbiased angles, this approach breaks down.</p>
      <p>
        Although salient features are believed to be advantageous for localization
[
        <xref ref-type="bibr" rid="ref17 ref19">17, 19</xref>
        ], we also find that the performance on complex SIFT descriptors is lower
than on the simpler Textons. This is probably due to their high selectivity
to particular objects, so that they do not generalize well to other, similar
objects present in other views from the same location.
      </p>
      <p>
        It appears surprising that Texton features, which have been designed for
image segmentation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], perform so well on a localization task. The reason seems
to be that – among the models tested – they provide the best tradeoff between
specificity to features present at individual locations and invariance to different
views from the same location. The strong correlation of simple, global features
with location suggests that very basic histogram features can be used as priors
for self-localization algorithms, for example in mobile robots, instead of relying only on
geometric relations between complex features. It also suggests that it might
be worthwhile to check whether biological systems make use of such features to
determine their own location.
      </p>
      <p>Acknowledgments. This work was supported by DFG, SFB/TR8 Spatial
Cognition, project A5-[ActionSpace].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Google Street View, http://google.com/streetview</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Eberhardt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kluth</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zetzsche</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schill</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>From pattern recognition to place identification</article-title>
          . In: Spatial cognition,
          <source>international workshop on place-related knowledge acquisition research</source>
          . pp.
          <fpage>39</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories</article-title>
          .
          <source>In: IEEE. CVPR</source>
          <year>2004</year>
          ,
          <article-title>workshop on generative model-based vision</article-title>
          . vol.
          <volume>106</volume>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
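          <string-name>
            <surname>Filliat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meyer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :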
          <article-title>Map-based navigation in mobile robots: A review of localization strategies</article-title>
          .
          <source>Cognitive systems research 4(4)</source>
          ,
          <fpage>243</fpage>
          -
          <lpage>282</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fukushima</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</article-title>
          .
          <source>Biological Cybernetics</source>
          <volume>36</volume>
          ,
          <fpage>193</fpage>
          -
          <lpage>202</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hubel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiesel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Receptive fields and functional architecture of monkey striate cortex</article-title>
          .
          <source>The Journal of physiology</source>
          pp.
          <fpage>215</fpage>
          -
          <lpage>243</lpage>
          (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories</article-title>
          .
          <source>In: Computer vision and pattern recognition</source>
          ,
          <source>2006 IEEE computer society conference on. vol. 2</source>
          , pp.
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
          . Ieee (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>In: Computer vision</source>
          ,
          <year>1999</year>
          .
          <source>The proceedings of the seventh IEEE international conference on. vol. 2</source>
          , pp.
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leung</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Contour and texture analysis for image segmentation</article-title>
          .
          <source>International journal of computer vision 43(1)</source>
          ,
          <fpage>7</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Markus</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaughton</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gladden</surname>
            ,
            <given-names>V.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skaggs</surname>
            ,
            <given-names>W.E.</given-names>
          </string-name>
          :
          <article-title>Spatial information content and reliability of hippocampal CA1 neurons: effects of visual input</article-title>
          .
          <source>Hippocampus</source>
          <volume>4</volume>
          (
          <issue>4</issue>
          ),
          <fpage>410</fpage>
          -
          <lpage>421</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mutch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblich</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>CNS: a GPU-based framework for simulating cortically-organized networks</article-title>
          .
          <source>Tech. Rep.</source>
          MIT-CSAIL-TR-2010-013 / CBCL-286, Massachusetts Institute of Technology, Cambridge, MA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modeling the shape of the scene: A holistic representation of the spatial envelope</article-title>
          .
          <source>International journal of computer vision 42(3)</source>
          ,
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barhomi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DiCarlo</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>Comparing state-of-the-art visual features on invariant object recognition tasks</article-title>
          .
          <source>In: Applications of computer vision (WACV)</source>
          ,
          <source>2011 IEEE workshop on</source>
          . pp.
          <fpage>463</fpage>
          -
          <lpage>470</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ponce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hebert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marszalek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Dataset issues in object recognition</article-title>
          . Springer Berlin Heidelberg (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Renninger</surname>
            ,
            <given-names>L.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>When is scene identification just texture recognition?</article-title>
          <source>Vision research</source>
          <volume>44</volume>
          (
          <issue>19</issue>
          ),
          <fpage>2301</fpage>
          -
          <lpage>2311</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Serre</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A feedforward architecture accounts for rapid categorization</article-title>
          .
          <source>Proceedings of the national academy of sciences</source>
          <volume>104</volume>
          (
          <issue>15</issue>
          )
          ,
          <fpage>6424</fpage>
          -
          <lpage>6429</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Sim</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elinas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Vision-based SLAM using the Rao-Blackwellised particle filter</article-title>
          .
          <source>In: IJCAI workshop on reasoning</source>
          . pp.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tacchetti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mallapragada</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santoro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosasco</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>GURLS: a toolbox for large scale multiclass learning</article-title>
          .
          <source>In: Big learning workshop at NIPS</source>
          (
          <year>2011</year>
          ), http://cbcl.mit.edu/gurls/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Warren</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossano</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wear</surname>
            ,
            <given-names>T.D.</given-names>
          </string-name>
          :
          <article-title>Perception of map-environment correspondence: The roles of features and alignment</article-title>
          .
          <source>Ecological psychology</source>
          <volume>2</volume>
          ,
          <fpage>131</fpage>
          -
          <lpage>150</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reineking</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zetzsche</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schill</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>From visual perception to place</article-title>
          .
          <source>Cognitive processing</source>
          <volume>10</volume>
          ,
          <fpage>351</fpage>
          -
          <lpage>354</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>