<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Baseline Multimodal Place Classifier for the 2012 Robot Vision Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jesus Martinez-Gomez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ismael Garcia-Varea</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Caputo</string-name>
          <email>bcaputo@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute, Centre Du Parc</institution>
          ,
          <addr-line>Rue Marconi 19 P.O. Box 592, CH-1920 Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jesus.Martinez</institution>
          ,
          <addr-line>Ismael.Garcia</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article reports the participation of the SIMD-IDIAP group in the RobotVision@ImageCLEF 2012 challenge. This challenge addresses the problem of multimodal place classification, and the 2012 edition has been organized by the members of the SIMD-IDIAP team. One of the main novelties of the 2012 edition of the task is the proposal of several techniques for feature extraction and cue integration. This paper details how all these techniques can be used to develop a multimodal place classifier and describes the results obtained with it. Our approach ranked 7th for task 1 and 4th for task 2. The complete results for all the participants of the 2012 RobotVision task are also reported in this article.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This article describes the participation of the SIMD-IDIAP team in the fourth
edition of the Robot Vision task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This competition addresses the problem of
multimodal place localization for mobile robots in indoor environments. Since its
first edition in 2009, the information provided by the organizers has consisted of visual
images acquired by a mobile robot while moving within indoor environments.
The 2012 edition, however, introduces the use of range images and also
proposes useful techniques for the development of the approaches.
The SIMD-IDIAP team has developed a multimodal place classifier based on
the use of Support Vector Machines (SVMs) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The classifier has been trained
using a combination of visual and depth features extracted from sequences of
images. The selected features, as well as the classifier, are those proposed by the
organizers of the task, who are members of the SIMD-IDIAP team. Therefore, the results
achieved by the method presented in this article can be considered baseline
results that any participant is expected to improve.
      </p>
      <p>The rest of the paper is organized as follows: Section 2 describes the 2012 edition
of the RobotVision task. Section 3 gives an overview of the SIMD-IDIAP
proposal, while the feature extraction and classification techniques are described in
Section 4 and Section 5 respectively. We report the results obtained in Section 6,
and finally, in Section 7, conclusions are drawn and future work is outlined.</p>
    </sec>
    <sec id="sec-2">
      <title>The RobotVision Task</title>
      <sec id="sec-2-1">
        <title>Description</title>
        <p>The fourth edition of the RobotVision challenge is focused on the problem of
multi-modal place classification. Participants are asked to classify functional
areas on the basis of image sequences, captured by a perspective camera and a
Kinect mounted on a mobile robot (see Fig. 1) within an office environment.</p>
        <p>Participants have available visual images and range images that can be used
to generate 3D point cloud files. The difference between visual images, range
images and 3D point cloud files can be observed in Figure 2. Training and test
sequences have been acquired within the same building and floor but with some
variations in the lighting conditions and the acquisition procedure (clockwise and
counter-clockwise).</p>
        <p>Two different tasks are considered in the RobotVision challenge: task 1 and
task 2. For both tasks, participants should be able to answer the question "where
are you?" when presented with a test sequence imaging a room category already
seen during training. The difference between the two tasks is the presence (or absence)
of kidnappings in the final test sequence, and whether the temporal
continuity of the sequence may be exploited.</p>
        <p>Kidnappings (present only in task 2) affect the room changes. Room changes
in sequences without kidnappings are usually represented by a small number
of images showing a smooth transition. In contrast, room changes with
kidnappings are represented by a drastic change between consecutive frames, as can be observed
in Figure 3.</p>
        <p>The Data Three different sequences of frames are provided for training and two
additional ones for the final experiment. All training frames are labelled with
the name of the room they were acquired from. There are 9 different categories
of rooms:</p>
        <p>- Corridor
- Elevator Area
- Printer Room
- Lounge Area
- Professor Office
- Student Office
- Visio Conference
- Technical Room
- Toilet</p>
        <p>Figure 4 shows an example visual image for each of the 9 room
categories.</p>
        <p>Task 1 This task is mandatory and the test sequence has to be classified without
using the temporal continuity of the sequence. Therefore, the order of the test
frames cannot be taken into account. Moreover, there are no kidnappings in the
final test sequence.</p>
        <p>Task 2 This task is optional and participants can take advantage of the temporal
continuity of the test sequence. There are kidnappings in the final test sequence
that allow participants to obtain additional points when they are handled
correctly.
Performance Evaluation The proposals of the participants are compared
using the score obtained by their submissions. These submissions are the classes
or room categories assigned to the frames of the test sequences, and the score
is calculated using the rules shown in Table 1. Since wrong
classifications yield negative points, participants are allowed to leave test
frames unclassified.</p>
        <p>
          Each correctly classified frame: +1 point
Each misclassified frame: -1 point
Each frame that was not classified: +0 points
(Task 2) All 4 frames following a kidnapping classified correctly: +1 additional point
The organizers of the 2012 RobotVision task propose the use of several
techniques for feature extraction and cue integration (classifier). Thanks to the use
of these techniques, participants can focus on the development of new features
while using the proposed method for cue integration, or vice versa.
The organizers also provide resources such as the Point Cloud Library [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and a
basic technique for taking advantage of the temporal continuity1. All these
techniques have been used for generating the multimodal place classifier proposed in
this article and are explained in Section 4 and Section 5.
        </p>
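        <p>As an illustration of these rules, the following Python sketch (our own, not the official evaluation script) computes the score of a run; the list-based representation, the None convention for unclassified frames, and the kidnapping bookkeeping are assumptions made for the example.</p>
        <preformat>
# Hypothetical scoring sketch for the rules of Table 1 (not the official script).
def score_run(predictions, ground_truth, kidnapping_frames=(), task2=False):
    """predictions: one label per test frame, with None meaning 'not classified'."""
    score = 0
    for pred, truth in zip(predictions, ground_truth):
        if pred is None:                 # unclassified frame: +0 points
            continue
        score += 1 if pred == truth else -1
    if task2:
        # Assumed bookkeeping: +1 extra point when the 4 frames that follow a
        # kidnapping (given here by their indices) are all classified correctly.
        for k in kidnapping_frames:
            window = list(zip(predictions[k + 1:k + 5], ground_truth[k + 1:k + 5]))
            if len(window) == 4 and all(p == t for p, t in window):
                score += 1
    return score

# Example: score_run(["Corridor", None, "Toilet"], ["Corridor", "Toilet", "Toilet"]) gives 2.</preformat>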
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Overall Description</title>
      <p>
        The SIMD-IDIAP proposal for the 2012 RobotVision task can be split into two
steps: training and classification. The first step is performed by generating an
SVM classifier [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] with a combination of features extracted from visual and
depth images. We opted for the visual and depth features proposed by the task
organizers and an SVM classifier. The two features are then concatenated to
generate a single descriptor. The complete training process is shown in Figure 5.
      </p>
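      <p>The following sketch illustrates this training step; random arrays stand in for the real descriptors, and scikit-learn's LinearSVC is used here only as a stand-in for the Memory Controlled OI-SVM described in Section 5.</p>
      <preformat>
import numpy as np
from sklearn.svm import LinearSVC   # stand-in for the Memory Controlled OI-SVM

# Random arrays stand in for the real descriptors: PHOG (N x 420) and NARF-based
# (N x 500) features extracted from N labelled training frames, one room label each.
rng = np.random.default_rng(0)
phog = rng.random((30, 420))
narf = rng.random((30, 500))
labels = rng.integers(0, 9, size=30)           # 9 room categories

X = np.hstack([phog, narf])                    # 920-value multimodal descriptor per frame
classifier = LinearSVC().fit(X, labels)
margins = classifier.decision_function(X[:1])  # one decision margin per class (Section 5)</preformat>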
      <p>
        The second step corresponds to the classification of the test frames. Before
classifying a test frame, it is necessary to extract the same features extracted in
the training step. After that, the features are classified using the previously
generated SVM, which produces a decision margin for each class. All these margins are
then processed to decide whether to classify the frame or not, depending on the
level of confidence of the decision. Low confidence frames will not be classified in
order to avoid obtaining negative points. The complete process for classification
and post-processing is explained in Section 5.
1 http://imageclef.org/2012/robot
As features to extract from the sequences of frames, we have chosen the Pyramid
Histogram of Orientated Gradients (PHOG) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the Normal Aligned Radial
Feature (NARF). These descriptors can be extracted from visual images and 3D
point cloud files respectively, and they are the features proposed by the task
organizers.
      </p>
      <p>
        PHOG features are histogram-based global features that combine structural and
statistical approaches. Other descriptors similar to PHOG that could also be
used are: the SIFT-based Pyramid Histogram Of visual Words (PHOW) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the Pyramid
Histogram of Local Binary Patterns (PLBP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the Self-Similarity-based PHOW
(SS-PHOW) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and the Composed Receptive Field Histogram (CRFH) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
NARF is a novel descriptor that has been included in the
Point Cloud Library [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The number of descriptors that can be extracted from a
range image is not fixed, in the same manner as SIFT points. In order to extract
descriptors with the same length, we have computed a new feature from the
NARF points extracted from a 3D point cloud file.
The Pyramid Histogram of Orientated Gradients (PHOG) is inspired by the
pyramid representation presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the Histogram of Oriented
Gradients (HOG) described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This descriptor thus consists of a histogram of
orientation gradients over each image sub-region at each level. It represents local
image shape and its spatial layout, together with a spatial pyramid kernel.
Two parameters should be fixed when extracting a PHOG descriptor from a
visual image: the number of levels of the pyramid and the number of bins of the
histogram. Each bin represents the number of edges with orientations within a
certain angular range. In our approach, we opted for 20 bins, 3 levels (0, 1 and
2) and the range of orientations [0, 360]. Using these parameters, for each visual
image we obtain a 420-byte descriptor.
      </p>
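      <p>The sketch below illustrates how such a descriptor can be computed with OpenCV and NumPy under the stated parameters; it is our own illustration of the pyramid-of-orientation-histograms idea, not the extractor provided by the organizers.</p>
      <preformat>
import numpy as np
import cv2  # OpenCV

def phog(gray, bins=20, levels=(0, 1, 2)):
    """Sketch of a PHOG descriptor: orientation histograms over a spatial pyramid."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)   # angles in [0, 360)
    h, w = gray.shape
    descriptor = []
    for level in levels:
        cells = 2 ** level                       # 1, 2 or 4 cells per side
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * h // cells, (i + 1) * h // cells)
                xs = slice(j * w // cells, (j + 1) * w // cells)
                hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                       range=(0, 360), weights=mag[ys, xs])
                descriptor.append(hist)
    return np.concatenate(descriptor)            # 20 * (1 + 4 + 16) = 420 values

# e.g. phog(gray_image).shape == (420,) for any single-channel image.</preformat>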
      <sec id="sec-3-1">
        <title>Depth Features - NARF</title>
        <p>The process to generate depth features from range images consists of three steps:
a) convert the range image into a 3D point cloud file, b) extract NARF features
from keypoints, and c) compute a pyramidal descriptor from the NARF
features.</p>
        <p>
          The first step has been done by using a Python script provided by the task
organizers. This script is needed due to the specific format of the
RobotVision@ImageCLEF 2012 range images, but the step can be skipped when using
the PCL software to register point cloud files directly from the Kinect device.
The second step extracts NARF features from the keypoints detected in a point
cloud file. The keypoints are detected by using the neighbouring cloud points and
taking into account the borders. For each keypoint, a 36-byte NARF feature is
extracted, and we also store the &lt;x,y,z&gt; position of the 3D point.
For the third step, we compute a pyramidal descriptor from the data generated
in the previous step. This is done by following the pyramid representation
presented in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In a similar way as for the PHOG descriptor, we have to fix the
number of bins of the histogram and the number of levels. We selected 100 bins
and 2 levels (0 and 1), obtaining a 500-byte descriptor.
        </p>
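        <p>For illustration, the first step amounts to back-projecting each depth pixel through the camera model; the sketch below uses typical Kinect intrinsics as assumed values and is not the organizers' conversion script.</p>
        <preformat>
import numpy as np

def depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project a depth image (in metres) into an N x 3 array of 3D points.
    The intrinsics above are typical Kinect values, used here as assumptions."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.dstack([x, y, z]).reshape(-1, 3)
    return points[points[:, 2] != 0]             # drop invalid (zero-depth) pixels</preformat>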
        <p>Both PHOG and NARF-based descriptors are directly concatenated to obtain a
920-byte descriptor per frame.</p>
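        <p>The descriptor lengths follow from these pyramid parameters, assuming 4 cells per level of subdivision as in the spatial pyramid representation; a quick check:</p>
        <preformat>
# Descriptor lengths implied by the pyramid parameters used above
# (4**level cells per pyramid level).
phog_bins, phog_levels = 20, (0, 1, 2)
narf_bins, narf_levels = 100, (0, 1)
phog_len = sum(phog_bins * 4 ** level for level in phog_levels)   # 20 + 80 + 320 = 420
narf_len = sum(narf_bins * 4 ** level for level in narf_levels)   # 100 + 400 = 500
print(phog_len, narf_len, phog_len + narf_len)                    # 420 500 920</preformat>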
      </sec>
      <sec id="sec-3-2">
        <title>Classifier</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Classification and post-processing</title>
      <p>
        The algorithm proposed by the organizers for cue integration was the
Online-Batch Strongly Convex mUlti keRnel lEarning (OBSCURE) algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In this
work, we have used a classifier similar to OBSCURE based on the use of Online
Independent Support Vector Machines (OI-SVM) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This classifier is named
Memory Controlled OI-SVM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and keeps the memory
growth under control while the algorithm learns incrementally. This is done by applying a
forgetting strategy over the stored Training Samples (TSs) while preserving the
stored Support Vectors, and it approximates the original optimal
solution reasonably well.
      </p>
      <p>The Memory Controlled OI-SVM is trained using the combination of visual and
depth features described in the previous section. When a test frame is classified,
the classifier obtains a decision margin for each class, and this output has to be
processed to obtain the most plausible class.</p>
      <sec id="sec-4-1">
        <title>Low confidence detection</title>
        <p>In order to detect low confidence classifications, we process the obtained outputs
in the following way: (i) we normalize the margins by dividing all values by the
maximum margin, and (ii) we test whether the normalized outputs satisfy two
conditions. On the basis of the output margins M_n^i, i = 1, ..., C, with C the number of
classes, for each frame n, the two conditions used to detect challenging frames
are:</p>
        <p>1. M_n^i &lt; M_max for all i = 1, ..., C: none of the C possible classes obtains a high level
of confidence;
2. |M_n^i - M_n^j| &lt; threshold for some pair of classes i and j: there are at least two classes with a high level of
confidence, but the difference between them is too small to allow for a confident
decision.</p>
        <p>If a frame has been detected as challenging, the algorithm will not assign
it any room category and it will remain unclassified. In all the experiments, we
have used M_max = 0.2 and a margin-difference threshold of 0.2.</p>
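        <p>A small sketch of this rejection rule as used in our runs is given below; the exact normalization of the margins shown here (dividing by the largest absolute margin) is one plausible reading of the description above, so the code is illustrative rather than a reproduction of our implementation.</p>
        <preformat>
import numpy as np

def is_challenging(margins, m_max=0.2, diff=0.2):
    """Return True when a frame should be left unclassified.
    margins: one SVM decision margin per class for the current frame."""
    m = np.asarray(margins, dtype=float) / np.abs(margins).max()
    top, second = np.sort(m)[::-1][:2]           # two largest normalized margins
    if top &lt; m_max:                           # condition 1: no class is confident enough
        return True
    if top - second &lt; diff:                   # condition 2: the two best classes are too close
        return True
    return False

def classify_frame(margins, classes):
    """Pick the most plausible class, or None for challenging frames."""
    if is_challenging(margins):
        return None
    return classes[int(np.argmax(margins))]</preformat>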
      </sec>
      <sec id="sec-4-2">
        <title>Temporal Continuity</title>
        <p>For task 2, the temporal continuity of the sequence can be taken into account.
In order to take advantage of this, we used the solution proposed by the
task organizers: detect prior classes and use them to classify challenging frames.
Once a frame has been identified as challenging (and left unclassified), we use the
classification results obtained for the last n frames to resolve the ambiguity: if all
of the last n frames have been assigned to the same class Ci, then we conclude
that the current frame also comes from class Ci, we consider Ci a prior class, and the
label is assigned accordingly. We have used n = 5 for all the experiments.</p>
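        <p>A minimal sketch of this prior-class rule (n = 5) is given below; the list-based bookkeeping of previously assigned labels is an assumption made for the example.</p>
        <preformat>
def prior_class(previous_labels, n=5):
    """Return the prior class if the last n assigned labels all agree, else None.
    previous_labels: labels already assigned to earlier frames (None = unclassified)."""
    recent = previous_labels[-n:]
    if len(recent) == n and None not in recent and len(set(recent)) == 1:
        return recent[0]
    return None

# For a challenging frame: label = prior_class(labels_so_far)  (may still be None).</preformat>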
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We only submitted two runs: one for task 1 and one for task 2. The
process for generating both runs includes all the techniques that have been
explained above.</p>
      <p>We extracted a PHOG and a NARF-based feature descriptor for each of the
training and test frames. Both descriptors were concatenated to generate a 920-byte
descriptor and then we trained a Memory Controlled OI-SVM using the
training1, training2, and training3 sequences. Finally, we classified the test
sequences (test 1 for task 1 and test 2 for task 2) using the MC-OI-SVM.
For both tasks, we post-processed the output generated by the classifier to
detect challenging frames. These challenging frames were not labelled. For task
2, we used the temporal continuity of the sequence to classify the challenging
frames when possible (when a prior class over the last 5 frames was detected).</p>
      <sec id="sec-5-1">
        <title>Task 1</title>
        <p>Eight different groups submitted runs for task 1 and all the results can be
observed in Table 2. The maximum score that could be achieved was 2445 and
the best group (CIII UTN FRC) obtained 2071 points. Our proposal ranked 7th
with 462 points and most of the teams, as expected, achieved better results than ours.
The submitted run classified only 1526 frames (37.5% of the test frames were
detected as challenging), with 994 correct and 532 incorrect classifications.</p>
        <p>For the optional task, the maximum score was 4079 and only 4 groups submitted
runs. Our submission ranked 4th with 1041 points and the winner of task
2 was the CIII UTN FRC group with 3930 points. All the results can be seen in
Table 3. In this task, our algorithm discarded 1205 challenging frames, but 97 of
them were classified using a prior class detected with the temporal continuity.
Concretely, the final submission consisted of 2915 classified frames and 1108
(27.54%) frames that were not labelled.</p>
        <p>This article describes the participation of the SIMD-IDIAP group in the
RobotVision task at ImageCLEF 2012. We developed an approach using the techniques
for feature extraction, cue integration, and temporal continuity proposed by the
task organizers.</p>
        <p>We submitted runs for task 1 (mandatory) and task 2 (optional) using the
information extracted from visual and depth images. Since these runs were generated
using the techniques proposed by the organizers, the results can be considered
baseline results.</p>
        <p>Our best runs in the two tracks ranked seventh (task 1) and fourth
(task 2) respectively, showing that most of the teams performed better than the baseline.
For future work, we plan to improve the NARF-based descriptor
introduced in this article. We also plan to evaluate the use of larger descriptors,
which can be obtained by using a higher number of pyramid levels and histogram bins.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Munoz</surname>
          </string-name>
          .
          <article-title>Image classification using random forests and ferns</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Munoz</surname>
          </string-name>
          .
          <article-title>Representing shape with a spatial pyramid kernel</article-title>
          .
          <source>In Proceedings of the 6th ACM international conference on Image and video retrieval, page 408. ACM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          ,
          <year>2005</year>
          .
          <article-title>CVPR 2005</article-title>
          . IEEE Computer Society Conference on, volume
          <volume>1</volume>
          , pages
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          .
          <string-name>
            <surname>Ieee</surname>
          </string-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ponce</surname>
          </string-name>
          .
          <article-title>Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories</article-title>
          .
          <source>In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , volume
          <volume>2</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>O.</given-names>
            <surname>Linde</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Lindeberg</surname>
          </string-name>
          .
          <article-title>Object recognition using composed receptive field histograms of higher dimensionality</article-title>
          .
          <source>In Proc. ICPR. Citeseer</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinez</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          .
          <article-title>Towards semi-supervised learning of semantic spatial concepts for mobile robots</article-title>
          .
          <source>Journal of Physical Agents</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          ):
          <fpage>19</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Jesus</given-names>
            <surname>Martinez-Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ismael</given-names>
            <surname>Garcia-Varea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Caputo</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEF 2012 robot vision task</article-title>
          .
          <source>In CLEF 2012 working notes</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T.</given-names>
            <surname>Ojala</surname>
          </string-name>
          , M. Pietikainen, and T. Maenpaa.
          <article-title>Gray scale and rotation invariant texture classification with local binary patterns</article-title>
          .
          <source>Computer Vision-ECCV</source>
          <year>2000</year>
          , pages
          <fpage>404</fpage>
          -
          <lpage>420</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>F.</given-names>
            <surname>Orabona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Sandini</surname>
          </string-name>
          .
          <article-title>Indoor place recognition using online independent support vector machines</article-title>
          .
          <source>In Proc. BMVC</source>
          , volume
          <volume>7</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>F.</given-names>
            <surname>Orabona</surname>
          </string-name>
          , L. Jie, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          .
          <article-title>Online-Batch Strongly Convex Multi Kernel Learning</article-title>
          .
          <source>In Proc. of Computer Vision and Pattern Recognition, CVPR</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>R.B.</given-names>
            <surname>Rusu</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Cousins</surname>
          </string-name>
          .
          <article-title>3D is here: Point Cloud Library (PCL)</article-title>
          .
          <source>In Robotics and Automation (ICRA)</source>
          ,
          <source>2011 IEEE International Conference on, pages 1-4</source>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. E. Shechtman and
          <string-name>
            <given-names>M.</given-names>
            <surname>Irani</surname>
          </string-name>
          .
          <article-title>Matching local self-similarities across images and videos</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2007</year>
          . CVPR '07, pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>V.N.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>The nature of statistical learning theory</article-title>
          . Springer-Verlag New York Inc,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>