      Baseline Multimodal Place Classifier for the
                2012 Robot Vision Task

      Jesus Martinez-Gomez1 , Ismael Garcia-Varea1 , and Barbara Caputo2 ⋆

          1 University of Castilla-La Mancha, Albacete, Spain
            {Jesus.Martinez, Ismael.Garcia}@uclm.es
          2 Idiap Research Institute, Centre Du Parc, Rue Marconi 19,
            P.O. Box 592, CH-1920 Martigny, Switzerland
            bcaputo@idiap.ch



         Abstract. The objective of this article is to report the participation of
         the SIMD-IDIAP group in the RobotVision@ImageCLEF 2012 challenge.
         This challenge addresses the problem of multimodal place classification,
         and the 2012 edition has been organized by members of the SIMD-IDIAP
         team. One of the main novelties of the 2012 edition is the proposal of
         several techniques for feature extraction and cue integration. This paper
         details how to use these techniques to develop a multimodal place classifier
         and describes the results obtained with it. Our approach ranked 7th in
         task 1 and 4th in task 2. The complete results for all the participants of
         the 2012 RobotVision task are also reported in this article.


1      Introduction

This article describes the participation of the SIMD-IDIAP team in the fourth
edition of the Robot Vision task [7]. This competition addresses the problem of
multimodal place localization for mobile robots in indoor environments. Since its
first edition in 2009, the information provided by the organizers has consisted of
visual images acquired with a mobile robot while moving within indoor environments.
The 2012 edition, however, introduces the use of range images and also proposes
useful techniques for the development of the participants' approaches.
The SIMD-IDIAP team has developed a multimodal place classifier based on
the use of Support Vector Machines (SVMs) [13]. The classifier has been trained
using a combination of visual and depth features extracted from sequences of
images. The selected features, as well as the classifier, are those proposed by the
organizers of the task, members of the SIMD-IDIAP team. Therefore, the results
⋆
     This work was supported by the SNSF project MULTI (B. C.), and by the European
     Social Fund (FEDER), the Spanish Ministerio de Ciencia e Innovacion (MICINN),
     and the Spanish “Junta de Comunidades de Castilla-La Mancha” (MIPRCV Con-
     solider Ingenio 2010 CSD2007-00018, TIN2010-20900-C04-03, PBI08-0210-7127 and
     PPII11-0309-6935 projects, J. M.-G. and I. G.-V.)

achieved by the method presented in this article can be considered as baseline
results that any participant is expected to improve.
The rest of the paper is organized as follows: Section 2 describes the 2012 edition
of the RobotVision task. Section 3 gives an overview of the SIMD-IDIAP pro-
posal, while the feature extraction and classification techniques are described in
Section 4 and Section 5 respectively. We report the results obtained in Section 6,
and finally, in Section 7, conclusions are drawn and future work is outlined.


2     The RobotVision Task

2.1   Description

The fourth edition of the RobotVision challenge is focused on the problem of
multimodal place classification. Participants are asked to classify functional
areas on the basis of image sequences, captured by a perspective camera and a
Kinect mounted on a mobile robot (see Fig. 1) within an office environment.




              Fig. 1. Mobile robot platform used for data acquisition.


    Participants are provided with visual images and range images that can be used
to generate 3D point cloud files. The difference between visual images, range
images and 3D point cloud files can be observed in Figure 2. Training and test
sequences have been acquired within the same building and floor, but with some
variations in the lighting conditions and the acquisition procedure (clockwise and
counter-clockwise).
    Two different tasks are considered in the RobotVision challenge: task 1 and
task 2. For both tasks, participants should be able to answer the question “Where
are you?” when presented with a test sequence imaging a room category already

          Fig. 2. Visual image (left), range image (center) and 3D point cloud file (right).


seen during training. The difference between the two tasks is the presence (or
absence) of kidnappings in the final test sequence, and whether the temporal
continuity of the sequence can be used.




                         Fig. 3. Example of kidnapping.


    Kidnappings (present only in task 2) affect the room changes. Room changes
in sequences without kidnappings are usually represented by a small number
of images showing a smooth transition. On the other hand, room changes with
kidnappings are represented by a drastic change between consecutive frames, as
can be observed in Figure 3.

The Data Three different sequences of frames are provided for training and two
additional ones for the final experiment. All training frames are labelled with
the name of the room they were acquired from. There are 9 different categories
of rooms:

 – Corridor
    – Elevator Area
    – Printer Room
    – Lounge Area
    – Professor Office
    – Student Office
    – Visio Conference
    – Technical Room
    – Toilet

    Figure 4 shows an example visual image for each of the 9 room categories.


           Fig. 4. Examples of images from the RobotVision 2012 database, one per room
           category (Elevator Area, Corridor, Toilet, Lounge Area, Technical Room, Professor
           Office, Student Office, Visio Conference, Printer Room).




Task 1 This task is mandatory and the test sequence has to be classified without
using the temporal continuity of the sequence. Therefore, the order of the test
frames cannot be taken into account. Moreover, there are no kidnappings in the
final test sequence.


Task 2 This task is optional and participants can take advantage of the temporal
continuity of the test sequence. There are kidnappings in the final test sequence,
which allow participants to obtain additional points when they are managed
correctly.

Performance Evaluation The proposals of the participants are compared us-
ing the score obtained by their submissions. These submissions are the classes
or room categories assigned to the frames of the test sequences, and the score
is calculated using the rules shown in Table 1. Since wrong classifications obtain
negative points, participants are allowed to leave test frames unclassified.

              Table 1. Rules used to calculate the final score for a run.

Each correctly classified frame                                          +1 point
Each misclassified frame                                                 −1 point
Each frame that was not classified                                        0 points
(Task 2) All 4 frames after a kidnapping correctly classified            +1 point (additional)
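
These rules are easy to implement. The following is a minimal sketch of the scoring
procedure, not the official evaluation script; the `kidnapping_starts` argument is a
hypothetical list with the index of the first frame after each kidnapping, and `None`
marks a frame that was left unclassified.

```python
def score_run(predictions, ground_truth, kidnapping_starts=()):
    """Score a run according to the rules of Table 1 (unofficial sketch)."""
    score = 0
    for pred, true in zip(predictions, ground_truth):
        if pred is None:                  # frame not classified: 0 points
            continue
        score += 1 if pred == true else -1
    # Task 2 bonus: +1 when the 4 frames after a kidnapping are all correct
    pairs = list(zip(predictions, ground_truth))
    for start in kidnapping_starts:
        window = pairs[start:start + 4]
        if len(window) == 4 and all(p == t for p, t in window):
            score += 1
    return score
```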




2.2    Information provided by the organizers

The organizers of the 2012 RobotVision task propose the use of several tech-
niques for feature extraction and cue integration (classifier). Thanks to these
techniques, participants can focus on the development of new features while
using the proposed method for cue integration, or vice versa.
The organizers also provide additional resources, such as the Point Cloud Library
[11] and a basic technique for taking advantage of the temporal continuity1. All
these techniques have been used to generate the multimodal place classifier
proposed in this article and are explained in Section 4 and Section 5.


3     Overall Description

The SIMD-IDIAP proposal for the 2012 RobotVision task can be split into two
steps: training and classification. The first step is performed by generating an
SVM classifier [13] from a combination of features extracted from visual and
depth images. We opted for the visual and depth features proposed by the task
organizers and an SVM classifier. The two features are concatenated to generate
a single descriptor. The complete training process is shown in Figure 5.
    The second step corresponds to the classification of test frames. Before
classifying a test frame, the same features used in the training step have to
be extracted. After that, the features are classified using the previously generated
SVM, which produces a decision margin for each class. These margins are then
processed to decide whether to classify the frame or not, depending on the
confidence of the decision. Frames with low confidence are left unclassified in
order to avoid obtaining negative points. The complete classification and
post-processing procedure is explained in Section 5.
1
    http://imageclef.org/2012/robot




       Fig. 5. Overview of the training step used to generate the classifier.


4     Feature Extraction

As features to extract from the sequences of frames, we have chosen the Pyramid
Histogram of Orientated Gradients (PHOG) [2] and the Normal Aligned Radial
Feature (NARF). These descriptors are extracted from visual images and 3D
point cloud files respectively, and they are the features proposed by the task
organizers.
PHOG features are histogram-based global features that combine structural and
statistical approaches. Other descriptors similar to PHOG that could also be
used are: SIFT-based Pyramid Histogram Of visual Words (PHOW) [1], Pyramid
histogram of Local Binary Patterns (PLBP) [8], Self-Similarity-based PHOW
(SS-PHOW) [12], and Composed Receptive Field Histogram (CRFH) [5].
NARF is a recent descriptor that has been included in the Point Cloud Library
[11]. The number of descriptors that can be extracted from a range image is
not fixed, as is also the case for SIFT points. In order to obtain a descriptor of
fixed length, we compute a new feature from the NARF points extracted from
each 3D point cloud file.


4.1   Visual Features - PHOG

The Pyramid Histogram of Orientated Gradients (PHOG) is inspired by the
pyramid representation presented in [4] and the Histogram of Oriented Gradients
(HOG) described in [3]. The descriptor consists of a histogram of orientation
gradients computed over each image sub-region at each pyramid level. It represents
local image shape and its spatial layout by means of a spatial pyramid kernel.
Two parameters have to be fixed when extracting a PHOG descriptor from a
visual image: the number of levels of the pyramid and the number of bins of the
histogram. Each bin represents the number of edges with orientations within a
certain angular range. In our approach, we opted for 20 bins, 3 levels (0, 1 and
2) and the orientation range [0, 360]. Using these parameters, for each visual
image we obtain a descriptor with 420 components.
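
As a sanity check on the descriptor size, the sketch below computes the PHOG
length from these parameters, assuming the usual PHOG layout in which level l
splits the image into a 2^l x 2^l grid of cells, each contributing one orientation
histogram.

```python
# Descriptor length for PHOG with 20 bins and levels 0, 1 and 2.
BINS = 20
LEVELS = (0, 1, 2)

def phog_length(bins, levels):
    # one histogram of `bins` entries per grid cell, concatenated over levels
    return sum(bins * (2 ** level) ** 2 for level in levels)

assert phog_length(BINS, LEVELS) == 20 * (1 + 4 + 16) == 420
```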


4.2    Depth Features - NARF

The process to generate depth features from range images consists of three steps:
a) convert the range image into a 3D point cloud file, b) extract NARF features
from keypoints, and c) compute a pyramid descriptor from the NARF features.
The first step has been performed using a Python script provided by the task
organizers. This script is needed due to the specific format of the RobotVi-
sion@ImageCLEF 2012 range images, but the step can be skipped when using
the PCL software to register point cloud files directly from the Kinect device.
The second step extracts NARF features from the keypoints detected in a point
cloud file. The keypoints are detected by using the neighbouring cloud points and
taking into account the borders. For each keypoint, a 36-dimensional NARF
feature is extracted, and we also store the <x,y,z> position of the 3D point.
For the third step, we compute a pyramid descriptor from the data generated
in the previous step. This is done by following the pyramid representation pre-
sented in [4]. In a similar way as for the PHOG descriptor, we have to fix the
number of bins of the histogram and the number of levels. We selected 100 bins
and 2 levels (0 and 1), obtaining a descriptor with 500 components.
The PHOG and NARF-based descriptors are directly concatenated to obtain a
descriptor with 920 components per frame.
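
The paper does not fully specify how the variable number of 36-dimensional NARF
descriptors is turned into a fixed-length histogram, so the sketch below is only one
plausible reading: the descriptors are quantized against a hypothetical 100-word
codebook and a bag-of-words histogram is built per spatial cell, following the
pyramid scheme of [4] (level 0 is the whole cloud, level 1 splits the keypoints into
four cells at the median x and y). The codebook and the median split are
assumptions, not part of the original method.

```python
import numpy as np

def narf_pyramid(narf_descriptors, keypoints_xyz, codebook, bins=100):
    """Fixed-length pyramid histogram from NARF descriptors (hedged sketch)."""
    # vector quantization: nearest codeword for every 36-dimensional descriptor
    dists = np.linalg.norm(narf_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    xy = keypoints_xyz[:, :2]
    histograms = [np.bincount(words, minlength=bins)]        # level 0: whole cloud
    xmid, ymid = np.median(xy, axis=0)                        # level 1: 2x2 split
    for in_x in (xy[:, 0] < xmid, xy[:, 0] >= xmid):
        for in_y in (xy[:, 1] < ymid, xy[:, 1] >= ymid):
            histograms.append(np.bincount(words[in_x & in_y], minlength=bins))
    return np.concatenate(histograms)                         # 100 * (1 + 4) = 500

# Per frame, the final descriptor is the concatenation of both cues:
# frame_descriptor = np.concatenate([phog_420, narf_pyramid(...)])   # 920 values
```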


5     Classification and post-processing

5.1    Classifier

The algorithm proposed by the organizers for cue integration was the Online-
Batch Strongly Convex mUlti keRnel lEarning (OBSCURE) algorithm [10]. In
this work, we have used a classifier similar to OBSCURE based on the use of
Online Independent Support Vector Machines (OI-SVM) [9]. This classifier is
named Memory Controlled OI-SVM [6] and makes it possible to keep the memory
growth under control while the algorithm learns incrementally. This is done by
applying a forgetting strategy over the stored Training Samples (TSs) while
preserving the stored Support Vectors, and it approximates the original optimal
solution reasonably well.
The Memory Controlled OI-SVM is trained using the combination of visual and
depth features described in the previous section. When a test frame is classified,
the classifier produces a decision margin for each class, and this output has to be
processed to obtain the most plausible class.
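
The Memory Controlled OI-SVM itself is described in [6]; as a stand-in, the sketch
below only illustrates the interface assumed in the rest of the paper (train on the
920-component descriptors, obtain one decision margin per class) using an ordinary
one-vs-rest linear SVM from scikit-learn. It does not reproduce the incremental,
memory-bounded training of the actual classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_classifier(descriptors, labels):
    # descriptors: (n_frames, 920) concatenated PHOG + NARF-based features
    clf = LinearSVC(C=1.0)
    clf.fit(descriptors, labels)
    return clf

def class_margins(clf, descriptor):
    # one decision margin per room category, post-processed in Section 5.2
    return clf.decision_function(np.asarray(descriptor).reshape(1, -1))[0]
```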

5.2   Low confidence detection

In order to detect low confidence classifications, we process the obtained outputs
in the following way: (i) we normalize the margins by dividing all values by the
maximum margin, and (ii) we test whether the normalized outputs pass two
conditions. On the basis of the output margins {M_n^i}_{i=1}^{C}, with C the
number of classes, for each frame n, the two conditions used to detect challenging
frames are:

 1. M_n^i < M_max for all i = 1, ..., C: none of the C possible classes obtains a
    high level of confidence;
 2. |M_n^i − M_n^j| < ∆: there are at least two classes with a high level of
    confidence, but the difference between them is too small to allow for a
    confident decision.

    If a frame has been detected as challenging, the algorithm does not assign
it any room category and it is left unclassified. In all the experiments, we have
used M_max = 0.2 and ∆ = 0.2.
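
A minimal sketch of this test is given below. The paper says the margins are
divided by the maximum margin; here we divide by the largest absolute margin
(an assumption) so that condition 1 can still fire when the best raw margin is
positive.

```python
import numpy as np

M_MAX = 0.2
DELTA = 0.2

def is_challenging(margins, m_max=M_MAX, delta=DELTA):
    """Return True when the frame should be left unclassified (hedged sketch)."""
    m = np.asarray(margins, dtype=float)
    m = m / np.max(np.abs(m))                  # (i) normalization (assumed max-abs)
    best, second = np.sort(m)[::-1][:2]
    if best < m_max:                           # condition 1: no confident class
        return True
    if best - second < delta:                  # condition 2: two classes too close
        return True
    return False
```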


5.3   Temporal Continuity

For task 2, the temporal continuity of the sequence can be taken into account.
In order to take advantage of this, we used the solution proposed by the task
organizers: detect prior classes and use them to classify challenging frames.
Once a frame has been identified as challenging (and left unclassified), we use the
classification results obtained for the last n frames to solve the ambiguity: if all
of the last n frames have been assigned to the same class C_i, we consider C_i a
prior class and the label is assigned accordingly. We have used n = 5 in all the
experiments.
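
A minimal sketch of this rule, assuming that unclassified frames (None) break the
agreement:

```python
N_PRIOR = 5

def prior_class(previous_labels, n=N_PRIOR):
    """Return the prior class if the last n frames all share one label, else None."""
    recent = previous_labels[-n:]
    if len(recent) == n and None not in recent and len(set(recent)) == 1:
        return recent[0]
    return None

# A challenging frame is then labelled with prior_class(...) when it is not None,
# and left unclassified otherwise.
```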


6     Results

We submitted only two runs: one for task 1 and one for task 2. The process for
generating both runs includes all the techniques explained above.
We extracted a PHOG and a NARF-based feature descriptor for each of the
training and test frames. Both descriptors were concatenated to generate a
descriptor with 920 components and we then trained a Memory Controlled
OI-SVM using the training1, training2, and training3 sequences. Finally, we
classified the test sequences (test 1 for task 1 and test 2 for task 2) using the
MC-OI-SVM.
For both tasks, we post-processed the output generated by the classifier to
detect challenging frames. These challenging frames were not labelled. For task
2, we used the temporal continuity of the sequence to classify the challenging
frames when possible (i.e., when a prior class was detected using the last 5 frames).

6.1   Task 1
Eight different groups submitted runs for task 1 and all the results can be
observed in Table 2. The maximum score that could be achieved was 2445 and
the best group (CIII UTN FRC) obtained 2071 points. Our proposal ranked 7th
with 462 points and, as expected, most of the teams achieved better results than
our baseline. The submitted run classified only 1526 frames (37.5% of the test
frames were detected as challenging), with 994 correct and 532 incorrect
classifications.


       Table 2. Ranking of the runs submitted by the groups for the task 1.

                 Rank      Group Name             Best Score
                  1       CIII UTN FRC              2071
                  2           NUDT                  1817
                  3         UAIC2012                1348
                  4        USUroom409               1225
                  5      SKB Kontur Labs            1028
                  6         CBIRITU                  551
                  7 SIMD-IDIAP - baseline results    462
                  8        BuffaloVision             -70




6.2   Task 2
For the optional task, the maximum score was 4079 and only 4 groups submitted
runs. Our submission ranked 4th with 1041 points and the winner of task 2
was the CIII UTN FRC group with 3930 points. All the results can be seen in
Table 3. In this task, our algorithm initially discarded 1205 challenging frames,
but 97 of them were subsequently classified using a prior class detected through
the temporal continuity. Concretely, the final submission consisted of 2915
classified frames and 1108 (27.54%) frames that were not labelled.


       Table 3. Ranking of the runs submitted by the groups for the task 2.

                 Rank      Group Name             Best Score
                  1       CIII UTN FRC              3930
                  2           NUDT                  3925
                  3         CBIRITU                 3169
                  4 SIMD-IDIAP - baseline results   1041




7     Conclusions and Future Work
This article describes the participation of the SIMD-IDIAP group in the RobotVi-
sion task at ImageCLEF 2012. We developed an approach using the techniques
for feature extraction, cue integration, and temporal continuity proposed by the
task organizers.
We submitted runs for task 1 (mandatory) and task 2 (optional) using the
information extracted from visual and depth images. Since these runs were
generated using the techniques proposed by the organizers, the results can be
considered baseline results.
Our best runs in the two tracks ranked respectively seventh (task 1) and fourth
(task 2), showing that most of the teams ranked better than the baseline results.
For future work, we plan to improve the NARF-based descriptor introduced in
this article. We also plan to evaluate the use of larger descriptors, which can be
obtained by using a higher number of levels for the pyramid and bins for the
histogram.

References
 1. A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests
    and ferns. In International Conference on Computer Vision, pages 1–8. Citeseer,
    2007.
 2. A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid
    kernel. In Proceedings of the 6th ACM international conference on Image and video
    retrieval, page 408. ACM, 2007.
 3. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection.
    In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
    Society Conference on, volume 1, pages 886–893. IEEE, 2005.
 4. S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
    matching for recognizing natural scene categories. In 2006 IEEE Computer Society
    Conference on Computer Vision and Pattern Recognition, volume 2, 2006.
 5. O. Linde and T. Lindeberg. Object recognition using composed receptive field
    histograms of higher dimensionality. In Proc. ICPR. Citeseer, 2004.
 6. J. Martinez and B. Caputo. Towards semi-supervised learning of semantic spatial
    concepts for mobile robots. Journal of Physical Agents, 4(3):19–31, 2010.
 7. Jesus Martinez-Gomez, Ismael Garcia-Varea, and Barbara Caputo. Overview of
    the imageclef 2012 robot vision task. In CLEF 2012 working notes, 2012.
 8. T. Ojala, M. Pietikäinen, and T. Mäenpää. Gray scale and rotation invariant
    texture classification with local binary patterns. Computer Vision-ECCV 2000,
    pages 404–420, 2000.
 9. F. Orabona, C. Castellini, B. Caputo, J. Luo, and G. Sandini. Indoor place recogni-
    tion using online independent support vector machines. In Proc. BMVC, volume 7,
    2007.
10. F. Orabona, L. Jie, and B. Caputo. Online-Batch Strongly Convex Multi Kernel
    Learning. In Proc. of Computer Vision and Pattern Recognition, CVPR, 2010.
11. R.B. Rusu and S. Cousins. 3d is here: Point cloud library (pcl). In Robotics and
    Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE,
    2011.
12. E. Shechtman and M. Irani. Matching local self-similarities across images and
    videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
    CVPR’07, pages 1–8, 2007.
13. V.N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York
    Inc, 2000.