<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CPPP/UFMS at ImageCLEF 2014: Robot Vision Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo de Carvalho Gomes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucas Correia Ribas</string-name>
          <email>lucascorreiaribas@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amaury Antonio de Castro Junior</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wesley Nunes Goncalves</string-name>
          <email>wesley.goncalves@ufms.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal University of Mato Grosso do Sul - Ponta Porã Campus</institution>
          ,
          <addr-line>Rua Itibire Vieira, s/n, CEP 79907-414 Ponta Porã - MS</addr-line>
          ,
          <country>Brazil</country>
        </aff>
      </contrib-group>
      <fpage>348</fpage>
      <lpage>354</lpage>
      <abstract>
        <p>This paper describes the participation of the CPPP/UFMS group in the Robot Vision task. We applied the spatial pyramid matching method proposed by Lazebnik et al., which extends bag-of-visual-words to spatial pyramids by concatenating histograms of local features found in increasingly fine sub-regions. To form the visual vocabulary, k-means clustering was applied to a random subset of images from the training dataset. The images are then classified using a pyramid match kernel and k-nearest neighbors. The system has shown promising results, particularly for object recognition.</p>
      </abstract>
      <kwd-group>
        <kwd>Scene recognition</kwd>
        <kwd>object recognition</kwd>
        <kwd>spatial pyramid matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, robotics has achieved important advances, such as intelligent
industrial devices that are increasingly accurate and efficient. Despite these
advances, most robots still represent the surrounding environment by means of a
map with information about obstacles and free spaces. To carry out more complex
autonomous tasks, robots should be able to get a better understanding
of images. In particular, the ability to identify scenes such as an office or a kitchen,
as well as objects, is an important step toward performing complex tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Thus,
place localization and object recognition become a fundamental part of image
understanding for robot localization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        This paper presents the participation of our group in the 6th edition of the
Robot Vision challenge [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. This challenge addresses the problem of
semantic place classification and object recognition. For this task, the
bag-of-visual-words approach (BOW) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] is one of the most promising approaches available.
Although the approach has advantages, it also has one major drawback: the
absence of spatial information. To overcome this drawback, a spatial pyramid
framework combined with local feature extractors, such as the Scale
Invariant Feature Transform (SIFT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Speeded Up Robust Features (SURF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
was proposed by Lazebnik et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This method showed significantly improved
performance on challenging scene categorization tasks. The image recognition
system used in our participation is based on this improved BOW and multi-class
classifiers.
      </p>
      <p>Experimental results have shown that the image recognition system provides
promising results, in particular for the object recognition task. Among four
systems, the proposed system ranked second using k = 400 clusters
and M = 150 images for training the vocabulary.</p>
      <p>This paper is organized as follows. Section 2 presents the image recognition
system used by our group in the Robot Vision challenge. The experiments and
results of the proposed system are described in Section 3. Finally, conclusions
and future work are discussed in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Image Recognition System</title>
      <p>In this section, we describe the image recognition system used in the challenge.
The system consists of three steps: i) feature extraction; ii) spatial
pyramid matching; iii) classification. The following sections describe each step in
detail.</p>
      <sec id="sec-2-1">
        <title>Feature Extraction</title>
        <p>In the feature extraction step, the system extracts SIFT descriptors from 16 × 16
pixel patches computed over a grid with a spacing of 8 pixels. For each patch i, a
128-dimensional descriptor is calculated, i.e., a vector $\varphi_i \in \mathbb{R}^{128}$. To train
the visual vocabulary, we perform k-means clustering on a random subset of
descriptors $D = \{\varphi\}$ from the training set according to Equation 1. Throughout
the paper, the number of clusters is referred to as k and the size of the
random subset of images is referred to as M.</p>
        <p>$$C = \text{k-means}(D) \tag{1}$$
where $C \in \mathbb{R}^{k \times 128}$ represents the clusters.</p>
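        <p>As an illustration of the dense sampling described at the start of this section, the following minimal sketch enumerates the 16 × 16 patches placed every 8 pixels. The function name <monospace>dense_grid</monospace>, the use of top-left corners, and the 64 × 48 example image size are assumptions made for the illustration; they are not taken from our implementation.</p>

```python
def dense_grid(width, height, patch=16, step=8):
    """Return the top-left corners of all patch x patch windows
    placed on a regular grid with the given spacing (in pixels).
    Only windows that fit entirely inside the image are kept."""
    corners = []
    for y in range(0, height - patch + 1, step):
        for x in range(0, width - patch + 1, step):
            corners.append((x, y))
    return corners


# Hypothetical 64 x 48 image: 7 positions horizontally, 5 vertically.
corners = dense_grid(64, 48)
print(len(corners))  # 35
```

        <p>One SIFT descriptor would then be computed for each of these patches.</p>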
        <p>Then, each descriptor vector $\varphi_i$ is assigned to the closest cluster according
to the Euclidean distance (Equation 2). The index of the assigned cluster is usually called
a visual word in the bag-of-visual-words approach.</p>
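        <p>$$w_i = \arg\min_{j = 1, \dots, k} \lvert \varphi_i, C_j \rvert \tag{2}$$
where $w_i$ is the visual word assigned to patch i and $\lvert \cdot \rvert$ is the Euclidean distance.</p>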
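        <p>Equations 1 and 2 can be sketched as follows. This is a plain-Python illustration under stated assumptions, not our implementation: the helper names (<monospace>sq_dist</monospace>, <monospace>nearest</monospace>, <monospace>kmeans</monospace>), the random initialization, the fixed number of iterations, and the toy 2-dimensional descriptors are all hypothetical, whereas real SIFT descriptors have 128 dimensions.</p>

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(desc, centers):
    """Equation 2: index of the closest cluster center (the visual word)."""
    return min(range(len(centers)), key=lambda j: sq_dist(desc, centers[j]))


def kmeans(descs, k, iters=20, seed=0):
    """Equation 1: plain Lloyd iterations started from k random descriptors."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descs, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descs:
            groups[nearest(d, centers)].append(d)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


# Toy descriptors forming two well-separated groups.
descs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(descs, k=2)
words = [nearest(d, centers) for d in descs]
print(words)
```

        <p>The first two toy descriptors receive one visual word and the last two receive the other; which index goes to which group depends on the random initialization.</p>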
      </sec>
      <sec id="sec-2-2">
        <title>Spatial Pyramid Matching</title>
        <p>Pyramid matching was proposed to find an approximate correspondence
between two sets, such as histograms. It works by placing a sequence of grids over
the space and calculating a weighted sum of the number of matches. Consider a
sequence of grids at resolutions $l = 0, \dots, L$. A grid at level l has $2^{dl}$ cells, where
d is the dimension of the space, which in our case is d = 2. In each cell i, the method calculates
the histogram $H^l(i)$ of visual words. The number of matches at level l for two
images X and Y is given by:</p>
        <p>$$I^l = \sum_{i=1}^{2^{dl}} \min\left(H_X^l(i), H_Y^l(i)\right) \tag{3}$$</p>
        <p>In order to penalize matches found in larger cells, the pyramid match kernel between
images X and Y, considering all levels l, is given by:</p>
        <p>$$\kappa^L(X, Y) = \frac{1}{2^L} I^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^l \tag{4}$$</p>
        <p>The kernel above is calculated for each visual word, such that:</p>
        <p>$$K^L(X, Y) = \sum_{j=1}^{k} \kappa^L(X_j, Y_j) \tag{5}$$
where $X_j$ indicates that the kernel considers only the visual word j.</p>
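        <p>The pyramid match kernel of this section can be sketched as below. The sketch assumes that each image is given as a list of (x, y, word) triples with coordinates normalized to the interval from 0 (inclusive) to 1 (exclusive); this input format and the function names are hypothetical. Because the per-word sum is linear, the per-word histograms are kept in a single counter keyed by cell and word.</p>

```python
from collections import Counter


def level_histogram(points, level):
    """Count visual words per grid cell at one pyramid level.
    The grid has 2**level cells along each of the d = 2 dimensions."""
    cells = 2 ** level
    hist = Counter()
    for x, y, word in points:
        cx = min(int(x * cells), cells - 1)
        cy = min(int(y * cells), cells - 1)
        hist[(cx, cy, word)] += 1
    return hist


def intersection(hx, hy):
    """Histogram intersection, accumulated over all cells and visual words."""
    return sum(min(count, hy[key]) for key, count in hx.items())


def pyramid_kernel(px, py, L=2):
    """Weighted sum of the matches found at levels 0..L."""
    matches = [
        intersection(level_histogram(px, l), level_histogram(py, l))
        for l in range(L + 1)
    ]
    value = matches[0] / 2 ** L
    for l in range(1, L + 1):
        value += matches[l] / 2 ** (L - l + 1)
    return value


# Two toy images with the same two words in swapped positions.
image_a = [(0.1, 0.1, 0), (0.9, 0.9, 1)]
image_b = [(0.9, 0.9, 0), (0.1, 0.1, 1)]
print(pyramid_kernel(image_a, image_a))  # 2.0
print(pyramid_kernel(image_a, image_b))  # 0.5
```

        <p>In the toy example, the swapped image only matches at the coarsest level, so its kernel value drops from 2.0 to 0.5; this is the spatial information that a plain bag-of-visual-words histogram would ignore.</p>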
      </sec>
      <sec id="sec-2-3">
        <title>Classification</title>
        <p>The multi-class classification is done with the k-nearest neighbor classifier using the kernel
$K^L$ described above. Given a test image, the system calculates the kernel value against all
training images and assigns the room/category of the closest training image, i.e., the one
with the highest kernel value. The same procedure is used for object recognition: the objects
present in the closest training image are reported as present in the test image.</p>
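        <p>The nearest-neighbor decision can be sketched as follows. To keep the example self-contained, a simple histogram intersection on toy 3-bin histograms stands in for the pyramid kernel; the function names, the training tuples, and the labels attached to them are invented for the illustration.</p>

```python
def classify(test_image, training_set, kernel):
    """Return the room and the objects of the most similar training image.
    training_set is a list of (image, room, objects) tuples and
    kernel(a, b) is a similarity value (higher means closer)."""
    best = max(training_set, key=lambda item: kernel(test_image, item[0]))
    return best[1], best[2]


def toy_kernel(a, b):
    """Histogram intersection, used here as a stand-in similarity."""
    return sum(min(x, y) for x, y in zip(a, b))


# Hypothetical training data: (histogram, room, objects present).
training = [
    ([5, 0, 1], "Corridor", {"Extinguisher"}),
    ([0, 4, 2], "Toilet", {"Urinal", "Trash"}),
]
room, objects = classify([1, 3, 2], training, toy_kernel)
print(room, sorted(objects))  # Toilet ['Trash', 'Urinal']
```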
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In this section, we describe the experiments and results of the proposed
system. To train the system, we used 5000 visual images divided into 10
rooms/categories: Corridor, Hall, ProfessorOffice, StudentOffice, TechnicalRoom,
Toilet, Secretary, VisioConference, Warehouse, and ElevatorArea. An example of
each category can be seen in Figure 1. The training dataset also includes 8 objects:
Extinguisher, Phone, Chair, Printer, Urinal, Bookshelf, Trash, and Fridge. Examples
of images containing each of the objects can be seen in Figure 2. The number of
images for each room/category and object is summarized in Table 1.</p>
      <p>[Figure 1: example images of the room categories. Recoverable panel labels: (a) Corridor, (b) Hall, (c) Professor Office, (f) Toilet, (g) Secretary, (h) Visio Conference, (i) Warehouse, (j) Elevator Area.]</p>
      <p>To test the proposed system, we used the validation dataset, composed
of 1500 images. The results for different values of the number of clusters k and of the number of images M
used to obtain the vocabulary can be seen in Table 2. The final size of the
descriptor is given by $k \sum_{l=0}^{L} 2^{2l}$. For L = 2 and k = 300, the final size is 6300.
Despite the large descriptor size, the system takes on average 0.9545 seconds
to process an image. For each image, our system provided the room category
and the presence of objects. The scores shown in the table are the sum of all
the scores obtained for the images. The rules shown in Table 3 are used when
calculating the final score for an image.</p>
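      <p>The descriptor size quoted above can be checked with a short calculation. The function name is hypothetical; the formula is the one given in the text, with 2^(2l) cells at level l and k histogram bins per cell.</p>

```python
def pyramid_descriptor_size(k, L):
    """Length of the concatenated pyramid histogram:
    k bins for each of the 2**(2*l) cells at levels l = 0..L."""
    return k * sum(2 ** (2 * l) for l in range(L + 1))


print(pyramid_descriptor_size(300, 2))  # 6300 = 300 * (1 + 4 + 16)
print(pyramid_descriptor_size(400, 2))  # 8400
```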
      <p>Finally, Table 4 shows the results for all participants in the Robot Vision
challenge. Three groups submitted solutions to this challenge, and the
baseline indicates the results obtained by the script provided with the dataset (Color &amp; Depth
histogram + SVM). For the competition, we submitted four runs with different
values for the parameters k and M (see Section 2.1). Our system ranked second,
achieving 1738.75 points on this task for k = 400 and M = 150.</p>
      <p>This paper described the participation of our group in the Robot Vision challenge.
In this challenge, the proposed system ranked second among four systems.
Thus, the image recognition system has shown promising results, particularly for
object recognition.</p>
      <p>As future work, we intend to extend the approach to color images and to use
other classifiers, such as the Support Vector Machine, which is known to provide
better results than k-nearest neighbors. In addition, the system will be applied
to the depth images.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments.</title>
      <p>RCG and LCR were supported by the CNPq and PET-Fronteira, respectively.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ulrich</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nourbakhsh</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Appearance-based place recognition for topological localization</article-title>
          .
          In:
          <source>Robotics and Automation, 2000. Proceedings. ICRA '00. IEEE International Conference on</source>
          . Volume
          <volume>2</volume>
          . (
          <year>2000</year>
          )
          <fpage>1023</fpage>
          –
          <lpage>1029</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Martinez-Gomez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Varea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cazorla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2013 Robot Vision Task</article-title>
          . In:
          <source>Working Notes, CLEF 2013</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Martinez-Gomez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patricia</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Üsküdarlı</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cazorla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Varea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morell</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>ImageCLEF 2014: Overview and analysis of the results</article-title>
          .
          <source>In: CLEF proceedings. Lecture Notes in Computer Science</source>
          . Springer Berlin Heidelberg (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Martinez-Gomez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cazorla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Varea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morell</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2014 Robot Vision Task</article-title>
          . In:
          <source>CLEF 2014 Evaluation Labs and Workshop, Online Working Notes</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Csurka</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dance</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willamowski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Visual categorization with bags of keypoints</article-title>
          . In:
          <source>Workshop on Statistical Learning in Computer Vision, ECCV</source>
          . (
          <year>2004</year>
          )
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sivic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Video Google: A text retrieval approach to object matching in videos</article-title>
          .
          <source>In: Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2. ICCV '03</source>
          , Washington, DC, USA, IEEE Computer Society (
          <year>2003</year>
          )
          <fpage>1470</fpage>
          –
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ) (
          <year>2004</year>
          )
          <fpage>91</fpage>
          –
          <lpage>110</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Bay</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ess</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuytelaars</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Speeded-up robust features (SURF)</article-title>
          .
          <source>Comput. Vis. Image Underst.</source>
          <volume>110</volume>
          (
          <issue>3</issue>
          ) (
          <year>2008</year>
          )
          <fpage>346</fpage>
          –
          <lpage>359</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories</article-title>
          .
          <source>In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR '06</source>
          , Washington, DC, USA, IEEE Computer Society (
          <year>2006</year>
          )
          <fpage>2169</fpage>
          –
          <lpage>2178</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>