<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2009 Robot Vision Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barbara Caputo</string-name>
          <email>bcaputo@idiap.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrzej Pronobis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patric Jensfelt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Autonomous Systems, Royal Institute of Technology</institution>
          ,
          <addr-line>Stockholm</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The robot vision task was proposed to the ImageCLEF participants for the first time in 2009. The task attracted considerable attention, with 19 registered research groups, 7 groups eventually participating, and a total of 27 submitted runs. The task addressed the problem of visual place recognition applied to robot topological localization. Specifically, participants were asked to classify rooms on the basis of image sequences captured by a perspective camera mounted on a mobile robot. The sequences were acquired in an office environment, under varying illumination conditions and across a time span of almost two years. The training and validation set consisted of a subset of the IDOL2 database. The test set consisted of sequences similar to those in the training and validation set, but acquired 20 months later and imaging also additional rooms. Participants were asked to build a system able to answer the question "where are you?" (I am in the kitchen, in the corridor, etc.) when presented with a test sequence imaging rooms seen during training, or additional rooms that were not imaged in the training sequence. The system had to assign each test image to one of the rooms present in the training sequence, or indicate that the image came from a new room. We asked all participants to solve the problem separately for each test image (obligatory task). Additionally, results could also be reported for algorithms exploiting the temporal continuity of the image sequences (optional task). Of the 27 runs, 21 were submitted to the obligatory task and 6 to the optional task. The best result in the obligatory task was obtained by the Multimedia Information Retrieval Group of the University of Glasgow, UK, with an approach based on local feature matching.
The best result in the optional task was obtained by the Intelligent Systems and Data Mining Group (SIMD) of the University of Castilla-La Mancha, Albacete, Spain, with an approach based on local features and a particle filter.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        ImageCLEF [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">1, 2, 5</xref>
        ] started in 2003 as part of the Cross Language Evaluation Forum (CLEF,
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Its main goal has been to promote research on multi-modal data annotation and information
retrieval in various application fields. As such, it has always contained visual, textual and other
modalities, mixed tasks and several sub-tracks.
      </p>
      <p>This year, for the first time, ImageCLEF hosted a Robot Vision task. This paper reports
on it, while other papers describe the other five tasks of ImageCLEF 2009. More information on
the tasks and on how to participate in CLEF can also be found on the ImageCLEF web pages
(http://www.imageclef.org/).</p>
    </sec>
    <sec id="sec-2">
      <title>Participation</title>
      <p>In 2009, a new record of 85 research groups registered for the seven sub-tasks of ImageCLEF. Of
these 85, 19 registered for the Robot Vision task. 7 of the registered groups submitted at least one
run:</p>
      <p>Multimedia Information Retrieval Group, University of Glasgow, United Kingdom;
Idiap Research Institute, Martigny, Switzerland;
Faculty of Computer Science, The Alexandru Ioan Cuza University (UAIC), Iasi, Romania;
Computer Vision &amp; Image Understanding Department (CVIU), Institute for Infocomm Research, Singapore;
Laboratoire des Sciences de l'Information et des Systemes (LSIS), France;
Intelligent Systems and Data Mining Group (SIMD), University of Castilla-La Mancha, Albacete, Spain;
Multimedia Information Modeling and Retrieval Group (MRIM), Laboratoire d'Informatique de Grenoble, France.</p>
      <p>A total of 27 runs were submitted, with 21 runs submitted to the obligatory task and 6 runs
submitted to the optional task. In order to encourage participation, there was no limit on the
number of runs that each group could submit.</p>
    </sec>
    <sec id="sec-3">
      <title>Data Sets, Tasks, Ground Truthing</title>
      <p>This section describes the details concerning the setup of the robot vision task. Section 3.1
describes the dataset used. Section 3.2 gives details on the tasks proposed to the participants.
Finally, section 3.3 briefly describes the algorithm used for obtaining the ground truth and the
obtained results.</p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>
          The training and validation set consisted of a subset of the publicly available IDOL2 database [
          <xref ref-type="bibr" rid="ref3 ref4">3,
4</xref>
          ]. An additional, previously unreleased image sequence was used for testing. The part of the
IDOL2 database used for training and validation comprises 12 image sequences acquired using a
MobileRobots PowerBot robot platform. The image sequences are accompanied by laser range
data and odometry data; however, use of that data was not permitted in the competition.
        </p>
        <p>The image sequences in the IDOL2 database were captured with a Canon VC-C4 perspective
camera using a resolution of 320x240 pixels. The acquisition was performed in a five-room
subsection of a larger office environment, selected in such a way that each of the five rooms
represented a different functional area: a one-person office, a two-persons office, a kitchen, a corridor,
and a printer area. The appearance of the rooms was captured under three different illumination
conditions: in cloudy weather, in sunny weather, and at night. The robot was manually driven
through each of the five rooms while continuously acquiring images and laser range scans at a rate
of 5 fps. Each data sample was then labelled as belonging to one of the rooms according to the
position of the robot during acquisition (rather than the contents of the images). Examples of images
showing the interiors of the rooms, and the variations observed over time, caused by activity in the
environment as well as introduced by changing illumination, are presented in Figure 1.</p>
        <p>The IDOL2 database was designed to test the robustness of place recognition algorithms to
variations that occur over a long period of time. Therefore, the acquisition process was conducted
in two phases. Two sequences were acquired for each type of illumination condition over a
time span of more than two weeks, and another two sequences for each setting were recorded 6
months later (12 sequences in total). Thus, the sequences captured variability introduced not
only by illumination but also by natural activities in the environment (presence/absence of people,
furniture/objects relocated, etc.).</p>
        <p>The test sequences were acquired in the same environment, using the same camera setup. The
acquisition was performed 20 months after the acquisition of the IDOL2 database. The sequences
contain additional rooms that were not imaged in the IDOL2 database.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The Task</title>
        <p>The robot vision task addressed the problem of visual place recognition applied to topological
localization of a mobile robot. Specifically, participants were asked to determine the topological
location of a robot based on images acquired with a perspective camera mounted on a robot
platform.</p>
        <p>Participants were given training data consisting of an image sequence. The training sequence
was recorded using a mobile robot that was manually driven through several rooms of a typical
indoor office environment. The acquisition was performed under fixed illumination conditions and
at a given time. Each image in the training sequence was labeled and assigned to the room in
which it was acquired.</p>
        <p>The challenge was to build a system able to answer the question 'where are you?' (I'm in the
kitchen, in the corridor, etc.) when presented with a test sequence containing images acquired in
the previously observed part of the environment or in additional rooms that were not imaged in the
training sequence. The test images were acquired 6 to 20 months after the training sequence,
possibly under different illumination settings. The system had to assign each test image to one
of the rooms that were present in the training sequence or indicate that the image came from a
room that was not included during training. Moreover, the system could refrain from making a
decision (e.g. in the case of lack of confidence).</p>
        <p>The algorithm had to be able to provide information about the location of the robot separately
for each test image (e.g. when only some of the images from the test sequences are available or
the sequences are scrambled). This corresponds to the problem of global topological localization.
We called this the obligatory task. However, results could also be reported for the case when the
algorithm was allowed to exploit the continuity of the sequences and rely on the test images acquired
before the classified image. We called this the optional task.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Ground Truth</title>
        <p>The image sequences used in the competition were annotated with ground truth. The annotations
of the training and validation sequences were available to the participants, while the ground truth
for the test sequence was released after the results were announced. Each image in the sequences
was labelled according to the position of the robot during acquisition as belonging to one of the
rooms used for training or as an unknown room. The ground truth was then used to calculate a
score indicating the performance of an algorithm on the test sequence. The following rules were
used when calculating the overall score for the whole test sequence:</p>
        <p>1 point was given for each correctly classified image. Correct detection of an unknown room was regarded as correct classification.
0.5 points were subtracted for each misclassified image.
No points were given or subtracted if an image was not classified (the algorithm refrained from the decision).</p>
        <p>A script was available to the participants that automatically calculated the score for a specified
test sequence given the classification results produced by an algorithm.</p>
        <p>[Figure 1: example images showing (a) variations introduced by illumination, (b) variations observed over time, and (c) the remaining rooms (at night), for the corridor, one-person office, two-persons office, kitchen, and printer area.]</p>
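        <p>The scoring rules above can be sketched as a short function. This is a minimal illustration, not the official evaluation script distributed to the participants; the encoding of an abstained decision as None and of novel rooms as a shared "unknown" label are our assumptions:</p>

```python
def score_run(predictions, ground_truth):
    """Score a run under the rules above: +1 for each correctly
    classified image (including a correctly detected unknown room),
    -0.5 for each misclassified image, and 0 when the algorithm
    refrained from a decision (encoded here as None)."""
    score = 0.0
    for predicted, actual in zip(predictions, ground_truth):
        if predicted is None:      # no decision: no points given or subtracted
            continue
        if predicted == actual:    # correct room, or correct "unknown" detection
            score += 1.0
        else:                      # misclassified image
            score -= 0.5
    return score

# Three correct answers (one an unknown room), one error, one abstention:
print(score_run(["kitchen", "corridor", "unknown", "kitchen", None],
                ["kitchen", "corridor", "unknown", "printer", "corridor"]))
# -> 2.5
```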
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>This section describes the results of the robot vision task at ImageCLEF 2009. Table 1(a) shows
the results for the obligatory task, while Table 1(b) shows the results for the optional task.</p>
      <p>We see that the majority of runs were submitted to the obligatory task: of the 27 total
submissions, 21 were submitted to the obligatory task and only 6 to the optional task. A possible
explanation is that the optional task requires a higher expertise in robotics than the obligatory
task, which therefore represents a very good entry point.</p>
      <p>The submissions used a wide range of techniques, spanning from local descriptors combined
with statistical methods to approaches transplanted from the language modeling community. It
is interesting to note, though, that the two groups that ranked first in the two sub-tasks both used an
approach based on local features. This confirms a consolidated trend in the robot vision community
that treats local descriptors as the off-the-shelf feature of choice for visual recognition.</p>
      <p>The first robot vision task at ImageCLEF 2009 attracted considerable attention and proved
an interesting complement to the existing tasks. The approaches presented by the participating
groups were diverse and original, offering a fresh take on the topological localization problem. We
plan to continue the task in the coming years, adding laser and odometry information to the visual
information, and proposing new challenges to prospective participants.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. B. Caputo
was supported by the EMMA project, funded by the Hasler foundation. A. Pronobis and P.
Jensfelt were supported by the EU FP7 project CogX ICT-215181. The support is gratefully
acknowledged.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Paul Clough, Henning Müller, Thomas Deselaers, Michael Grubinger, Thomas M. Lehmann, Jeffery Jensen, and William Hersh. The CLEF 2005 cross-language image retrieval track. In Cross Language Evaluation Forum (CLEF 2005), Springer Lecture Notes in Computer Science, pages 535–557, September 2006.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Paul Clough, Henning Müller, and Mark Sanderson. The CLEF cross-language image retrieval track (ImageCLEF) 2004. In Carol Peters, Paul Clough, Julio Gonzalo, Gareth J. F. Jones, Michael Kluck, and Bernardo Magnini, editors, Multilingual Information Access for Text, Speech and Images: Results of the fifth CLEF evaluation campaign, volume 3491 of Lecture Notes in Computer Science (LNCS), pages 597–613, Bath, UK, 2005. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt. The KTH-IDOL2 database. Technical Report CVAP304, Kungliga Tekniska Högskolan, CVAP/CAS, October 2006. Available at: http://www.cas.kth.se/IDOL/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt. Incremental learning for place recognition in dynamic environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'07), San Diego, CA, USA, October 2007.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Henning Müller, Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer, Thomas M. Deserno, Paul Clough, and William Hersh. Overview of the ImageCLEFmed 2007 medical retrieval and annotation tasks. In CLEF 2007 Proceedings, volume 5152 of Lecture Notes in Computer Science (LNCS), pages 473–491, Budapest, Hungary, 2008. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Jacques Savoy. Report on CLEF-2001 experiments. In Report on the CLEF Conference 2001 (Cross Language Evaluation Forum), pages 27–43, Darmstadt, Germany, 2002. Springer LNCS 2406.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>