<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2009 medical image annotation track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tatiana Tommasi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Caputo</string-name>
          <email>bcaputo@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petra Welter</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Oliver Güld</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas M. Deserno</string-name>
          <email>tdeserno@mi.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RWTH Aachen University, Dept. of Medical Informatics</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <volume>95</volume>
      <issue>9</issue>
      <abstract>
        <p>This paper describes the last round of the medical image annotation task in ImageCLEF 2009. After four years, we defined the task as a survey of all the past experience. Seven groups participated in the challenge, submitting 19 runs. They were asked to train their algorithms on 12,677 images, labeled according to four different settings representing the yearly annotation tasks, and to classify 1,733 images in the four annotation frameworks. The aim is to understand how each strategy responds to the increasing number of classes and to class imbalance. A plain classification scheme using support vector machines and local descriptors outperformed the other methods.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
      </kwd-group>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In 2005, the medical image annotation task was introduced in the ImageCLEF challenge. Its
main contribution was to provide a resource for benchmarking content-based image classification
systems focusing on medical images. Hospitals collect hundreds of images every day, and
automatic image annotation can be an important step when searching for images in huge databases
[
        <xref ref-type="bibr" rid="ref4 ref8">8, 4</xref>
        ]. Automatic techniques able to identify acquisition modality, body orientation, body region,
and biological system examined could be used for multilingual image annotation as well as for
DICOM header correction in routine medical image acquisition.
      </p>
      <p>
        Over the last four years, the medical annotation task evolved in terms of the number of images,
the number of classes, and the class structure provided. It started as a plain 60-class problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], grew to
a 120-class problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and became a complex hierarchical classification task in 2007 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In 2008, class
imbalance was added to foster the use of prior knowledge encoded in the hierarchy of classes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>This year marks the fifth anniversary of the medical image annotation task, and we decided to
organize its concluding round as a survey of the experience of the past years. The idea is to compare
the scalability of different image classification techniques as the number of classes grows, their
hierarchical structure becomes deeper, and sparsely populated classes appear.</p>
    </sec>
    <sec id="sec-2">
      <title>Database and Task Description</title>
      <p>As in the past challenge editions, the annotation task was defined on the basis of the IRMA
project. This year, a database of 12,677 fully classified radiographs, taken randomly from medical
routine, was made available as the training set. Images are labelled according to four classification
label sets:
        <list list-type="bullet">
          <list-item><p>57 classes as in 2005 (12,631 images) plus a "clutter" class C (46 images);</p></list-item>
          <list-item><p>116 classes as in 2006 (12,334 images) plus a "clutter" class C (343 images);</p></list-item>
          <list-item><p>116 IRMA codes as in 2007 (12,334 images) plus a "clutter" class C (343 images);</p></list-item>
          <list-item><p>193 IRMA codes as in 2008 (12,677 images).</p></list-item>
        </list>
      </p>
      <p>For the first two label settings, images are associated with simple raw numbers, while in the last
two label settings images are identified by their complete IRMA code (see Section 3). The labels 1-57
used for the first setting are derived through a high-level identification of images
in terms of the IRMA code. With a more detailed image annotation and the introduction of some
new classes, the number of classes increases to 116 and then to 193. The "clutter" class for a specific setting
contains all the images belonging to new classes, or images described with a higher level of detail
in the final 2008 setting.</p>
      <p>The test data consisted of 1,733 images. Not all the training classes have examples in this set:
        <list list-type="bullet">
          <list-item><p>2005 labelling: 55 classes (out of 57) with 1,639 images, plus class C with 94 images;</p></list-item>
          <list-item><p>2006 labelling: 109 classes (out of 116) with 1,353 images, plus class C with 380 images;</p></list-item>
          <list-item><p>2007 labelling: 109 IRMA codes (out of 116) with 1,353 images, plus class C with 380 images;</p></list-item>
          <list-item><p>2008 labelling: 169 IRMA codes (out of 193) with 1,733 images.</p></list-item>
        </list>
      </p>
      <p>Note the distribution of the images over the classes of the training set: for 2005, 2006 and
2007, all classes have more than six images, while in 2008 there are classes with only one to five images.
Concerning the 2008 labels, about 20% of the test images belong to classes that are poorly represented
(fewer than ten images) in the training data.</p>
      <p>Participants in the medical annotation task were asked to classify the test images according to
all four label settings. Each group was allowed to submit several runs, but each run had to
be based on a single algorithm, optimized to face the four different classification
problems. The aim is to understand how each algorithm responds to the increasing number of
classes and to class imbalance. The classification results are evaluated per year, and the error
scores are summed to obtain a single final score for ranking the submitted runs.</p>
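      <p>As a minimal sketch of this ranking rule, the following Python snippet sums the per-year error scores of each run and sorts the runs by the total, lowest first. The run names and score values are hypothetical placeholders and not results of the challenge; the function name rank_runs is likewise an assumption made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

# The four label settings a run must cover.
YEARS = ("2005", "2006", "2007", "2008")


def rank_runs(scores: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (run, total) pairs sorted by the summed error score, best first."""
    totals = []
    for run, per_year in scores.items():
        missing = [year for year in YEARS if year not in per_year]
        if missing:
            # A run is only comparable if it was scored in all four settings.
            raise ValueError(f"run {run!r} lacks scores for {missing}")
        totals.append((run, sum(per_year[year] for year in YEARS)))
    return sorted(totals, key=lambda item: item[1])


# Hypothetical per-year error scores for two made-up runs.
example = {
    "run_a": {"2005": 10.0, "2006": 8.0, "2007": 3.5, "2008": 6.0},
    "run_b": {"2005": 9.0, "2006": 9.5, "2007": 2.0, "2008": 5.0},
}
ranking = rank_runs(example)
print(ranking)
```
]]></preformat>
      <p>In this made-up example, run_a has the lower error in 2006, yet run_b ranks first overall because only the sum over the four settings determines the final order.</p>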
    </sec>
    <sec id="sec-3">
      <title>IRMA Code</title>
      <p>
        Standardized nomenclatures for medical imaging are generally coarsely structured, ambiguous, and
often use optional tags. For content-based image retrieval (CBIR) and
annotation in the medical field, a detailed, unambiguous coding scheme is required. Valid relations
between code and sub-code elements can be "is-a" and "part-of", defining a strict hierarchical
order. Causality is also important for grouping of processing strategies. Therefore, a
mono-hierarchical scheme is required, where each sub-code element is connected to only one code element.
Since categorization of medical images must cover all aspects influencing the image content and
structure, a multi-axial scheme is needed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
        <fn><p>IRMA project: http://ganymed.imib.rwth-aachen.de/irma/index_en.php</p></fn>
      </p>
      <p>The IRMA code strictly relies on these rules. It is composed of four axes having three to
four positions, each in {0, ..., 9, a, ..., z}, where "0" denotes "unspecified" and marks the
end of a path along an axis:
        <list list-type="bullet">
          <list-item><p>the technical code (T) describes the image modality;</p></list-item>
          <list-item><p>the directional code (D) models body orientations;</p></list-item>
          <list-item><p>the anatomical code (A) refers to the body region examined;</p></list-item>
          <list-item><p>the biological code (B) describes the biological system examined.</p></list-item>
        </list>
      </p>
      <p>This results in a string of 13 characters (IRMA: TTTT-DDD-AAA-BBB). A small exemplary
excerpt from the anatomy axis of the IRMA code is given in Table 1. The IRMA code can
easily be extended by introducing characters in a certain code position, e.g., if new image modalities
are introduced. Based on the hierarchy, the more code positions differ from "0", the more detailed
the description is.</p>
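      <p>To illustrate the layout of the code string, the following Python sketch splits an IRMA code into its four axes and counts how many leading positions of an axis are specified. The regular expression, the function names and the example string are assumptions made for illustration; they are derived only from the TTTT-DDD-AAA-BBB pattern and the alphabet described above, and are not part of the IRMA project software.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Dict

# Assumed pattern: four axes of 4, 3, 3 and 3 positions over {0-9, a-z},
# separated by hyphens, as in TTTT-DDD-AAA-BBB.
IRMA_PATTERN = re.compile(
    r"^([0-9a-z]{4})-([0-9a-z]{3})-([0-9a-z]{3})-([0-9a-z]{3})$"
)
AXES = ("technical", "directional", "anatomical", "biological")


def parse_irma(code: str) -> Dict[str, str]:
    """Split an IRMA code string into its four axis codes."""
    match = IRMA_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a TTTT-DDD-AAA-BBB code: {code!r}")
    return dict(zip(AXES, match.groups()))


def specified_depth(axis_code: str) -> int:
    """Count the leading positions that differ from '0' (the end-of-path mark)."""
    depth = 0
    for symbol in axis_code:
        if symbol == "0":
            break
        depth += 1
    return depth


# Illustrative code string, chosen only to show the format.
parsed = parse_irma("1121-127-700-500")
depths = {axis: specified_depth(value) for axis, value in parsed.items()}
print(parsed)
print(depths)
```
]]></preformat>
      <p>The depth count reflects the statement that a description is more detailed the more positions differ from "0": in the illustrative string, the anatomical axis "700" is specified to depth one only.</p>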
    </sec>
    <sec id="sec-4">
      <title>Error Evaluation</title>
      <p>We now define the error score for the medical image annotation challenge. On the basis of the
image labeling, we defined two different evaluation strategies.</p>
      <p>2005 and 2006. For these two years, the error is evaluated only on the capability of the algorithm
to make the correct decision. There is also the possibility to say "don't know", which is encoded
by "*". An example is given in Table 2.</p>
      <p>2007 and 2008. For these two years, the error is evaluated on the basis of the hierarchical IRMA
code.</p>
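      <p>Let an image be coded by the technical, directional, anatomical and biological axes. These
axes are independent, and therefore the errors for each axis are simply summed up:
let <inline-formula><tex-math>l_1^I = l_1, l_2, \ldots, l_i, \ldots, l_I</tex-math></inline-formula> be the correct code (for one axis) of an image;
let <inline-formula><tex-math>\hat{l}_1^I = \hat{l}_1, \hat{l}_2, \ldots, \hat{l}_i, \ldots, \hat{l}_I</tex-math></inline-formula> be the classified code (for one axis) of an image;
where <inline-formula><tex-math>l_i</tex-math></inline-formula> is specified precisely for every position, and <inline-formula><tex-math>\hat{l}_i</tex-math></inline-formula> is allowed to be "don't know", which is
encoded by "*". Note that <inline-formula><tex-math>I</tex-math></inline-formula> (the depth of the tree to which the classification is specified) may be
different for different axes and different images.</p>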
      <p>Given an incorrect classification at position <inline-formula><tex-math>\hat{l}_i</tex-math></inline-formula>, we consider all succeeding decisions to be wrong,
and given a not specified position, we consider all succeeding decisions to be not specified.
Furthermore, we do not count any error if the correct code is unspecified and the predicted code is a
wildcard. In that case, we consider all remaining positions to be not specified.</p>
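      <p>We want to penalize wrong decisions that are easy (few possible choices at that node) more than
wrong decisions that are difficult (many possible choices at that node). A decision at
position <inline-formula><tex-math>l_i</tex-math></inline-formula> is correct by chance with a probability of <inline-formula><tex-math>1/b_i</tex-math></inline-formula>, where <inline-formula><tex-math>b_i</tex-math></inline-formula> is the number of possible labels for
position <inline-formula><tex-math>i</tex-math></inline-formula>. This assumes equal priors for each class at each position.</p>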
      <p>Furthermore, we want to penalize wrong decisions at an early stage in the code (higher up in
the hierarchy) more than wrong decisions at a later stage in the code (lower down in the hierarchy), i.e.
<inline-formula><tex-math>l_i</tex-math></inline-formula> is more important than <inline-formula><tex-math>l_{i+1}</tex-math></inline-formula>.</p>
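      <p>Putting together:
        <disp-formula><label>(1)</label><tex-math><![CDATA[\sum_{i=1}^{I} \underbrace{\frac{1}{b_i}}_{(a)} \, \underbrace{\frac{1}{i}}_{(b)} \, \underbrace{\delta(l_i, \hat{l}_i)}_{(c)}]]></tex-math></disp-formula>
        with
        <disp-formula><label>(2)</label><tex-math><![CDATA[\delta(l_i, \hat{l}_i) = \begin{cases} 0 & \text{if } l_j = \hat{l}_j \quad \forall j \le i \\ 0.5 & \text{if } \hat{l}_j = * \quad \exists j \le i \\ 1 & \text{if } l_j \ne \hat{l}_j \quad \exists j \le i \end{cases}]]></tex-math></disp-formula>
        where the parts of the equation:
(a) accounts for the difficulty of the decision at position i (branching factor);
(b) accounts for the level in the hierarchy (position in the string);
(c) correct/not specified/wrong, respectively.</p>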
      <p>In addition, for every axis, the maximal possible error is calculated and the errors are normalized
such that a completely wrong decision (i.e. all positions for that axis wrong) gets an error count
of 0.25 and a completely correctly predicted axis has an error of 0.00. Thus, an image where all
positions in all axes are wrong has an error count of 1.00, and an image where all positions in all
axes are correct has an error count of 0.00. Finally, setting a wildcard "*" instead of a "0" is not
considered a mistake (see Table 3).</p>
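      <p>The scoring rules for one axis can be sketched in Python as follows. This is an illustrative reading of the text, not the official evaluation script. Several points are assumptions: the branching factors are hypothetical values supplied by the caller (one per position along the correct path), the first deviation (wrong or wildcard) fixes the state of all later positions, and a wildcard predicted where the correct code is "0" ends the scoring of that axis without cost.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence

# Cost of a position depending on the state reached so far (part (c)).
DELTA = {"correct": 0.0, "unspecified": 0.5, "wrong": 1.0}


def raw_axis_error(true_code: str, predicted: str, branching: Sequence[int]) -> float:
    """Unnormalized error for one axis: sum of delta / (b_i * i)."""
    if not (len(true_code) == len(predicted) == len(branching)):
        raise ValueError("codes and branching factors must have equal length")
    state = "correct"  # switches once, at the first wildcard or wrong symbol
    total = 0.0
    for i, (t, p, b) in enumerate(zip(true_code, predicted, branching), start=1):
        if state == "correct":
            if p == "*":
                if t == "0":
                    # Assumption: wildcard instead of "0" is no mistake and
                    # the remaining positions carry no cost.
                    break
                state = "unspecified"
            elif p != t:
                state = "wrong"
        total += DELTA[state] / (b * i)  # parts (a) and (b) weight part (c)
    return total


def axis_error(true_code: str, predicted: str, branching: Sequence[int]) -> float:
    """Error normalized so that a completely wrong axis scores 0.25."""
    max_error = sum(1.0 / (b * i) for i, b in enumerate(branching, start=1))
    return 0.25 * raw_axis_error(true_code, predicted, branching) / max_error


# Hypothetical branching factors for a four-position axis.
b = [3, 4, 9, 5]
for guess in ("318a", "318*", "3187", "31**", "4000"):
    print(guess, round(axis_error("318a", guess, b), 6))
```
]]></preformat>
      <p>Under these assumptions, the error of an image is the sum of axis_error over its four axes, so it lies between 0.00 and 1.00. The loop also shows the intended ordering: a wildcard at the last position costs half as much as a wrong symbol there, and an error at the first position yields the full 0.25 for the axis.</p>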
      <p>Clutter in 2005, 2006 and 2007. For these three years, we introduced a class called "clutter"
(C). Even though the test set contains images belonging to this class, their classification does NOT
influence the error score for the challenge (see Table 4).</p>
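      <p>An example of the complete labeling of the released database is given in Table 5.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Classified (2005-06)</th><th>Error count</th></tr>
          </thead>
          <tbody>
            <tr><td>18</td><td>0.0</td></tr>
            <tr><td>21</td><td>0.0</td></tr>
            <tr><td>*</td><td>0.0</td></tr>
            <tr><td>C</td><td>0.0</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Classified (2007)</th><th>Error count</th></tr>
          </thead>
          <tbody>
            <tr><td>111</td><td>0.000000</td></tr>
            <tr><td>11*</td><td>0.000000</td></tr>
            <tr><td>1**</td><td>0.000000</td></tr>
            <tr><td>***</td><td>0.000000</td></tr>
            <tr><td>*C*</td><td>0.000000</td></tr>
          </tbody>
        </table>
      </table-wrap>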
    </sec>
    <sec id="sec-5">
      <title>Participation</title>
      <p>In 2009, seven groups from five nations on two continents participated in the medical
annotation task, and 19 runs were submitted in total. In the following, we briefly describe the
methods applied by the participating groups.</p>
      <p>TAUbiomed. The Medical Image Processing Lab from Tel Aviv University in Israel submitted
one run using a multiple-resolution, patch-based bag-of-visual-words approach. Classification
is performed with support vector machines. The code hierarchy is completely neglected
and no wildcards "*" were used.</p>
      <p>Idiap. The Idiap Research Institute from Switzerland submitted four runs re-proposing the same
strategies used in 2008. They consisted of different support vector machine classification schemes
that couple two different image descriptors.</p>
      <p>FEITIJS. The Faculty of Electrical Engineering and Information Technologies from the
University of Skopje in Macedonia submitted one run. It is based on global and local image
descriptors, which are classified using bagging and random forests.</p>
      <p>VPA. The Computer Vision and Pattern Analysis Laboratory from Sabanci University in Turkey
submitted five runs. They used local binary patterns as features and a support vector machine
as classifier. They adopted a hierarchical approach considering, when applicable, the four
IRMA code axes separately.</p>
      <p>medGIFT. The medGIFT group from the University Hospitals of Geneva in Switzerland submitted
three runs using different descriptors and voting schemes in the medGIFT image retrieval
system.</p>
      <p>DEU. The Dokuz Eylul University in Turkey submitted four runs. Different global
and local features are extracted from the images, and classification is performed with a k-nearest
neighbor algorithm.</p>
      <p>IRMA. As in the previous years, the Image Retrieval in Medical Applications (IRMA) group
at RWTH Aachen University, Germany, provided a baseline run. It was defined using
Tamura texture measures, cross-correlation features, and the image distortion model.
The parametrization was unchanged over all the years to provide a general reference.
Therefore, the IRMA code hierarchy is disregarded.</p>
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
      <p>The results of the challenge evaluation are given in Table 6, sorted by the sum of the error scores over the four
yearly label settings. Considering the error score per year, the group ranking does not change, except
that the Idiap and TAU groups exchange the first and second rank positions in 2006.</p>
      <p>
        In general, the results show that the top-performing runs do not consider the
hierarchical structure of the given task (2007 and 2008 labels), but rather use each individual code
as one class and train a plain classifier. To assess the semantics captured in the code hierarchy,
local rather than global image features are required to narrow the semantic gap [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In addition, the
local features should be associated with segmented image objects rather than square patches.
      </p>
      <p>Comparing the 2005 and 2006 results, we see that there is a general decrease in the error
score. A possible explanation is that in 2005 the 57 classes are wide, each one containing different
sub-levels in terms of IRMA codes. This makes them difficult for a classifier to model in the
training phase. On the other hand, comparing the 2007 and 2008 results, there is a general increase
in the error score. This effect was expected: new classes with the same level of detail with respect
to the IRMA code are added when passing from 2007 to 2008. Moreover, some of the new classes are
poorly populated in the training set.</p>
      <p>As a final remark, we notice that methods using patch-based local image descriptors and
discriminative SVM classification outperform the other approaches.</p>
      <p>We have presented the ImageCLEF 2009 medical image annotation task. This is its concluding
round, and we organized it as a survey of the experience of the last four years. We wanted to compare
the scalability of different image classification techniques as the number of classes grows, their
hierarchical structure becomes deeper, and sparsely populated classes appear. A plain classification scheme
using support vector machines and local descriptors outperformed the other methods. The obtained
scores range from 852.8 for the best run, through 1994.84 for the baseline, to 3979.8 for the worst run.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. The
authors T. Tommasi and B. Caputo were supported by the EMMA project thanks to the Hasler
Foundation (www.haslerstiftung.ch). The IRMA project has been funded by the German Research
Foundation, DFG, grants Le 1108/4, Le 1108/6, and De 1563/9.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Thomas</given-names> <surname>Deselaers</surname></string-name> and
          <string-name><given-names>Thomas M.</given-names> <surname>Deserno</surname></string-name>.
          <article-title>Medical image annotation in ImageCLEF 2008</article-title>.
          In Carol Peters, Danilo Giampiccolo, Nicola Ferro, Vivien Petras, Julio Gonzalo, Anselmo Peñas, Thomas Deselaers, Thomas Mandl, Gareth Jones, and Mikko Kurimo, editors,
          <source>Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science</source>,
          Aarhus, Denmark, September <year>2009</year>, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Thomas</given-names> <surname>Deselaers</surname></string-name>,
          Henning Müller, Paul Clough, Hermann Ney, and
          <string-name><given-names>Thomas M.</given-names> <surname>Lehmann</surname></string-name>.
          <article-title>The CLEF 2005 automatic medical image annotation task</article-title>.
          <source>International Journal of Computer Vision</source>,
          <volume>74</volume>(<issue>1</issue>):<fpage>51</fpage>-<lpage>58</lpage>,
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Thomas M</given-names> <surname>Deserno</surname></string-name>,
          <string-name><given-names>Sameer</given-names> <surname>Antani</surname></string-name>, and
          <string-name><given-names>L Rodney</given-names> <surname>Long</surname></string-name>.
          <article-title>Ontology of gaps in content-based image retrieval</article-title>.
          <source>J Digit Imaging</source>,
          <volume>22</volume>(<issue>2</issue>):<fpage>202</fpage>-<lpage>15</lpage>,
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Thomas M</given-names> <surname>Lehmann</surname></string-name>,
          Mark Oliver Güld, Thomas Deselaers, Daniel Keysers, Henning Schubert, Klaus Spitzer, Hermann Ney, and Berthold B Wein.
          <article-title>Automatic categorization of medical images for content-based retrieval and data mining</article-title>.
          <source>Comput Med Imaging Graph</source>,
          <volume>29</volume>(<issue>2</issue>):<fpage>143</fpage>-<lpage>55</lpage>,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Thomas M.</given-names> <surname>Lehmann</surname></string-name>,
          Henning Schubert, Daniel Keysers, Michael Kohnen, and
          <string-name><given-names>Berthold B.</given-names> <surname>Wein</surname></string-name>.
          <article-title>The IRMA code for unique classification of medical images</article-title>.
          In H. K. Huang and O. M. Ratib, editors,
          <source>Medical Imaging 2003: PACS and Integrated Medical Information Systems: Design and Evaluation</source>,
          volume <volume>5033</volume> of <series>SPIE Proceedings</series>,
          pages <fpage>440</fpage>-<lpage>451</lpage>, San Diego, California, USA, May
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Henning</given-names> <surname>Müller</surname></string-name>,
          Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer,
          <string-name><given-names>Thomas M.</given-names> <surname>Deserno</surname></string-name>,
          Paul Clough, and
          <string-name><given-names>William</given-names> <surname>Hersh</surname></string-name>.
          <article-title>Overview of the ImageCLEFmed 2007 medical retrieval and annotation tasks</article-title>.
          In <source>Working Notes of the 2007 CLEF Workshop</source>,
          Budapest, Hungary, September <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Henning</given-names> <surname>Müller</surname></string-name>,
          Thomas Deselaers, Thomas Lehmann, Paul Clough, Eugene Kim, and
          <string-name><given-names>William</given-names> <surname>Hersh</surname></string-name>.
          <article-title>Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks</article-title>.
          In <source>Working Notes of the 2006 CLEF Workshop</source>,
          Alicante, Spain, September <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Henning</given-names> <surname>Müller</surname></string-name>,
          <string-name><given-names>N</given-names> <surname>Michoux</surname></string-name>,
          <string-name><given-names>D</given-names> <surname>Bandon</surname></string-name>, and
          <string-name><given-names>A</given-names> <surname>Geissbuhler</surname></string-name>.
          <article-title>A review of content-based image retrieval systems in medical applications. clinical benefits and future directions</article-title>.
          <source>Int J Med Inform</source>,
          <volume>73</volume>(<issue>1</issue>):<fpage>1</fpage>-<lpage>23</lpage>,
          <year>2004</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>