Baseline results for the ImageCLEF 2006 medical automatic annotation task

Mark O Güld, Christian Thies, Benedikt Fischer, and Thomas M Lehmann
Department of Medical Informatics, RWTH Aachen, Aachen, Germany
mgueld@mi.rwth-aachen.de

Abstract

The ImageCLEF 2006 medical automatic annotation task encompasses 11,000 images from 116 categories, compared to 10,000 images from 57 categories in the similar task of 2005. As a baseline for comparison, a run using the same classifiers with the identical parameterization as in 2005 is submitted. In addition, the parameterization of the classifiers was optimized on the 9,000/1,000 split of the 2006 training data. In particular, texture-based classifiers are combined in parallel with classifiers that use spatial intensity information to model common variabilities among medical images. However, all individual classifiers are based on global features, i.e. one feature vector describes the entire image. The parameterization from 2005 yields an error rate of 21.7%, which ranks 13th among the 28 submissions. The optimized classifier yields an error rate of 21.4% (rank 12), which is only an insignificant improvement.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Content-based image retrieval, Pattern recognition, Classifier combination

1 Introduction

The ImageCLEF medical automatic annotation task was established in 2005 [1], demanding the classification of 1,000 radiographs into 57 categories based on 9,000 categorized reference images. The ImageCLEF 2006 annotation task [2] consists of 10,000 reference images grouped into 116 categories and 1,000 images to be categorized automatically. This paper aims at providing a baseline for comparing the experiments of 2005 and 2006 rather than at presenting an optimal classifier.

2 Methods

2.1 Image Features

The content of a medical image is represented by texture features proposed by Tamura et al. [3] and by Castelli et al. [4], denoted as TTM and CTM, respectively. In addition, down-scaled representations of the original images were computed at 32 × 32 pixels, disregarding the original aspect ratio, and at X × 32 pixels, preserving it. Since these image icons maintain the spatial intensity information, variabilities which are commonly found in medical imagery can be modelled by the distance measure. These include radiation dose, global translation, and local deformation. For this purpose, the cross-correlation function (CCF) and the image distortion model (IDM) suggested by Keysers et al. [5] are used. In particular, the following parameters were set:

• TTM: texture histograms, 384 bins, Jensen-Shannon divergence
• CTM: texture features, 43 bins, Mahalanobis distance with diagonal covariance matrix Σ
• CCF: 32 × 32 icon, 9 × 9 translation window
• IDM: X × 32 icon, gradients, 5 × 5 window, 3 × 3 context
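To make the icon-based distance measures more concrete, the following sketch shows one possible way to compute the down-scaled icons and a CCF-style distance over a 9 × 9 translation window. It is an illustration only, not the IRMA implementation; the function names, the nearest-neighbor down-sampling, and the wrap-around shifting are simplifying assumptions.

    # Illustrative sketch only (not the IRMA implementation): down-scaled image
    # icons and a CCF-style distance with a 9 x 9 translation window.
    # Images are assumed to be 2-D grayscale NumPy arrays; function names and
    # the simple nearest-neighbor sampling are assumptions made for brevity.
    import numpy as np

    def make_icon(image, height=32, keep_aspect=False):
        # 32 x 32 icon (aspect ratio ignored) or X x 32 icon (aspect ratio kept,
        # interpreting "X x 32" as a fixed height of 32 with proportional width).
        h, w = image.shape
        width = max(1, round(w * height / h)) if keep_aspect else height
        rows = (np.arange(height) * h / height).astype(int)
        cols = (np.arange(width) * w / width).astype(int)
        return image[np.ix_(rows, cols)].astype(float)

    def ccf_distance(query_icon, reference_icon, window=4):
        # 1 minus the maximum normalized cross correlation over all shifts in
        # {-window, ..., +window}^2, i.e. a (2*window+1) x (2*window+1) grid.
        # Wrap-around shifting via np.roll is a simplification.
        q = (query_icon - query_icon.mean()) / (query_icon.std() + 1e-9)
        best = -1.0
        for dy in range(-window, window + 1):
            for dx in range(-window, window + 1):
                shifted = np.roll(np.roll(reference_icon, dy, axis=0), dx, axis=1)
                r = (shifted - shifted.mean()) / (shifted.std() + 1e-9)
                best = max(best, float((q * r).mean()))
        return 1.0 - best

With these two pieces, the distance between a query radiograph and a reference would be ccf_distance(make_icon(q), make_icon(r)). The IDM distance of Keysers et al. [5] additionally allows small local deformations within the 5 × 5 window and is not reproduced here.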
2.2 Classifiers

The single classifiers are combined within a parallel scheme, which performs a weighting of the normalized distances obtained from the single classifiers Ci and applies the nearest-neighbor decision function C to the resulting distances:

dc(q, r) = Σi λi · di(q, r)    (1)

where λi, 0 ≤ λi ≤ 1, Σi λi = 1, denotes the weight for the normalized distance di(q, r) obtained from classifier Ci for a sample q and a reference r.

2.3 Submissions

Based on the combination of classifiers used for the annotation task in 2005 [6], a run using the exact same parameterization is submitted. Additionally, a second run is submitted which uses a parameterization obtained from an exhaustive search (with a step size of 0.05 for the λi) for the best combination of the single classifiers. For this purpose, the system is trained on 9,000 of the 2006 training images, and the remaining 1,000 images serve as a development set.
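The weighted combination in Eq. (1) and the exhaustive weight search are straightforward to reproduce. The following sketch assumes that the normalized distance matrices of the single classifiers on the development split are already available as NumPy arrays; the function names and the plain 1-nearest-neighbor decision are assumptions for illustration (the submitted runs additionally report k = 5 results, which this sketch does not cover).

    # Illustrative sketch only: weighted combination of normalized distances
    # (Eq. 1) with a 1-nearest-neighbor decision, and an exhaustive search over
    # the weights with step size 0.05. Function names are hypothetical; the
    # distance matrices (queries x references) and label arrays are assumed given.
    import itertools
    import numpy as np

    def combined_error_rate(dist_matrices, weights, ref_labels, query_labels):
        combined = sum(w * d for w, d in zip(weights, dist_matrices))  # Eq. (1)
        nearest = combined.argmin(axis=1)                              # 1-NN decision
        return float(np.mean(ref_labels[nearest] != query_labels))

    def exhaustive_weight_search(dist_matrices, ref_labels, query_labels, step=0.05):
        # Enumerate all weight vectors on the simplex with the given step size.
        n = len(dist_matrices)
        ticks = np.arange(0.0, 1.0 + 1e-9, step)
        best_err, best_weights = 1.0, None
        for partial in itertools.product(ticks, repeat=n - 1):
            remainder = 1.0 - sum(partial)
            if remainder < -1e-9:
                continue
            weights = list(partial) + [max(remainder, 0.0)]
            err = combined_error_rate(dist_matrices, weights, ref_labels, query_labels)
            if err < best_err:
                best_err, best_weights = err, weights
        return best_weights, best_err

Applied to the TTM, CTM, CCF, and IDM distance matrices on the 9,000/1,000 split, such a search yields weight vectors of the kind listed in Table 1.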
3 Results

All results are obtained non-interactively, i.e. without relevance feedback by a human user. Table 1 shows the error rates in percent obtained for the 1,000 unknown images using the single nearest-neighbor classifiers and their combinations, both for k = 1 and k = 5. The error rate of 21.4% ranks 12th among the 28 submitted runs for this task. The weights that were optimized on the ImageCLEF 2005 medical automatic annotation task (10,000 images from 57 categories) yield an error rate of 21.7% and rank 13th.

Table 1: Error rates for the medical automatic annotation task.

Classifier          λTTM   λCTM   λCCF   λIDM   k=1     k=5
TTM                 1.00   0      0      0      44.4%   44.9%
CTM                 0      1.00   0      0      27.9%   25.7%
CCF                 0      0      1.00   0      23.0%   23.4%
IDM                 0      0      0      1.00   57.2%   54.9%
ImageCLEF 2005      0.40   0      0.18   0.42   21.7%   22.0%
Exhaustive search   0.25   0.05   0.25   0.45   21.5%   21.4%

4 Discussion

The comparability of similar experiments such as ImageCLEF 2005 and ImageCLEF 2006 is difficult, since several parameters such as the images and the class definitions were changed. In general, the difficulty of a classification problem, as reflected by the error rate, increases with the number of classes and decreases with the number of references per class. Since the number of references varies between the classes, this relation is more complex than linear. However, linearity can serve as a rough first approximation. Since the error rates are not substantially improved by training on the 2006 data, the classifier that was optimized for 2005 is also suitable for the ImageCLEF 2006 medical automatic annotation task. The parameterization optimized for ImageCLEF 2005 yields E2005 = 13.3% on the 2005 task and E2006 = 21.7% on the 2006 task, and the quotient E2005/E2006 ≈ 0.61 can be used to relate the results of the two campaigns. In other words, the best error rate of E = 16.2% from 2006 would approximately have corresponded to E = 16.2% · 0.61 ≈ 10.0% in 2005, which is better than the best rate obtained in 2005. This is in accordance with the rank of the baseline submission, which dropped from 2nd place in 2005 to 13th place in 2006. In conclusion, general improvements for automatic image annotation have been made during the last year.

5 Acknowledgment

This work is part of the IRMA project, which is funded by the German Research Foundation, grant Le 1108/4.

References

[1] Clough P, Müller H, Deselaers T, Grubinger M, Lehmann TM, Jensen J, Hersh W. The CLEF 2005 cross-language image retrieval track. LNCS 2006; 4022: in press.
[2] Müller H, Deselaers T, Lehmann TM, Clough P, Hersh W. Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks. In: CLEF Working Notes; 2006 Sep; Alicante, Spain; in press.
[3] Tamura H, Mori S, Yamawaki T. Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics 1978; 8(6): 460-73.
[4] Castelli V, Bergman LD, Kontoyiannis I, Li CS, Robinson JT, Turek JJ. Progressive search and retrieval in large image archives. IBM Journal of Research and Development 1998; 42(2): 253-68.
[5] Keysers D, Dahmen J, Ney H, Wein BB, Lehmann TM. A statistical framework for model-based image retrieval in medical applications. Journal of Electronic Imaging 2003; 12(1): 59-68.
[6] Güld MO, Thies C, Fischer B, Lehmann TM. Content-based retrieval of medical images by combining global features. LNCS 2006; 4022: in press.