Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks

Henning Müller1, Thomas Deselaers2, Thomas Lehmann3, Paul Clough4, Eugene Kim5, William Hersh5

1 Medical Informatics, University and Hospitals of Geneva, Switzerland
2 Computer Science Dep., RWTH Aachen University, Germany
3 Medical Informatics, RWTH Aachen University, Germany
4 Sheffield University, Sheffield, UK
5 Oregon Health and Science University (OHSU), Portland, OR, USA

henning.mueller@sim.hcuge.ch

Abstract

This paper describes the medical image retrieval and the medical annotation tasks of ImageCLEF 2006. These two tasks are described in a paper separate from the other ImageCLEF tasks to keep the size of the overview papers manageable. Both medical tasks are described with respect to their goals, the databases used, the topics created and distributed among participants, the results, and the techniques used. The best-performing techniques are described in more detail to provide better insight into successful strategies. Some ideas for future tasks are also presented.

The ImageCLEFmed medical image retrieval task had 12 participating groups and received 100 submitted runs. Most runs were automatic, with only a few manual or interactive ones. Purely textual runs outnumbered purely visual runs, but most runs were mixed, i.e., they used visual and textual information. None of the manual or interactive techniques were significantly better than those used for the automatic runs. The best-performing systems combined visual and textual techniques, but combinations of visual and textual features often did not improve a system's performance. Purely visual systems only performed well on the visual topics.

The medical automatic annotation task used a larger database in 2006, with 10'000 training images and 116 classes, up from 57 classes in 2005. Twelve participating groups submitted 27 runs. Despite the much larger number of classes, results were almost as good as in 2005, and a clear improvement in performance could be shown. The best-performing system of 2005 would only have reached a rank in the upper middle field in 2006.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Image Retrieval, Performance Evaluation, Image Classification, Medical Imaging

1 Introduction

ImageCLEF (http://ir.shef.ac.uk/imageclef/) [3] started within CLEF (Cross Language Evaluation Forum, http://www.clef-campaign.org/) in 2003. A medical image retrieval task was added in 2004 to explore domain-specific multilingual information retrieval as well as multi-modal retrieval (combining visual and textual features for retrieval). Since 2005, a medical retrieval task and a medical image annotation task have been part of ImageCLEF. This paper concentrates on the two medical tasks, whereas a second paper [2] describes the new object classification and the photographic retrieval tasks. More detailed information can also be found on the task web pages for ImageCLEFmed (http://ir.ohsu.edu/images) and the medical annotation task (http://www-i6.informatik.rwth-aachen.de/~deselaers/imageclef06/medicalaat.html). A detailed analysis of the 2005 medical image retrieval task is available in [8].

2 The Medical Image Retrieval Task

2.1 General Overview

In 2006, the medical retrieval task was run for the third year, and for the second year in a row with the same dataset of over 50'000 images from four distinct collections.
One of the most interesting findings of 2005 was the variable performance of systems depending on whether the topics had been classified as amenable to visual, textual, or mixed retrieval methods. For this reason, we developed 30 topics for 2006, with 10 in each of these three categories. The scope of the topic development was slightly enlarged by using the log files of a medical media search engine of the Health on the Net (HON) foundation. Analysis of these logs showed a large number of rather general topics that did not cover all four axes defined in 2005:

• Anatomic region shown in the image;
• Image modality (e.g. x-ray, CT, MRI, gross pathology, etc.);
• Pathology or disease shown in the image;
• Abnormal visual observation (e.g. enlarged heart).

The process of relevance judging was similar to 2005, and for the evaluation of the results the trec_eval package was used, since it is the standard in information retrieval.

2.2 Registration and participation

In 2006, a record number of 47 groups registered for ImageCLEF, and among these, 37 also registered for the medical image retrieval task. Groups came from four continents and a total of 16 countries. Unfortunately, many of the registered groups did not send in results. In the end, 12 groups from 8 countries submitted results. Each entry below briefly describes the techniques used for their submissions.

• Concordia University, Canada. The CINDI group from Concordia University, Montreal, Canada submitted a total of four runs, one purely textual, one purely visual, and two combined runs. Text retrieval was based on Apache Lucene. For visual information, a combination of global and local features was used and compared using the Euclidean distance. Most of the submissions used relevance feedback.

• Microsoft Research, China. Microsoft Research China submitted one purely visual run using a combination of various features accounting for color, texture, and blocks.

• Institute for Infocomm Research I2R–IPAL, Singapore. IPAL submitted 26 runs, the largest number of any group. Textual and visual runs were prepared in cooperation with I2R. For visual retrieval, patches of image regions were used, manually classified into semantically valid categories, and mapped to the Unified Medical Language System (UMLS). For the textual analysis, the texts in the three languages were separately mapped to UMLS terms, which were then used for retrieval. Several classifiers based on SVMs and other classical approaches were used and combined.

• University Hospitals of Freiburg, Germany. The Freiburg group submitted a total of 9 runs, mainly using textual retrieval. An interlingua and the original language were used (MorphoSaurus and Lucene). Queries were preprocessed by removing the "show me" text. The runs differed in query language and in the combination with GIFT settings.

• Jaen University (SINAI), Spain. The SINAI group submitted 12 runs, three of them using only textual information and nine using a textual retrieval system and adding the provided data from the GIFT image retrieval system. The runs differed in the settings for "information gain" and the weighting of textual and visual information.

• Oregon Health and Science University (OHSU), USA. OHSU performed manual modification of queries and then attempted to augment the output by fusing results from visual runs.
One set of runs from OHSU established a baseline using the text of the topics as given. Another set of runs then manually modified the topic text, removing common words and adding synonyms. For both sets of runs, there were submissions in each of the three individual languages (English, French, German), plus a merged run with all three and another run with the English topics expanded by automatic translation using the Babelfish translator. The manual modification of the queries improved performance substantially, though it still remained below other groups' automated methods. The best results came from the English-only queries, followed by the automatically translated and the merged queries. One additional run assessed fusing data from a visual run with the merged queries. This decreased MAP but did improve precision at high levels of retrieval output, e.g., precision at 10 and 30 images.

• I2R Medical Analysis Lab, Singapore. Their submission was made together with the IPAL group from the same lab.

• MedGIFT, University and Hospitals of Geneva, Switzerland. The University and Hospitals of Geneva relied on two retrieval systems for their submission. The visual part was performed with the medGIFT retrieval system. The textual retrieval used a mapping of the query and document text onto concepts of the MeSH (Medical Subject Headings) terminology. Matching was then performed with a frequency-based weighting method using easyIR. All results were automatic runs using visual, textual, and mixed features. Separate runs were submitted for the three languages.

• RWTH Aachen University – Computer Science, Germany. RWTHi6 submitted a total of nine runs, all using the FIRE retrieval system and a variety of features describing color, texture, and global appearance in different ways. For one of the runs, the queries and the qrels of last year were used as training data to obtain weights for the combination of features using maximum entropy training. One run was purely textual, three runs were purely visual, and the remaining five runs used textual and visual information. All runs were fully automatic, without any user interaction or manual tuning.

• RWTH Aachen University – Medical Informatics, Germany. RWTHmi submitted two purely visual runs without any user interaction. Both runs used a combination of various global appearance features compared using invariant distance measures, as well as texture features. The runs differed in the weights for the features used.

• State University of New York, Buffalo, USA. SUNY submitted four runs, two purely textual and two using textual and visual information. The parameters of their system were tuned using the ImageCLEF 2005 topics, and automatic relevance feedback was used in different variations.

• LITIS Lab, INSA Rouen, France. The INSA group from Rouen submitted one run using visual and textual information. For the textual information, the MeSH dictionaries were used, and the images were represented by various features accounting for global and local information. Most of the topics were treated fully automatically; only four topics were treated with manual interaction.

2.3 Databases

In 2006, the same dataset was used as in 2005, containing four distinct sets of images. The Casimage dataset (http://www.casimage.com/) was made available to participants [14], containing almost 9'000 images of 2'000 cases [15]. Images present in Casimage include mostly radiology modalities, but also photographs, PowerPoint slides and illustrations.
Cases are mainly in French, with around 20% being in English and 5% without annotation. We also used the PEIR (Pathology Education Instructional Resource, http://peir.path.uab.edu/) database with annotation based on the HEAL project (Health Education Assets Library, http://www.healcentral.com/, mainly pathology images [1]). This dataset contains over 33'000 images with English annotations, the annotation being on a per-image and not a per-case basis as in Casimage. The nuclear medicine database of MIR, the Mallinckrodt Institute of Radiology (http://gamma.wustl.edu/home.html) [16], was also made available to us for ImageCLEFmed. This dataset contains over 2'000 images, mainly from nuclear medicine, with annotations provided per case and in English. Finally, the PathoPic collection (http://alf3.urz.unibas.ch/pathopic/intro.htm, pathology images [5]) was included in our dataset. It contains 9'000 images with extensive annotation on a per-image basis in German. Part of the German annotation has been translated into English. As such, we were able to use a total of more than 50'000 images with annotations in three different languages. Through an agreement with the copyright holders, we were able to distribute these images to the participating research groups.

2.4 Query topics

The query topics were based on two surveys performed in Portland and Geneva [7, 12]. In addition, a log file of the HON media search engine (http://www.hon.ch/) was used to create topics. Based on the surveys, topics for ImageCLEFmed were developed along the following axes:

• Anatomic region shown in the image;
• Image modality (x-ray, CT, MRI, gross pathology, etc.);
• Pathology or disease shown in the image;
• Abnormal visual observation (e.g. enlarged heart).

Still, as the HON log files indicated rather general topics compared to the fairly specific ones used in 2005, we used real queries from these log files in 2006. We could not use the most frequent queries, since they were too general (e.g. heart, lung, etc.), but rather chose those that satisfied at least two of the defined axes and that appeared frequently. After identifying over 50 such candidate topics, we grouped them into three classes based upon an estimation of the retrieval techniques with which they would be most retrievable: visual, mixed, or textual. Another goal was to cover frequent diseases and to have a balanced variety of imaging modalities and anatomic regions corresponding to the database, which contains many pathology images. After choosing ten queries for each of the three categories, we manually searched the web for query images. In 2005, the example images had partly been taken from the collection itself, although they were cropped most of the time. Having images from outside the collection made the visual task more challenging, as these images could come from other modalities and have completely different characteristics concerning texture, luminosity, etc. This year we created 10 topics for each of the 3 groups, for a total of 30 topics. Figures 1, 2, and 3 show examples of a visual, a mixed, and a semantic topic, respectively.

Figure 1: Example of a visual topic. "Show me chest CT images with nodules." / "Zeige mir CT Bilder der Lunge mit Knötchen." / "Montre-moi des CTs du thorax avec nodules."

Figure 2: Example of a mixed topic. "Show me blood smears that include polymorphonuclear neutrophils." / "Zeige mir Blutabstriche mit polymorphonuklearer Neutrophils." / "Montre-moi des échantillons de sang incluant des neutrophiles polymorphonucléaires."
Figure 3: Example of a semantic topic. "Show me x-ray images of bone cysts." / "Zeige mir Röntgenbilder von Knochenzysten." / "Montre-moi des radiographies de kystes d'os."

Figure 4: Evaluation results and number of relevant images per topic.

2.5 Relevance Judgements

For relevance judging, pools were built for each topic from all images ranked in the top 30 of any submitted run. This gave pools of anywhere from 647 to 1187 images, with a mean of 910 per topic. Relevance judgments were performed by seven US physicians enrolled in the OHSU biomedical informatics graduate program. Eleven of the 30 topics were judged in duplicate, with two judged by three different judges. Each topic had a designated "original" judge from the seven. A total of 27,306 relevance judgements were made. (These were primary judgments; the topics with duplicate judgments will be analyzed later.) The judgments were turned into a qrels file, which was then used to calculate results with trec_eval. We used Mean Average Precision (MAP) as the primary evaluation measure. We note, however, that its orientation towards recall (over precision) may not be appropriate for many image retrieval tasks.

2.6 Submissions and Results

A total of 12 groups participated in ImageCLEFmed 2006, from eight different countries (Canada, China, France, Germany, Singapore, Spain, Switzerland, and the United States). These groups collectively submitted 100 runs, with each group submitting anywhere from 1 to 26 runs. We classified the submitted runs along two dimensions: the interaction used (automatic: no human intervention; manual: human modification of the query before the output of the system is seen; interactive: human modification of the query after the output of the system is seen) and the data used for retrieval (visual, textual, or mixed). The majority of the submitted runs were automatic. There were fewer visual runs than textual and mixed runs.

Figure 4 gives an overview of the number of relevant images per topic and of the performance (MAP) that each topic obtained on average. It can be seen that the variation was substantial. Some topics had several hundred relevant images in the collection, whereas others only had very few. Likewise, performance could be extremely good for a few topics and extremely bad for others. There does not appear to be a direct connection between the number of relevant images for a topic and the average performance that systems obtain.

Figure 5: Evaluation results for the best runs of each system in each category, ordered by MAP.

In Figure 5, several performance measures are compared for the submitted runs. In particular, for early precision (P(30)) the variations were quite large, but they slowly disappear for later precision (P(100)). On the whole, these measures do seem to correlate fairly well.

2.6.1 Automatic retrieval

The category of automatic runs was by far the most common category of submissions: a total of 79 of the 100 submitted runs were automatic. Table 1 shows the best run of each participating system per category, as do the following tables; showing all 100 runs would have made the information difficult to read. We can see that the best submitted automatic run was a mixed run and that other mixed runs also had very good results. Nonetheless, several of the very good results were textual only, so a generalization does not seem completely possible. Visual systems had a fairly low overall performance, although for the first ten (visual) topics their performance was very good.
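For reference, the MAP, R-Prec, and P(n) values reported in Tables 1–3 follow the standard trec_eval definitions and are stated here only for completeness. For a topic q with relevant-image set R_q and a ranked result list of length N,

\[
P(n) = \frac{1}{n}\sum_{k=1}^{n} \mathrm{rel}_q(k), \qquad
\mathrm{AP}(q) = \frac{1}{|R_q|}\sum_{k=1}^{N} P(k)\,\mathrm{rel}_q(k), \qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q} \mathrm{AP}(q),
\]

where rel_q(k) is 1 if the image at rank k is relevant to q and 0 otherwise, and Q is the set of topics; R-Prec is the precision after |R_q| retrieved images. These are the usual information retrieval measures and are not specific to ImageCLEF.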
2.6.2 Manual retrieval

Table 2 shows the submitted manual runs. With the small number of such runs, generalization is difficult.

2.6.3 Interactive retrieval

Table 3 shows the submitted interactive runs. The first run had good performance but was still not better than the best automatic run of the same group.

Table 1: Overview of the automatic runs.

Run identifier           visual  textual  MAP     R-Prec
IPAL Cpt Im              x       x        0.3095  0.3459
IPAL Textual CDW         -       x        0.2646  0.3093
GE 8EN.treceval          -       x        0.2255  0.2678
UB-UBmedVT2              x       x        0.2027  0.2225
UB-UBmedT1               -       x        0.1965  0.2256
UKLFR origmids en en     -       x        0.1698  0.2127
RWTHi6-EnFrGePatches     x       x        0.1696  0.2078
RWTHi6-En                -       x        0.1543  0.1911
OHSU baseline trans      -       x        0.1264  0.1563
GE vt10.treceval         x       x        0.12    0.1703
SINAI-SinaiOnlytL30      -       x        0.1178  0.1534
CINDI Fusion Visual      x       -        0.0753  0.1311
MSRA WSM-msra wsm        x       -        0.0681  0.1136
IPAL Visual SPC+MC       x       -        0.0634  0.1048
RWTHi6-SimpleUni         x       -        0.0499  0.0849
SINAI-SinaiGiftT50L20    x       x        0.0467  0.095
GE-GE gift               x       -        0.0467  0.095
UKLFR mids en all co     x       x        0.0167  0.0145

Table 2: Overview of the manual runs.

Run identifier           visual  textual  MAP     R-Prec
OHSUeng                  -       x        0.2132  0.2554
IPAL CMP D1D2D4D5D6      x       -        0.1596  0.1939
INSA-CISMef              x       x        0.0531  0.0719

Table 3: Overview of the interactive runs.

Run identifier           visual  textual  MAP     R-Prec
IPAL Textual CRF         -       x        0.2534  0.2976
OHSU-OHSU m1             x       x        0.1563  0.187
CINDI Text Visual RF     x       x        0.1513  0.1969
CINDI Visual RF          x       -        0.0957  0.1347

2.7 Conclusions

The best overall run, submitted by the IPAL institute, was an automatic run using visual and textual features. From the submitted runs, we can say that interactive and manual runs did not perform better than the automatic runs. This may be partly due to the fact that most groups submitted many more automatic runs than other runs; the automatic approach appears to be less time-consuming, and most research groups have more experience in optimizing such runs. Visual features seem to be useful mainly for the visual topics but fail to help for the semantic topics. Text-only runs perform very well, and only a few mixed runs manage to be better.

3 The Medical Automatic Annotation Task

Automatic image annotation is a classification task in which a given image is automatically labeled with a text describing its contents. In restricted domains, the annotation may be just a class from a constrained set of classes, or it may be an arbitrary narrative text describing the contents of the images. Last year, the medical automatic annotation task was performed in ImageCLEF to compare state-of-the-art approaches to automatic image annotation and classification and to make a first step toward using automatically annotated images in a multi-modal retrieval system [13]. This year's medical automatic annotation task builds on last year's: 1,000 new images to be classified were collected, and the number of classes was more than doubled, resulting in a harder task.

3.1 Database & Task Description

The complete database consists of 11,000 fully classified radiographs taken randomly from medical practice at the RWTH Aachen University Hospital. A total of 9,000 of these were released together with their classification as training data, and another 1,000 were also published with their classification as validation data to allow the groups to tune their classifiers in a standardized manner. One thousand additional images were released at a later date without their classification as test data.
These 1,000 images had to be classified using the 10,000 images (9,000 training + 1,000 validation) as training data. The complete database of 11,000 images was subdivided into 117 classes according to the complete IRMA code annotation [11]. The IRMA code is a multi-axial scheme assessing anatomy, biosystem, creation, and direction of imaging. Currently, this code is available in English and German, but it could easily be translated into other languages. It is planned to use the results of such automatic annotation experiments for further, textual image retrieval tasks in the future. Example images from the database together with their class numbers are shown in Figure 6. The classes in the database are not uniformly distributed: for example, class 111 has a 19.3% share of the complete dataset and class 108 a 9.2% share, while six classes have a share of only 1‰ or less.

3.2 Participating Groups & Methods

In total, 28 groups registered and 12 of these submitted runs. For each group, a brief description of the methods of the submitted runs is provided. The groups are listed alphabetically by their group id, which is later used in the results section to refer to the groups.

CINDI. The CINDI group from Concordia University in Montreal, Canada submitted 5 runs using a variety of features including the MPEG-7 Edge Histogram Descriptor, the MPEG-7 Color Layout Descriptor, invariant shape moments, downscaled images, and semi-global features. Some of the experiments combine these features with a principal component analysis (PCA). The dimensionality of the feature vectors is up to 580. For four of the runs, a support vector machine (SVM) is used for classification with different multi-class voting schemes. In one run, the nearest neighbor decision rule is applied. The group expects the run cindi-svm-sum to be their best submission.

Figure 6: Example images from the IRMA database together with their class numbers; the bottom row emphasizes the intra-class variability of the IRMA database.

DEU. The Department of Computer Engineering of the Dokuz Eylul University in Tinaztepe, Turkey submitted one run using the MPEG-7 Edge Histogram as an 80-dimensional image descriptor and a 3-nearest-neighbor classifier for classification.

MedIC-CISMeF. The CISMeF team from the INSA Rouen in Saint-Étienne-du-Rouvray Cedex, France submitted four runs. Two of them use a combination of global and local image descriptors, and the other two are based on local image descriptors only. The features are dimensionality-reduced by a PCA, and runs that use the same features differ in the PCA coefficients kept. The features include statistical measures extracted from image regions and texture information. This yields a 1953-dimensional feature vector when only local features are used and a 2074-dimensional feature vector when local and global features are combined. These feature vectors are reduced by PCA to 335 and 470 dimensions, respectively. For classification, a support vector machine with a radial basis function kernel is used. The group expects the run local+globalPCA450 to be their best submission.

MSRA. The Web Search and Mining Group from Microsoft Research Asia submitted two runs. One run uses a combination of gray-block features, block-wavelet features, features accounting for binarized images, and an edge histogram; in total, a 397-dimensional feature vector is used.
The other run uses a bag-of-features approach with vector quantization, where a histogram of quantized vectors is computed region-wise on the images. In both runs, an SVM is used for classification. The group did not indicate which of the two runs they expect to be better.

MU I2R. The Media Understanding Group of the Institute for Infocomm Research, Singapore submitted one run, in which a two-stage medical image annotation method was applied. In the first stage, the images are reduced to 32×32 pixels (a 1024-dimensional vector) and classified using a support vector machine. In the second stage, those decisions for which the support vector machine was unsure were refined using a classifier that was trained on a subset of the training images. In addition to down-scaled images, SIFT (scale-invariant feature transform) features and principal components of features were used for classification.

NCTU DBLAB. The DBLAB of the National Chiao Tung University in Hsinchu, Taiwan submitted one run using three image features as image descriptors: Gabor texture features, coherence moments, and a related vector layout. The classification was done using a nearest neighbor classifier.

OHSU. The Department of Medical Informatics & Clinical Epidemiology of the Oregon Health and Science University in Portland, OR, USA submitted 4 runs. For image representation, a variety of descriptors was tested, including 16×16 pixel versions of the images and partly localized gray-level co-occurrence matrix (GLCM) features, resulting in a feature vector of up to 380 components. For classification, a multilayer perceptron was used, and the settings were optimized using the development set.

RWTHi6. The Human Language Technology and Pattern Recognition Group from the RWTH Aachen University in Aachen, Germany submitted three runs. One uses the image distortion model (IDM) that was used for the best run of last year, and the other two use a sparse histogram of image patches and their absolute positions. The IDM run is based on a nearest neighbor classifier, while the other runs use an SVM or a maximum entropy classifier. The feature vectors for the IDM experiments have fewer than 1024 components, and the sparse histograms have 65,536 bins. The group expects the run SHME to be their best submission.

RWTHmi. The Image Retrieval in Medical Applications (IRMA) group, Department of Medical Informatics, RWTH Aachen University Hospital in Aachen, Germany submitted two runs using cross-correlation on 32×32 images with explicit translation shifts, IDM for X×32 images, global texture features as proposed by Tamura, and global texture features as proposed by Castelli et al. based on fractal concepts, resulting in an approximately 2500-dimensional feature vector. For classification, a nearest neighbor classifier was used. For the run RWTHmi-opt, the weights for these features were optimized on the development set, and for the run RWTHmi-baseline the default parameters of the IRMA system were used.

UFR. The Pattern Recognition and Image Processing group from the University of Freiburg in Freiburg, Germany submitted two runs using gradient-like features extracted at interest points. Gradients over multiple directions and scales are calculated and used as a local feature vector.
The features are clustered to form a code book of size 20, and a cluster co-occurrence matrix is computed over multiple distance ranges and multiple angle ranges (since rotation invariance is not desired), resulting in a 4-D array per image which is flattened and used as the final, approximately 160,000-dimensional feature vector. Classification is done using a multi-class SVM in a one-vs-rest approach with a histogram intersection kernel.

ULG. The Systems and Modeling group of the Institut Montefiore in Liège, Belgium extracts a large number of possibly overlapping square sub-windows of random sizes at random positions from the training images. Then, an ensemble model composed of twenty randomized trees is automatically built based on size-normalized versions of the sub-windows. It operates directly on their pixel values to predict the classes of sub-windows. Given this sub-window classifier, a new image is classified by extracting sub-windows and combining the classification decisions. The feature vectors are 576-dimensional. The group expects the run ULG-SYSMOD-RANDOM-SUBWINDOWS-EX to be their best submission.

UTD. The Data Mining Laboratory group of the University of Texas at Dallas, Richardson, TX, USA submitted one run. The images are scaled to 16×16 pixels, and their dimensionality is reduced by PCA, resulting in an at most 256-dimensional feature vector. Then, a weighted k-nearest-neighbor algorithm is applied for classification.

MedGIFT. The medGIFT group of the University and Hospitals of Geneva submitted three runs to the medical automatic annotation task. One was entirely based on the tf/idf weighting of the GNU Image Finding Tool (GIFT) and thus acted as a baseline using only collection frequencies of features, with no learning on the supplied training data. For the second run, the features are weighted with an additional factor learned from the supplied training data. For these submissions, a 5-NN classifier was used. The third submission is a combination of several separate runs by voting. The combined runs are quite different from each other, so the combination run is expected to be the best submission. The runs were submitted after the evaluation ended and are thus not ranked.

3.3 Results

The results of the evaluation are shown in Table 4. The error rates ranged from 16.2% to 34.1%. Based on the training data, a system guessing the most frequent class for all 1,000 test images would obtain an 80.5% error rate, since 195 radiographs of the test set were from class 111, the biggest class in the training data. A more realistic baseline is given by a nearest neighbor classifier using the Euclidean distance to compare the images scaled to 32×32 pixels [9]. This classifier yields an error rate of 32.1%.

The average confusion matrix of all submitted runs is shown in Figure 7. A clear diagonal structure is visible; thus, on average, many images were classified correctly. However, it can also be seen that some classes have high inter-class similarity: e.g. classes 108 to 111 are often confused. In total, many images from other classes were classified as class 111, the class with the largest amount of training data. Obviously, not all classes were equally difficult. A tendency could be seen that classes with only few training instances were harder to classify than classes with a large amount of training data; this was to be expected and has been reported in the literature earlier [6].

Given the confidence files of all runs, we tried to combine the classifiers by the sum rule. To this end, all confidence files were normalized such that the confidences could be interpreted as a-posteriori probabilities p(c|x), where c is the class and x the observation. Unlike last year, where this technique could not improve the results, clear improvements were possible by combining several classifiers [10]: using the top three ranked classifiers in combination, an error rate of 14.4% was obtained, and the best result was obtained by combining the top seven ranked classifiers. Note that no additional parameters were tuned here; the classifiers were combined with equal weights.
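Stated explicitly, the equally weighted sum rule used here assigns a test image x the class

\[
\hat{c}(x) \;=\; \operatorname*{arg\,max}_{c} \; \frac{1}{N} \sum_{i=1}^{N} p_i(c \mid x),
\]

where p_i(c|x) denotes the normalized confidence of the i-th combined classifier and N is the number of classifiers combined (three or seven in the experiments reported above).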
3.4 Discussion

The most interesting observation of this year's evaluation can be made when comparing the results with those of last year: the RWTHi6-IDM system [4] that performed best in last year's task (error rate: 12.1%) obtained an error rate of 20.4% this year. This increase in error rate can be explained by the larger number of classes and thus more similar classes that can easily be confused. On the other hand, 10 methods clearly outperformed this result this year; nine of these use SVMs as classifier (ranks 2–10) and one uses a discriminatively trained log-linear model (rank 1). Thus, it can clearly be stated that the performance of image annotation techniques improved strongly over the last year, and that techniques initially developed in the field of object recognition and detection are very well suited for the automatic annotation of medical radiographs.

Another interesting observation drawn from the combination of classifiers is that, in contrast to last year, where a combination of arbitrary classifiers from the evaluation did not lead to an improvement over the best submission, this year a clear improvement was obtained by combining several submissions. A reason for this might be the improved performance of the submissions or the higher diversity among the submitted methods.

To give an approximate idea of the runtime and memory requirements of the various methods, we give the dimensionality of the feature vectors used by the groups. Naturally, the dimension of the feature vectors alone does not say very much about runtime, because the models used for classification also have a high impact on runtime and memory consumption. However, a clear trend can be seen that high-dimensional feature vectors lead to good results, as the best three methods use feature vectors of very high dimensionality (65,536 and 160,000, respectively) and implicitly transform these into even higher-dimensional feature spaces by the use of kernel methods.
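As an illustration of such a kernel, the histogram intersection kernel used by one of the top-ranked systems (see the UFR description in Section 3.2) compares two histograms h and h' with B bins as

\[
K(h, h') \;=\; \sum_{b=1}^{B} \min(h_b, h'_b),
\]

a standard kernel that corresponds to an implicit mapping into a much higher-dimensional feature space. This formula is given only for illustration; the exact kernels and parameters vary between submissions.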
Table 4: Results of the medical automatic annotation task. If a group submitted several runs, the run that was expected to be their best is marked with '*'.

  rank  Group          Runtag                        Error rate [%]
*    1  RWTHi6         SHME                          16.2
*    2  UFR            UFR-ns-1000-20x20x10          16.7
     3  RWTHi6         SHSVM                         16.7
     4  MedIC-CISMeF   local+global-PCA335           17.2
     5  MedIC-CISMeF   local-PCA333                  17.2
     6  MSRA           WSM-msra-wsm-gray             17.6
*    7  MedIC-CISMeF   local+global-PCA450           17.9
     8  UFR            UFR-ns-800-20x20x10           17.9
     9  MSRA           WSM-msra-wsm-patch            18.2
    10  MedIC-CISMeF   local-PCA150                  20.2
    11  RWTHi6         IDM                           20.4
*   12  RWTHmi         opt                           21.5
    13  RWTHmi         baseline                      21.7
*   14  CINDI          cindi-svm-sum                 24.1
    15  CINDI          cindi-svm-product             24.8
    16  CINDI          cindi-svm-ehd                 25.5
    17  CINDI          cindi-fusion-KNN9             25.6
    18  CINDI          cindi-svm-max                 26.1
*   19  OHSU           OHSU-iconGLCM2-tr             26.3
    20  OHSU           OHSU-iconGLCM2-tr-de          26.4
    21  NCTU           dblab-nctu-dblab2             26.7
    22  MU             I2R-refine-SVM                28.0
    23  OHSU           OHSU-iconHistGLCM2-t          28.1
*   24  ULG            SYSMOD-RANDOM-SUBWINDOWS-EX   29.0
    25  DEU            DEU-3NN-EDGE                  29.5
     -  medGIFT        combination                   29.7
    26  OHSU           OHSU-iconHist-tr-dev          30.8
     -  medGIFT        fw-bwpruned                   31.7
    27  UTD            UTD                           31.7
     -  medGIFT        baseline                      32.0
    28  ULG            SYSMOD-RANDOM-SUBWINDOWS-24   34.1

Figure 7: Average confusion matrix over all runs of the medical automatic annotation task. Dark points denote high entries, white points denote zero entries. The correct class is given on the x-axis and the class to which images were assigned on the y-axis. For visualization purposes, the values are shown on a logarithmic scale.

4 Overall Conclusions

For the medical retrieval task, none of the manual or interactive techniques were significantly better than those used for the automatic runs. The best-performing systems combined visual and textual techniques, but several times a combination of visual and textual features did not improve a system's performance. Thus, combinations for multi-modal retrieval need to be done carefully. Purely visual systems only performed well on the visual topics.

For the automatic annotation task, discriminative methods outperformed methods based on nearest neighbor classification, and the top-performing methods were based on the assumption that images consist of image parts that can be modelled more or less independently.

One goal for future tasks is to motivate groups to work more on interactive or manual runs rather than on automatic retrieval only. With proper manpower, such runs should be better than even optimized automatic runs. Another future goal is to motivate a larger share of the registered groups to actually participate. Collections are planned to become larger as well, to stay realistic. Some groups already complained about the datasets being too large, so a smaller second dataset might be an option that allows these groups to at least submit some results and compare them with the other techniques. For the automatic annotation task, a future goal is to use textual labels with varying annotation precision rather than a simple class-based annotation scheme and to consider semi-automatic annotation methods.

Acknowledgements

We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. Furthermore, the authors would like to thank LTUtech (http://www.ltutech.com/) for providing the database for the non-medical automatic annotation task and Tobias Weyand for creating the web interface for submissions.
This work was partially funded by the DFG (Deutsche Forschungsgemeinschaft) under contracts NE-572/6 and Le-1108/4, the Swiss National Science Foundation (FNS) under contract 205321-109304/1, the American National Science Foundation (NSF) with grant ITR-0325160, and the EU Sixth Framework Program with the SemanticMining project (IST NoE 507505) and the MUSCLE NoE.

References

[1] C. S. Candler, S. H. Uijtdehaage, and S. E. Dennis. Introducing HEAL: The Health Education Assets Library. Academic Medicine, 78(3):249–253, 2003.

[2] P. Clough, M. Grubinger, T. Deselaers, A. Hanbury, and H. Müller. Overview of the ImageCLEF 2006 photo retrieval and object annotation tasks. In CLEF Working Notes, Alicante, Spain, September 2006.

[3] Paul Clough, Henning Müller, and Mark Sanderson. Overview of the CLEF cross-language image retrieval track (ImageCLEF) 2004. In Carol Peters, Paul D. Clough, Gareth J. F. Jones, Julio Gonzalo, M. Kluck, and B. Magnini, editors, Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign, Lecture Notes in Computer Science, Bath, England, 2005. Springer-Verlag.

[4] Thomas Deselaers, Tobias Weyand, Daniel Keysers, Wolfgang Macherey, and Hermann Ney. FIRE in ImageCLEF 2005: Combining content-based image retrieval with textual information retrieval. In Workshop of the Cross-Language Evaluation Forum (CLEF 2005), Lecture Notes in Computer Science, in press, Vienna, Austria, September 2005.

[5] K. Glatz-Krieger, D. Glatz, M. Gysel, M. Dittler, and M. J. Mihatsch. Webbasierte Lernwerkzeuge für die Pathologie – Web-based learning tools for pathology. Pathologe, 24:394–399, 2003.

[6] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Berlin, 2001.

[7] William Hersh, Jeffery Jensen, Henning Müller, Paul Gorman, and Patrick Ruch. A qualitative task analysis of biomedical image use and retrieval. In ImageCLEF/MUSCLE Workshop on Image Retrieval Evaluation, pages 11–16, Vienna, Austria, September 2005.

[8] William Hersh, Henning Müller, Jeffery Jensen, Jianji Yang, Paul Gorman, and Patrick Ruch. ImageCLEFmed: A text collection to advance biomedical image retrieval. Journal of the American Medical Informatics Association, September/October 2006.

[9] Daniel Keysers, Christian Gollan, and Hermann Ney. Classification of medical images using non-linear distortion models. In Proc. BVM 2004, Bildverarbeitung für die Medizin, pages 366–370, Berlin, Germany, March 2004.

[10] J. Kittler. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, March 1998.

[11] Thomas M. Lehmann, Henning Schubert, Daniel Keysers, M. Kohnen, and Berthold B. Wein. The IRMA code for unique classification of medical images. In Proceedings SPIE, number 5033, pages 440–451, 2003.

[12] Henning Müller, Christelle Despont-Gros, William Hersh, Jeffery Jensen, Christian Lovis, and Antoine Geissbuhler. Health care professionals' image use and search behaviour. In Proceedings of the Medical Informatics Europe Conference (MIE 2006), Maastricht, The Netherlands, August 2006.

[13] Henning Müller, Antoine Geissbuhler, Johan Marty, Christian Lovis, and Patrick Ruch. The use of medGIFT and easyIR for ImageCLEF 2005. In Proceedings of the Cross Language Evaluation Forum 2005, LNCS, in press, Vienna, Austria, September 2006.
[14] Henning Müller, Antoine Rosset, Jean-Paul Vallée, Francois Terrier, and Antoine Geissbuhler. A reference data set for the evaluation of medical image retrieval systems. Computerized Medical Imaging and Graphics, 28:295–305, 2004. [15] Antoine Rosset, Henning Müller, Martina Martins, Natalia Dfouni, Jean-Paul Vallée, and Osman Ratib. Casimage project – a digital teaching files authoring environment. Journal of Thoracic Imaging, 19(2):1–6, 2004. [16] J. W. Wallis, M. M. Miller, T. R. Miller, and T. H. Vreeland. An internet–based nuclear medicine teaching file. Journal of Nuclear Medicine, 36(8):1520–1527, 1995.