          CLEF2007 Image Annotation Task: an
          SVM-based Cue Integration Approach
                  Tatiana Tommasi, Francesco Orabona, and Barbara Caputo
                                  IDIAP Research Institute,
                           Centre Du Parc, Av. des Pres-Beudin 20,
                        P. O. Box 592, CH-1920 Martigny, Switzerland
                         {ttommasi, forabona, bcaputo}@idiap.ch


                                            Abstract
This paper presents the algorithms and results of our participation in the medical
image annotation task of ImageCLEFmed 2007. As a general strategy, we proposed
a multi-cue approach where images are represented by both global and local descrip-
tors, so as to capture different types of information. These cues are combined during the
     classification step following two alternative SVM-based strategies. The first algorithm,
     called Discriminative Accumulation Scheme (DAS), trains an SVM for each feature
     type, and considers as output of each classifier the distance from the separating hyper-
plane. The final decision is taken on a linear combination of these distances: in this
way cues are accumulated, so that even when one of them is misled the final result can
still be correct. The second algorithm uses a new Mercer kernel that can accept as input
     different feature types while keeping them separated. In this way, cues are selected
     and weighted, for each class, in a statistically optimal fashion. We call this approach
     Multi Cue Kernel (MCK). We submitted several runs, testing the performance of the
     single-cue SVM and of the two cue integration methods. Our team was called BLOOM
(BLanceflOr-tOMed.im2) after the names of our sponsors. The DAS algorithm obtained
     a score of 29.9, which ranked fifth among all submissions. We submitted two versions
     of the MCK algorithm, one using the one-vs-all multiclass extension of SVMs and
     the other using the one-vs-one extension. They scored respectively 26.85 and 27.54,
     ranking first and second among all submissions.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database
Management]: Database Applications—Image databases; I.5 [Pattern Recognition]: I.5.2 De-
sign Methodology

General Terms
Measurement, Performance, Experimentation

Keywords
Automatic Image Annotation, Cue Integration, Support Vector Machines, Kernel Methods
1    Introduction
The amount of medical image data produced nowadays is constantly growing, with average-sized
radiology departments producing several terabytes of data annually. The cost of manually an-
notating these images is very high; furthermore, manual classification induces errors in the tag
assignment, which means that part of the available knowledge is no longer accessible to
physicians [5]. This calls for automatic annotation algorithms able to perform the task reliably,
and benchmark evaluations are thus extremely useful for boosting advances in the field. The
ImageCLEFmed annotation task was established in 2005; in 2007 it provided partic-
ipants with 11,000 training and development images, spread across 116 classes. The task consisted
in assigning the correct label to 1000 test images. For further information on the annotation task
of ImageCLEF 2007 we refer the reader to [6].
    This paper describes the algorithms submitted by the BLOOM (BLanceflOr-tOMed.im2) team,
in its first participation in the CLEF benchmark competition. In order to achieve robustness,
a crucial property for a reliable automatic system, we opted for a multi-cue approach, using
raw pixels as global descriptors and SIFT features as local descriptors. The two feature types
were combined together using two different SVM-based integration schemes. The first is the
Discriminative Accumulation Scheme (DAS), proposed first in [7]. For each feature type, an
SVM is trained and its output consists of the distance from the separating hyperplane. Then,
the decision function is built as a linear combination of the distances, with weighting coefficients
determined via cross validation. We submitted a run using this method (BLOOM-BLOOM DAS)
that obtained a score of 29.9, ranking fifth among all submissions.
    The second integration scheme consists in designing a new Mercer kernel, able to take as
input different feature types for each image. We call it Multi Cue Kernel (MCK); the main
advantage of this approach is that features are selected and weighted during the SVM training,
thus the final solution is optimal as it minimizes the structural risk. We submitted two runs using
this algorithm, the first (BLOOM-BLOOM MCK oa) using the one-vs-all multiclass extension
of SVM; the second (BLOOM-BLOOM MCK oo) using instead the one-vs-one extension. These
two runs ranked first and second among all submissions, with a score of respectively 26.85 and
27.54. These results overall confirm the effectiveness of using multiple cues for automatic image
annotation.
    The rest of the paper is organized as follows: section 2 describes the two types of feature
descriptors we used at the single cue stage. Section 3 gives details on the two alternative SVM-
based cue integration approaches. Section 4 reports the experimental procedure adopted and the
results obtained, with a detailed discussion on the performance of each algorithm. The paper
concludes with a summary discussion.


2    Single Cue Image Annotation
The aim of the automatic image annotation task is to classify images into a set of classes. In
particular, classes were defined along the four independent axes of modality, body orientation,
body region, and biological system examined, according to the IRMA code [3]. The labels are
hierarchical; therefore, errors in the annotation are counted depending on the level at which the
error is made and on the number of possible choices at that level. For each image the error ranges
from 0 to 1, corresponding respectively to a correctly classified image and to a completely wrong
predicted label. It is also possible to assign a “don’t know” label, in which case the score is 0.5.
    The strategy we propose is to extract a set of features from each image and then to use a
Support Vector Machine (SVM) to classify the images. We have explored a local approach, using
SIFT descriptors, and a global approach, using the raw pixels.
2.1    Feature Extraction
We explored the idea of “bag of words” for classification, a concept common to many state-of-the-
art approaches to image classification. It is based on the idea that it is possible to transform
each image into a set of prespecified visual words, and to classify the images using the statistics
of appearance of each word as feature vectors.
    Most of these systems are based on the SIFT descriptor [4]. The basic idea of SIFT
is to describe an area of an image in a way that is robust to noise, illumination, scale, translation
and rotation changes. SIFT points are selected in the image as local maxima of the scale-space,
which makes them intrinsically easy to track. Despite the usefulness of SIFT, there is no reason
to believe that these points are the most informative for a classification task. This has been
pointed out by different works and systematically verified by [8], where it is shown that a dense
random sampling of SIFT points is always superior to any strategy based on interest point
detectors. Moreover, due to the low contrast of radiographs it would be difficult to use any
interest point detector. So, in our approach we densely sampled each input image, extracting a
SIFT descriptor at each point.
    Another modification we made is based on the fact that rotation invariance is likely useless
for the ImageCLEF classification task, as the various structures present in the radiographs are
likely to always appear with the same orientation. Moreover, the scale is not likely to change much
between images of the same class, so we extracted the SIFT descriptors at only one octave, the
one that gave us the best classification performance. In this sense we have decoupled the detection
of a SIFT keypoint from the description of the point itself. To keep the complexity of the
description of each image low and at the same time to retain as much information as possible, we
matched each extracted SIFT with a number of template SIFTs. These template SIFTs form our
vocabulary of visual words. It is built using a standard K-means algorithm, with K equal to 500,
on a random collection of SIFTs extracted from the training images. Various vocabulary sizes were
tested with no significant differences, so we chose the smallest one with good recognition
performance. Note that the test images can also be used in this phase, because the process does
not use the labels and is unsupervised. At this point each image can be described with the raw
counts of each visual word.
    To add some kind of spatial information to our features, we divided the images into four
subimages, collecting the histograms separately for each subimage. In this way the dimension of
the input space is multiplied by four, but in our tests we gained about 3% in classification
performance. We extracted 1500 SIFT descriptors in each subimage: such dense sampling adds
robustness to the histograms. See Figures 1 and 2 for an example.
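    As a minimal sketch of this bag-of-words pipeline (using scikit-learn for K-means; the dense
SIFT descriptors are assumed to be computed elsewhere, and all function names here are ours,
not from the original implementation):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptor_pool, k=500):
        # descriptor_pool: (n, 128) array of SIFT descriptors randomly
        # collected from the training images
        return KMeans(n_clusters=k, n_init=3).fit(descriptor_pool)

    def bow_histogram(subimage_descriptors, vocabulary, k=500):
        # subimage_descriptors: 4 arrays, one per subimage, each (1500, 128)
        feats = []
        for desc in subimage_descriptors:
            words = vocabulary.predict(desc)               # nearest visual word
            feats.append(np.bincount(words, minlength=k))  # raw word counts
        return np.concatenate(feats)                       # length 4 * 500 = 2000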
    Another approach that we explored was the simplest possible global description method: the
raw pixels. The images were resized to 32×32 pixels, regardless of their original dimensions, and
normalized to have sum equal to one; the 1024 raw pixel values were then used as input features.
This approach is at the same time a baseline for the classification system and a useful “companion”
method to boost the performance of the SIFT-based classifier (see section 2.2).
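    A sketch of this global descriptor, assuming grayscale images loaded with PIL (the helper
name is ours):

    import numpy as np
    from PIL import Image

    def raw_pixel_features(path):
        # Resize to 32x32 regardless of the original size, flatten,
        # and normalize the 1024 values to sum to one.
        img = Image.open(path).convert("L").resize((32, 32))
        x = np.asarray(img, dtype=np.float64).ravel()
        return x / x.sum()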

2.2    Classification
For the classification step we used an SVM with an exponential χ2 kernel, for both the local
and global approaches:

$$K(X, Y) = \exp\left(-\gamma \sum_{i=1}^{N} \frac{(X_i - Y_i)^2}{X_i + Y_i}\right). \qquad (1)$$

The parameter γ was tuned through cross-validation (see section 4). This kernel has been suc-
cessfully applied to histogram comparison and has been demonstrated to be positive definite
[2], thus it is a valid kernel.
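    A direct vectorized implementation of (1) between two sets of histogram features (the small
constant guarding against empty bins is an implementation detail not discussed in the text):

    import numpy as np

    def exp_chi2_kernel(X, Y, gamma):
        # X: (n, d) and Y: (m, d) nonnegative histogram features;
        # returns the (n, m) Gram matrix of equation (1)
        num = (X[:, None, :] - Y[None, :, :]) ** 2
        den = X[:, None, :] + Y[None, :, :] + 1e-10
        return np.exp(-gamma * (num / den).sum(axis=2))

The resulting Gram matrix can be fed to an SVM as a precomputed kernel, e.g. with
sklearn.svm.SVC(kernel="precomputed").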
    Even if the labels are hierarchical, we have chosen to use the standard multi-class approaches.
This choice is motivated by the finding that, with our features, the recognition rate was lower using
an axis-wise classification. This could be due to the fact that each super-class has a variability so
high that the chosen features are not able to model it, while they can model the small
sub-classes very well. In particular, we tested both the one-vs-one and one-vs-all multi-class
extensions of SVM.

Figure 1: (a) Radiographic image divided into 4 subimages and (b) the corresponding counts of
the visual words in each subimage.


3     Multi Cue Annotation
Due to the fundamental difference in how local and global features are computed, it is reasonable
to suppose that the two representations provide different kinds of information. Thus, we expect
that by combining them through an integration scheme we should achieve better performance,
namely higher classification accuracy and higher robustness.
    In the computer vision and pattern recognition literature some authors have suggested different
methods to combine information derived from different cues (for a review on the topic we refer
the reader to [9]). Some of them are based on building new representations, but this technique
does not solve the robustness problem because if one of the cues gives misleading information it
is quite probable that the new feature vector will be adversely affected. Moreover, the dimension
of such a feature vector would increase as the number of cues grows, implying longer learning
and recognition times, greater memory requirements and possibly curse of dimensionality effects.
The strategy we follow in this paper is to use integration schemes, thus keeping the feature
descriptors separated and fusing them at a mid- or high- level. In the rest of the section we
describe the two alternative integration schemes we used in the ImageCLEF competition. The
first, the Discriminative Accumulation Scheme (DAS, [7]), is a high-level integration scheme,
meaning that each single cue first generates a set of hypotheses on the correct label of the test
image, and then those hypotheses are combined so as to obtain a final output. This method
is described in section 3.1. The second, the Multi Cue Kernel (MCK), is a mid-level integration
scheme, meaning that the different feature descriptors are kept separated but are combined
in a single classifier generating the final hypothesis. This algorithm is described in section 3.2.

3.1    Discriminative Accumulation Scheme
The Discriminative Accumulation Scheme is an integration scheme for multiple cues that does not
neglect any cue contribution. It is based on a weak coupling method called accumulation. The
main idea of this method is that information from different cues can be summed together.
Figure 2: Difference between random sampling and an interest point detector. In (a) the four
most frequent visual words in the image are drawn, each with a different color. In (b) the result
of standard SIFT extraction at the same octave used in (a).

    Suppose we are given M object classes and, for each class, a set of $N_j$ training images
$\{I_i^j\}_{i=1}^{N_j}$, $j = 1, \dots, M$. For each image, we extract a set of P different cues:




$$T_p = T_p(I_i^j), \quad p = 1, \dots, P \qquad (2)$$

so that for an object class j we have P new training sets $\{T_p(I_i^j)\}_{i=1}^{N_j}$, $j = 1, \dots, M$,
$p = 1, \dots, P$. For each of them we train an SVM. Kernel functions may differ from cue to cue
and model parameters can be estimated during the training step via cross validation. Given a test
image $\hat{I}$ and assuming $M \geq 2$, for each single-cue SVM we compute the distance from the
separating hyperplane:

$$D_j(p) = \sum_{i=1}^{m_j^p} \alpha_{ij}^p \, y_{ij} \, K_p\!\left(T_p(I_i^j), T_p(\hat{I})\right) + b_j^p. \qquad (3)$$

After collecting all the distances $\{D_j(p)\}_{p=1}^{P}$ for all the objects $j = 1, \dots, M$ and all the
cues $p = 1, \dots, P$, we classify the image $\hat{I}$ using the linear combination:

$$j^* = \operatorname*{argmax}_{j=1,\dots,M} \left\{ \sum_{p=1}^{P} a_p D_j(p) \right\}, \qquad a_p \in \mathbb{R}^{+}. \qquad (4)$$

The coefficients $\{a_p\}_{p=1}^{P}$ are evaluated via cross validation during the training step.
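    As an illustration of the accumulation step, here is a minimal sketch in Python with
scikit-learn, assuming one one-vs-all SVM per cue trained on a precomputed Gram matrix;
das_predict and its arguments are our names for this sketch, and decision_function is
assumed to return the (n_test, M) matrix of per-class distances of equation (3):

    import numpy as np

    def das_predict(svms, test_grams, weights):
        # svms:       P trained sklearn.svm.SVC(kernel="precomputed") models,
        #             all fit on the same training labels
        # test_grams: P kernel matrices, each of shape (n_test, n_train)
        # weights:    P accumulation coefficients a_p from cross validation
        total = sum(a * svm.decision_function(K)
                    for svm, K, a in zip(svms, test_grams, weights))
        # total has shape (n_test, M): the accumulated distances of (4);
        # the predicted class maximizes the weighted sum
        return np.argmax(total, axis=1)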


3.2    Multi Cue Kernel
DAS can be defined as a high-level integration scheme, as fusion is performed as a post-processing
step after the single-cue classification stage. As an alternative, we developed a mid-level integra-
tion scheme based on multi-class SVM with a Multi Cue Kernel $K_{MC}$. This new kernel combines
different features extracted from images; it is a Mercer kernel, as positively weighted linear com-
binations of Mercer kernels are themselves Mercer kernels [1]:

$$K_{MC}(\{T_p(I_i)\}_p, \{T_p(I)\}_p) = \sum_{p=1}^{P} a_p K_p(T_p(I_i), T_p(I)). \qquad (5)$$
In this way it is possible to perform only one classification step, identifying the best weighting
factors $a_p$ while optimizing the other kernel parameters. Another advantage of this approach is
that it makes it possible to work both with one-vs-all and one-vs-one SVM extensions to the
multiclass problem.
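    With precomputed Gram matrices, equation (5) amounts to a weighted sum of the per-cue
kernel matrices. A sketch (our naming; the weights and C in the usage comments are the MCK oa
values from Table 2, used purely as an example):

    import numpy as np
    from sklearn.svm import SVC

    def multi_cue_gram(grams, weights):
        # grams:   P Gram matrices K_p of identical shape, one per cue
        # weights: P nonnegative coefficients a_p (equation (5))
        return sum(a * K for a, K in zip(weights, grams))

    # Example usage with two cues:
    # K_train = multi_cue_gram([K_sift_train, K_pixel_train], [0.80, 0.20])
    # clf = SVC(C=5, kernel="precomputed").fit(K_train, y_train)
    # K_test = multi_cue_gram([K_sift_test, K_pixel_test], [0.80, 0.20])
    # y_pred = clf.predict(K_test)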


4    Experiments
Our experiments started by evaluating the performance of local and global features separately,
before testing our integration methods. Two sets of experiments using single-cue SVMs were run
to select the best kernel parameters through cross validation. The original dataset was divided into
three parts: training, validation and testing. We merged them together and extracted 5 random
and disjoint train/test splits of 10000/1000 images. We considered as the best parameters the ones
giving the best average score on the 5 splits. Note that, according to the method used for the score
evaluation, the best average score does not necessarily correspond to the best recognition rate.
Besides yielding the optimal parameters, these experiments showed that the SIFT features
outperform the raw pixel ones. This was predictable, since last year's ImageCLEF competition
results showed that local features are generally more informative than global features for the
annotation task.
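    As a sketch of this split procedure (our own code, with the pooled dataset simply represented
by integer indices):

    import numpy as np

    def make_splits(n_total=11000, n_train=10000, n_splits=5, seed=0):
        # Draw 5 random permutations of the pooled dataset; in each one the
        # first 10000 indices form the training set and the remaining 1000
        # the disjoint test set.
        rng = np.random.RandomState(seed)
        return [(perm[:n_train], perm[n_train:])
                for perm in (rng.permutation(n_total) for _ in range(n_splits))]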
    Then we adopted the same experimental setup for DAS and MCK. In particular, for DAS we
used the distances from the separating hyperplanes associated with the best results of the previous
step, so cross validation was used only to search for the best cue integration weights. On the
other hand, for MCK cross validation was applied to look for the best kernel parameters and
the best feature weights at the same time. In both cases the weights could vary from 0 to 1.
    Finally we used the results of the previous phases to run our submission experiment on the
1000 unlabeled images of the challenge test set using all the 11000 images of the original dataset
as training.
    The ranking, name and score of our submitted runs, together with the score gain with respect
to the best run of the other participants, are listed in Table 1. Our two runs based on the MCK
algorithm ranked first and second among all submissions, confirming the effectiveness of using
multiple cues for automatic image annotation. It is interesting to note that even though DAS has
a higher recognition rate, its score is worse than that obtained using the SIFT features alone. This
could be due to the fact that when the label predicted by the global approach, the raw pixels, is
wrong, the true label is far from the top of the decision ranking.
    Table 2 summarizes the parameters used for our runs and the number of support vectors
obtained. As expected, the best feature weight (see (4) and (5)) for SIFT is higher than that for
raw pixels for all the integration methods. The number of support vectors for the MCK run using
the one-vs-all multiclass SVM extension (MCK oa) is slightly higher than that used by the single-cue
SIFT oa but lower than that used by PIXEL oa. For the MCK run using the one-vs-one multiclass
SVM extension (MCK oo) the number of support vectors is even lower than that of both the single
cues SIFT oo and PIXEL oo. These results show that combining two features with the MCK
algorithm can simplify the classification problem. For DAS we counted the support vectors by
summing the ones from SIFT oa and PIXEL oa, counting only once the support vectors associated
with training images common to the two single cues. The number of support vectors for DAS
exceeds that obtained for both MCK oa and MCK oo, indicating a more complex classification
problem.
    Table 3 shows in detail some examples of classification results. The first, second and third
columns contain examples of images misclassified by one of the two cues but correctly classified by
DAS and MCK oa. The fourth column shows an example of an image misclassified by both cues
and by DAS but correctly classified by MCK oa. It is interesting to note that combining local and
global features can be useful to recognize images even when they are compromised by the presence
of artifacts, which for medical images can be prostheses or reference labels placed on the acquisition
screen.
    A deeper analysis of our results can be done by considering the performance of the single-cue,
discriminative accumulation and multi-cue kernel approaches for each class. In Table 4 the number
      Rank             Name                         Score                Gain              Rec. rate
        1      BLOOM-BLOOM MCK oa              26.8470167911         4.0828086669           89.7%
        2      BLOOM-BLOOM MCK oo              27.5449911826         3.3848342754           89.0%
        3       BLOOM-BLOOM SIFT oo            28.7301320009         2.1996934571           88.4%
        4       BLOOM-BLOOM SIFT oa              29.45575794         1.474067518            88.5%
        5        BLOOM-BLOOM DAS               29.9033537771         1.0264716809           88.9%
       28      BLOOM-BLOOM PIXEL oa            68.2130545639        −37.2832291059          79.9%
       29      BLOOM-BLOOM PIXEL oo             72.410704904        −41.4808794460          79.2%

Table 1: Ranking of our submitted runs: name, score, gain with respect to the best run of the
other participants, and recognition rate.


        Rank            Name                      γsif t   γpixel   C    asif t   apixel     #SV
          1      BLOOM-BLOOM MCK oa                0.5       5       5   0.80      0.20      7916
          2     BLOOM-BLOOM MCK oo                 0.1      1.5     20   0.90      0.10      7037
          3      BLOOM-BLOOM SIFT oo              0.05              40                       7173
          4      BLOOM-BLOOM SIFT oa              0.25              10                       7704
          5       BLOOM-BLOOM DAS                 0.25       5      10   0.76     0.24       9090
         28     BLOOM-BLOOM PIXEL oa                         5      10                       8329
         29     BLOOM-BLOOM PIXEL oo                         3      20                       7381

Table 2: Best parameters obtained by cross validation and used for the classification, together
with the number of support vectors for each of our submitted runs.



of images correctly recognized for each class is listed; it is possible to note that in a few cases
PIXEL oa outperforms SIFT oa, and to observe where MCK oa outperforms both SIFT oa and
DAS. The difference between our approaches can be better evaluated by considering the confusion
matrices, shown as images in Figure 3. We ordered the classes following the way in which they
are listed in Table 4 and used a colormap corresponding to the number of images, varying from
zero to five, to let the misclassified images stand out. It is clear that our methods differ
principally in how the wrong images are labeled. The more a matrix presents sparse values off
the diagonal and far away from it, the worse the method is.




                        PIXEL oa       11◦         1◦         12◦        5◦
                         SIFT oa       1◦          2◦         2◦         5◦
                           DAS         1◦          1◦         1◦         2◦
                         MCK oa        1◦          1◦         1◦         1◦

Table 3: Example of images misclassified by one or both cues and correctly classified by DAS or
MCK. The values correspond to the decision rank.
Each column group in Table 4 below lists the class (IRMA code) followed by five per-class counts:
the number of test images correctly recognized by each of SIFT oa, PIXEL oa, DAS and MCK oa,
and the total number of test images in the class (TOT).
    1121-110-213-700   0         0        0      0         3     1121-120-516-700   2         2        2      2         4     1121-210-331-700   0         0        0      0         1     1121-240-441-700   4         4        4      4         5
    1121-110-411-700   13        11       12     8         14    1121-120-517-700   2         2        2      3         3     1121-220-213-700   2         2        2      1         2     1121-240-442-700   4         4        4      4         4
    1121-110-414-700   38        38       38     35        38    1121-120-800-700   22        22       22     22        22    1121-220-230-700   17        17       17     17        17    1121-320-941-700   10        10       10     10        10
    1121-110-415-700   9         9        8      6         9     1121-120-911-700   5         5        5      5         6     1121-220-310-700   7         7        7      6         7     1121-420-212-700   4         2        3      4         4
    1121-115-700-400   13        13       13     12        13    1121-120-914-700   6         6        6      6         6     1121-220-330-700   0         0        0      0         1     1121-420-213-700   4         4        4      4         4
    1121-115-710-400   2         2        2      2         3     1121-120-915-700   5         5        5      5         6     1121-228-310-700   1         1        1      0         1     1121-430-213-700   8         8        8      6         9
    1121-116-917-700   1         1        1      1         1     1121-120-918-700   2         2        2      3         3     1121-229-310-700   1         0        1      1         1     1121-430-215-700   0         0        0      0         1
    1121-120-200-700   33        33       33     33        34    1121-120-919-700   2         2        2      1         2     1121-230-462-700   2         2        2      2         2     1121-460-216-700   1         1        1      2         2
    1121-120-310-700   20        20       20     20        20    1121-120-921-700   9         9        9      5         9     1121-230-463-700   5         5        5      5         5     1121-490-310-700   1         1        1      0         1
    1121-120-311-700   3         3        3      3         3     1121-120-922-700   11        11       11     7         11    1121-230-911-700   0         0        0      0         2     1121-490-415-700   6         6        6      6         6
    1121-120-320-700   11        11       11     9         11    1121-120-930-700   0         0        0      0         2     1121-230-914-700   1         1        1      0         1     1121-490-915-700   4         4        4      4         4
    1121-120-330-700   22        23       22     20        23    1121-120-933-700   0         0        0      0         1     1121-230-915-700   1         1        1      1         1     1122-220-333-700   0         0        0      0         0
    1121-120-331-700   1         0        1      1         1     1121-120-934-700   2         1        2      1         2     1121-230-921-700   7         7        7      6         7     1123-110-500-000   84        78       80     69        91
    1121-120-413-700   3         3        3      0         3     1121-120-942-700   9         10       9      7         10    1121-230-922-700   6         6        6      6         8     1123-112-500-000   0         0        0      0         5
    1121-120-421-700   4         4        4      4         5     1121-120-943-700   9         9        9      9         10    1121-230-930-700   0         0        0      0         1     1123-121-500-000   5         5        5      5         8
    1121-120-422-700   4         4        4      3         4     1121-120-950-700   0         1        0      0         1     1121-230-934-700   1         1        1      0         2     1123-127-500-000   182       184      184    172       196
    1121-120-433-700   1         2        1      1         2     1121-120-951-700   1         2        2      0         3     1121-230-942-700   9         9        9      9         9     1123-211-500-000   89        89       89     88        89
    1121-120-434-700   2         2        2      0         2     1121-120-956-700   1         0        0      0         2     1121-230-943-700   7         6        6      6         7     1124-310-610-625   6         6        6      6         6
    1121-120-437-700   0         0        0      0         1     1121-120-961-700   3         3        3      3         4     1121-230-950-700   0         0        0      0         1     1124-310-620-625   7         6        6      6         7
    1121-120-438-700   0         0        0      0         1     1121-120-962-700   5         4        5      3         5     1121-230-953-700   0         0        0      0         1     1124-410-610-625   7         7        7      7         7
    1121-120-441-700   4         4        4      2         5     1121-127-700-400   0         0        0      0         3     1121-230-961-700   4         4        4      3         4     1124-410-620-625   7         7        7      7         7
    1121-120-442-700   3         3        3      3         4     1121-127-700-500   0         0        0      0         1     1121-230-962-700   2         2        2      0         3     1121-120-91a-700   0         0        0      0         1
    1121-120-451-700   1         1        1      0         1     1121-129-700-400   1         1        1      1         1     1121-240-413-700   0         0        0      0         2     1121-12f-466-700   0         0        0      0         1
    1121-120-452-700   0         0        0      0         1     1121-200-411-700   9         7        7      4         13    1121-240-421-700   4         4        4      3         5     1121-12f-467-700   2         2        2      2         2
    1121-120-454-700   0         0        0      0         1     1121-210-213-700   1         1        1      1         1     1121-240-422-700   2         2        2      1         3     1121-4a0-310-700   2         2        2      0         2
    1121-120-462-700   4         4        4      5         5     1121-210-230-700   13        13       13     11        13    1121-240-433-700   1         1        1      0         3     1121-4a0-414-700   8         7        8      8         8
    1121-120-463-700   7         7        7      6         7     1121-210-310-700   10        10       10     9         10    1121-240-434-700   1         1        1      0         3     1121-4a0-914-700   3         3        3      2         5
    1121-120-514-700   1         1        1      0         2     1121-210-320-700   11        11       11     10        12    1121-240-437-700   0         0        0      0         2     1121-4a0-918-700   0         1        1      0         1
    1121-120-515-700   3         3        3      3         3     1121-210-330-700   20        20       20     18        21    1121-240-438-700   0         0        0      0         1     1121-4b0-233-700   4         4        4      3         4



Table 4: Performance of the single-cue, discriminative accumulation and multi-cue kernel ap-
proaches for each class.


5        Conclusions
This paper presented a discriminative multi-cue approach to medical image annotation. We com-
bined global and local information using two alternative fusion strategies: the discriminative ac-
cumulation scheme [7] and the Multi Cue Kernel. The latter method gave the best performance,
obtaining a score of 26.85, which ranked first among all submissions.
    This work can be extended in many ways. First, we would like to use various types of local
and global descriptors, so as to select the best features for the task. Second, we would like to add
shape descriptors to our fusion scheme, which should result in a better performance. Finally,
our algorithm does not exploit at the moment the natural hierarchical structure of the data, but
we believe that this information is crucial for achieving significant improvements in performance.
Future work will explore these directions.


Acknowledgments
This work was supported by the ToMed.IM2 project (B. C. and F. O), under the umbrella of the
Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information
Management (IM2, www.im2.ch), and by the Blanceflor Boncompagni Ludovisi foundation (T. T.,
www.blanceflor.se). The support is gratefully acknowledged.


References
[1] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other
    Kernel-Based Learning Methods). CUP, 2000.
[2] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
[3] T. M. Lehmann, H. Schubert, D. Keysers, M. Kohnen, and B. B. Wein. The IRMA code for
    unique classification of medical images. In Proceedings of SPIE Medical Imaging, volume 5033,
    pages 440–451, May 2003.

Figure 3: These images represent the confusion matrices for (a) SIFT oa, (b) PIXEL oa,
(c) DAS and (d) MCK oa. We ordered the classes following the way in which they are listed in
Table 4 and used a colormap corresponding to the number of images, varying from zero to five, to
let the misclassified images stand out. All the positions in the matrices containing five or more
images appear dark red.


[4] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the In-
    ternational Conference on Computer Vision (ICCV), volume 2, pages 1150–1157, Washington,
    DC, USA, 1999. IEEE Computer Society.
[5] M. O. Güld, M. Kohnen, D. Keysers, H. Schubert, B. B. Wein, J. Bredno, and T. M. Lehmann.
    Quality of DICOM header information for image categorization. In Proceedings of SPIE Medical
    Imaging, volume 4685, pages 280–287, 2002.
[6] Henning Müller, Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer, Thomas M.
    Deserno, Paul Clough, and William Hersh. Overview of the ImageCLEFmed 2007 medical
    retrieval and annotation tasks. In Working Notes of the 2007 CLEF Workshop, Budapest,
    Hungary, September 2007.
[7] M. E. Nilsback and B. Caputo. Cue integration through discriminative accumulation. In Pro-
    ceedings of the International Conference on Computer Vision and Pattern Recognition, 2004.
[8] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification.
    In Proceedings of the European Conference on Computer Vision, 2006.
[9] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine,
    6(3):21–45, 2006.