=Paper=
{{Paper
|id=Vol-1171/CLEF2005wn-ImageCLEF-OGuldEt2005
|storemode=property
|title=Combining Global features for Content-based Retrieval of Medical Images
|pdfUrl=https://ceur-ws.org/Vol-1171/CLEF2005wn-ImageCLEF-OGuldEt2005.pdf
|volume=Vol-1171
|dblpUrl=https://dblp.org/rec/conf/clef/GuldTFL05a
}}
==Combining Global features for Content-based Retrieval of Medical Images==
Mark O Güld, Christian Thies, Benedikt Fischer, and Thomas M Lehmann
Department of Medical Informatics, RWTH Aachen, Aachen, Germany
mgueld@mi.rwth-aachen.de
Abstract
A combination of several classifiers using global features for the content description
of medical images is proposed. Besides well-known texture histogram features, down-scaled
representations of the original images are used, which preserve spatial information
and employ distance measures that are robust with respect to common variations in
radiation dose, translation, and local deformation. These features were evaluated for
the annotation task and the interactive query task in ImageCLEF 2005 without using
additional textual information or query refinement mechanisms. For the annotation
task, a categorization rate of 86.7% was obtained, which ranks second among all submissions.
When applied in the interactive query task, the image content descriptors
yielded a mean average precision (MAP) of 0.0751, which is rank 14 of 28 submitted
runs. As the image deformation model is not fit for interactive retrieval tasks, two
mechanisms are evaluated regarding the trade-off between loss of accuracy and gain
in speed: hierarchical filtering and prototype selection.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval;
General Terms
Measurement, Performance, Experimentation
Keywords
image retrieval, classifier combination, hierarchical filtering, prototype selection
1 Introduction
ImageCLEFmed 2005 consists of several challenges for content-based retrieval [1] of medical images.
For the retrieval task, the reference set was expanded to over 50,000 images, compared to
8,725 medical images in 2004. A newly introduced annotation task poses a classification problem:
1,000 query images without any additional textual information are to be mapped onto one of 57 pre-defined
categories. The mapping is to be learned from a ground truth of 9,000 categorized reference
images. These tasks reflect the real-life constraints of content-based image retrieval in medical
applications: image corpora are large and heterogeneous, and additional textual information about
an image, especially about its content, is not always reliable due to improper configuration of the imaging
devices, ambiguous naming schemes, and both inter- and intra-observer variability.
2 The annotation task
The annotation task consists of 9,000 images grouped into 57 categories and 1,000 images to be
automatically categorized. It should be noted that the category definition is based solely on the
aspects of
1. imaging modality, i.e. identification of the imaging device (three different device types)
2. imaging direction, i.e. relative position of the body part towards the imaging device
3. anatomy of the body part examined, and
4. biological system, which encodes certain contrast agents and a coarse description of the
diagnostic motivation for the imaging.
Thus, the category definition does not incorporate any diagnosis information, e.g. the detection
of pathologies or their quantitative analysis.
2.1 Image features and their comparison
Based on earlier experiments conducted on a similar image set, three types of features and similarity
measures were employed [2].
Tamura et al. proposed a set of texture features to capture global texture properties of an
image, namely coarseness, contrast, and directionality [3]. This information is stored in a three-dimensional
histogram, which is quantized into M = 6 × 8 × 8 = 384 bins. To capture this
texture information at a comparable scale, the extraction is performed on down-scaled images of
size 256 × 256, ignoring their aspect ratio. Two images q (query) and r (reference) are compared
by applying the Jensen-Shannon divergence [4] to their histograms H(q) and H(r):
$$d_{\mathrm{JSD}}(q,r) = \frac{1}{2} \sum_{m=1}^{M} \left[ H_m(q) \log \frac{2 H_m(q)}{H_m(q) + H_m(r)} + H_m(r) \log \frac{2 H_m(r)}{H_m(q) + H_m(r)} \right] \qquad (1)$$
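The following is a minimal NumPy sketch of Eq. (1); the function name jsd, the histogram normalization, and the epsilon guard are our own illustration, not part of the paper's implementation:

```python
import numpy as np

def jsd(h_q: np.ndarray, h_r: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two histograms (Eq. 1).

    h_q, h_r: 1-D arrays with M bins each; they are normalized to sum
    to 1 so the measure is independent of image size (our assumption).
    eps guards against log(0) for empty bins.
    """
    p = h_q / (h_q.sum() + eps)
    q = h_r / (h_r.sum() + eps)
    m = p + q  # equals 2x the midpoint distribution
    return 0.5 * float(np.sum(p * np.log((2 * p + eps) / (m + eps))
                              + q * np.log((2 * q + eps) / (m + eps))))
```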
To keep spatial information about the image content, down-scaled representations of the orig-
inal images are used and the accompanying distance measures work directly on intensity values.
It is therefore possible to incorporate a priori knowledge into the distance measure by modelling
typical variability in the image data, which does not alter the category that the image belongs to.
The cross-correlation function (CCF) from signal processing determines the maximum correlation
between two 2D image representations, each one of size h × h:
$$s_{\mathrm{CCF}}(q,r) = \max_{|m|,|n| \le d} \left\{ \frac{\sum_{x=1}^{h} \sum_{y=1}^{h} \left( r(x-m, y-n) - \bar{r} \right) \cdot \left( q(x,y) - \bar{q} \right)}{\sqrt{\sum_{x=1}^{h} \sum_{y=1}^{h} \left( r(x-m, y-n) - \bar{r} \right)^2 \cdot \sum_{x=1}^{h} \sum_{y=1}^{h} \left( q(x,y) - \bar{q} \right)^2}} \right\} \qquad (2)$$
Here, q(x, y) and r(x, y) refer to intensity values at a pixel position in the scaled representations
of q and r, respectively. Note that $s_{\mathrm{CCF}}$ is a similarity measure whose values lie between 0 and 1.
CCF includes robustness regarding two very common variabilities among the images: translation,
which is explicitly tested within the search window of size 2d + 1, and radiation dose, which is
normalized by subtracting the average intensity values $\bar{q}$ and $\bar{r}$. For the experiments, downscaling
to 32 × 32 pixels and a translation window of size d = 4 were used, i.e. the translation can vary from
−4 to +4 pixels in both the x- and y-direction.
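A brute-force sketch of Eq. (2) under these settings might look as follows; the function name ccf is ours, and the wrap-around border handling via np.roll is a simplifying assumption, as the paper does not specify how borders are treated:

```python
import numpy as np

def ccf(q: np.ndarray, r: np.ndarray, d: int = 4) -> float:
    """Cross-correlation similarity (Eq. 2) for two h-by-h images.

    The reference r is shifted by (m, n) within a (2d+1)^2 window;
    subtracting the means normalizes for radiation dose. Returns the
    maximum normalized correlation over all tested shifts.
    """
    q0 = q - q.mean()
    best = -1.0
    for m in range(-d, d + 1):
        for n in range(-d, d + 1):
            # shift r by (m, n); np.roll wraps at the borders, which
            # is a simplification of the unspecified boundary handling
            r_shift = np.roll(np.roll(r, m, axis=0), n, axis=1)
            r0 = r_shift - r_shift.mean()
            denom = np.sqrt((r0 ** 2).sum() * (q0 ** 2).sum())
            if denom > 0:
                best = max(best, float((r0 * q0).sum() / denom))
    return max(best, 0.0)  # the paper reports values in [0, 1]
```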
While $s_{\mathrm{CCF}}$ accounts for global displacements, i.e. translations of entire images, pathologies,
implants, and normal inter-patient variability suggest modelling local deformations for medical images.
This can be done with an image distortion model (IDM) [5]:
$$d_{\mathrm{IDM}}(q,r) = \sum_{x=1}^{X} \sum_{y=1}^{Y} \min_{|x'|,|y'| \le W_1} \left\{ \sum_{|x''|,|y''| \le W_2} \left\| r(x + x' + x'', y + y' + y'') - q(x + x'', y + y'') \right\|^2 \right\} \qquad (3)$$
Again, q(x, y) and r(x, y) refer to intensity values of the scaled representations. Note that each
pixel of q must be mapped onto some pixel in r, whereas not all pixels of r need to be the target of a
mapping. Two parameters steer $d_{\mathrm{IDM}}$: $W_1$ defines the size of the neighborhood when searching for
a corresponding pixel. To prevent a totally unordered pixel mapping, it is useful to incorporate
the local neighborhood as context when evaluating a correspondence hypothesis. The size of this
context information is controlled by $W_2$. For the experiments, $W_1 = 2$, i.e. a 5 × 5 pixel search
window, and $W_2 = 1$, i.e. a 3 × 3 context patch, are used. Better results are obtained if
gradient images are used instead of the original images, because the correspondence search
then focuses on contrast and is robust to global intensity differences due to varying radiation dose. It should
be noted that this distance measure is computationally expensive, as each window size influences
the computation time in a quadratic manner. The images were scaled to a fixed height of 32 pixels,
i.e. the original aspect ratio was preserved.
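A direct, unoptimized rendering of Eq. (3) could look like this; the function name idm and the edge-padding border treatment are our assumptions:

```python
import numpy as np

def idm(q: np.ndarray, r: np.ndarray, w1: int = 2, w2: int = 1) -> float:
    """Image distortion model distance (Eq. 3), brute force.

    Every pixel of q (with its (2*w2+1)^2 context patch) searches for
    the best-matching patch in r within a (2*w1+1)^2 neighborhood.
    q and r are assumed to be gradient images of equal size.
    """
    X, Y = q.shape
    pad = w1 + w2
    rp = np.pad(r, pad, mode="edge")   # reference, padded for shift + context
    qp = np.pad(q, w2, mode="edge")    # query, padded for context only
    total = 0.0
    for x in range(X):
        for y in range(Y):
            q_patch = qp[x:x + 2 * w2 + 1, y:y + 2 * w2 + 1]
            best = np.inf
            for dx in range(-w1, w1 + 1):
                for dy in range(-w1, w1 + 1):
                    # patch in r centered at (x+dx, y+dy)
                    r_patch = rp[x + w1 + dx:x + w1 + dx + 2 * w2 + 1,
                                 y + w1 + dy:y + w1 + dy + 2 * w2 + 1]
                    best = min(best, float(((r_patch - q_patch) ** 2).sum()))
            total += best
    return total
```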
2.2 Nearest-neighbor classifier
To obtain a decision q 7→ c ∈ {1 . . . C} for a query image q, a nearest-neighbor classifier evaluating the k
nearest neighbors according to a distance measure is used (k-NN). It decides for the category
that accumulates the most votes among the k reference images closest to q. This classifier also
allows visual feedback in interactive queries.
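A minimal sketch of such a k-NN decision rule, assuming a generic distance function and parallel lists of references and category labels (all names are ours):

```python
from collections import Counter

def knn_vote(q, references, labels, dist, k=5):
    """k-NN decision: majority vote among the k references closest to q.

    references/labels: parallel lists; dist: any of the distance
    functions sketched above. The ranked neighbor list itself can be
    shown to the user as visual feedback in interactive queries.
    """
    ranked = sorted(range(len(references)),
                    key=lambda i: dist(q, references[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```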
2.3 Classifier combination
Prior experiments showed that the performance of the single classifiers can be improved significantly
if their decisions are combined [2]. This is especially the case for classifiers which
model different aspects of the image content, such as the global texture properties with no spatial
information and the scaled representations, which keep spatial information. The easiest way is
a parallel combination scheme, since it can be performed as a post-processing step after the single
classifier stage [6]. Also, no assumptions about the application are required, whereas a serial or
sieve-like combination requires an explicit construction.
For comparability, the distance values ($d_{\mathrm{Tamura}}$, $d_{\mathrm{IDM}}$) are first normalized over all distances
$d(q, r_i)$, $i = 1 \ldots N$ between the sample q and each reference $r_i$:

$$d'(q, r_i) = \frac{d(q, r_i)}{\sum_{n=1}^{N} d(q, r_n)} \qquad (4)$$
Afterwards, a new distance measure can be obtained as a weighted sum of two distance measures $d_1$, $d_2$:

$$d_c(q, r) = \lambda \cdot d'_1(q, r) + (1 - \lambda) \cdot d'_2(q, r), \quad \lambda \in [0; 1] \qquad (5)$$
For a similarity measure s, d(q, r) := 1 − s(q, r) is used and the normalization is performed
afterwards. Thus, the parallel combination of the three classifiers results in
$$d_{\mathrm{combined}}(q, r) = \lambda_{\mathrm{Tamura}} \cdot d'_{\mathrm{Tamura}}(q, r) + \lambda_{\mathrm{CCF}} \cdot d'_{\mathrm{CCF}}(q, r) + \lambda_{\mathrm{IDM}} \cdot d'_{\mathrm{IDM}}(q, r) \qquad (6)$$

with $\lambda_{\mathrm{Tamura}}, \lambda_{\mathrm{CCF}}, \lambda_{\mathrm{IDM}} \ge 0$ and $\lambda_{\mathrm{Tamura}} + \lambda_{\mathrm{CCF}} + \lambda_{\mathrm{IDM}} = 1$.
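Eqs. (4)–(6) amount to normalizing each distance list per query and summing with the weights; a sketch, assuming the per-reference distance arrays for one query have already been computed (names ours):

```python
import numpy as np

def normalize(d: np.ndarray) -> np.ndarray:
    """Eq. (4): scale the N distances of one query so they sum to 1."""
    return d / d.sum()

def combined_distance(d_tamura, s_ccf, d_idm, lam):
    """Eq. (6): weighted sum of normalized distances for one query.

    d_tamura, s_ccf, d_idm: arrays of length N (one entry per reference);
    s_ccf is a similarity in [0, 1] and is turned into a distance first.
    lam: dict of non-negative weights summing to 1.
    """
    return (lam["tamura"] * normalize(np.asarray(d_tamura))
            + lam["ccf"] * normalize(1.0 - np.asarray(s_ccf))
            + lam["idm"] * normalize(np.asarray(d_idm)))
    # rank references by ascending combined distance
```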
2.4 Training and evaluation on the reference set
The combined classification process relies on three parameters: $\lambda_{\mathrm{Tamura}}$, $\lambda_{\mathrm{CCF}}$, and k, the number
of nearest neighbors to be evaluated ($\lambda_{\mathrm{IDM}}$ is linearly dependent on the other two weights). To obtain suitable values,
the reference set of 9,000 images was split at random into a training set of 8,000 images and a test
set of 1,000 images. The best parameter values found for this configuration were then applied to
the 1,000 query images. For practical reasons, the matrices $D_{\mathrm{Tamura}} = (d_{\mathrm{Tamura}}(q_i, r_j))_{ij}$, $S_{\mathrm{CCF}} =
(s_{\mathrm{CCF}}(q_i, r_j))_{ij}$, and $D_{\mathrm{IDM}} = (d_{\mathrm{IDM}}(q_i, r_j))_{ij}$ are computed once. Afterwards, all combination
experiments can be performed rather quickly by processing these matrices.
2.5 Use of class prototypes
Since the distance computations for the scaled representations are rather expensive, there is, in
general, great interest in prototype selection, which saves computation time and storage space
and might even improve the categorization rate by removing possible outliers from the reference set.
Prototype sets can be obtained in various ways [7]. For simplicity, only random prototype
selection, KCentres for K = 1, and a simplified variation of it were used. Based on the empirically
optimized $d_{\mathrm{combined}}$, a set of category prototypes $R_{\mathrm{prototypes}} \subset R = \bigcup_{c=1 \ldots C} R_c$, with $R_c$ being
the set of all references belonging to category c, is determined via KCentres:

$$R_{\mathrm{prototypes}} = \bigcup_{c=1}^{C} \left\{ \arg\min_{r \in R_c} \sum_{r' \in R_c} d_{\mathrm{combined}}(r, r') \right\} \qquad (7)$$
These elements $\{r'_c\}$, $c = 1 \ldots C$, yield the smallest sum of distances to all members of their
respective category.
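Assuming the pairwise $d_{\mathrm{combined}}$ values over the reference set are available as a matrix, Eq. (7) reduces to a per-class argmin; a sketch with our own naming:

```python
import numpy as np

def kcentres_k1(dist_matrix: np.ndarray, labels: np.ndarray) -> list:
    """Eq. (7): per category, pick the reference whose summed distance
    to all members of that category is smallest (KCentres with K = 1).

    dist_matrix: N x N matrix of d_combined over the reference set;
    labels: length-N array of category indices. Returns one prototype
    index per category.
    """
    prototypes = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # sum of within-class distances for each candidate prototype
        sums = dist_matrix[np.ix_(idx, idx)].sum(axis=1)
        prototypes.append(int(idx[np.argmin(sums)]))
    return prototypes
```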
The prototypes are used to obtain a dissimilarity-space representation of the reference images
and the unknown images:
$$r \mapsto \left( d(r, r'_1), \ldots, d(r, r'_C) \right)^{\mathrm{T}} \in \mathbb{R}^C \qquad (8)$$

$$q \mapsto \left( d(q, r'_1), \ldots, d(q, r'_C) \right)^{\mathrm{T}} \in \mathbb{R}^C \qquad (9)$$
For the classification, two representations are compared using the Euclidean distance.
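A sketch of Eqs. (8)/(9) and the Euclidean 1-NN comparison; the dissimilarity vectors of the references would be precomputed offline, and all names are ours:

```python
import numpy as np

def embed(x, prototypes, dist):
    """Eqs. (8)/(9): C-dimensional dissimilarity representation of x."""
    return np.array([dist(x, p) for p in prototypes])

def classify(q, prototypes, ref_vectors, ref_labels, dist):
    """Euclidean 1-NN between dissimilarity vectors.

    ref_vectors: N x C array, one embedded vector per reference,
    e.g. ref_vectors = np.stack([embed(r, prototypes, dist) for r in refs]).
    """
    v = embed(q, prototypes, dist)
    i = int(np.argmin(np.linalg.norm(ref_vectors - v, axis=1)))
    return ref_labels[i]
```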
3 The retrieval task
The retrieval task uses 50,024 images as references and consists of 25 queries, which are given as a
combination of text information and query images, with some queries specifying both positive and
negative example images. While the image data for the annotation task only contains grayscale
images from mostly X-ray modalities (plain radiography, fluoroscopy, and angiography), the image
material in this task is much more heterogeneous: it also contains photographs, ultrasonic imaging,
and even scans of illustrations used for teaching. Note that the interactive task demands a higher
level of image understanding, since several of the 25 queries directly refer to the diagnosis of
medical images, which is often based on local image details, e.g. bone fractures or the detection
of emphysema in computed tomography (CT) images of the lungs.
3.1 Image features and their comparison
The content representations described in the previous section only use grayscale information, i.e.
color images are converted into grayscale by using standard color weighting:
$$Y = \frac{6969 \cdot R + 23434 \cdot G + 2365 \cdot B}{32768} \qquad (10)$$
Content representation                                        Categorization rate [%]
                                                              k=1      k=5
Tamura texture histogram, Jensen-Shannon divergence           69.3     70.3
32×32, CCF (9 × 9 translation window)                         81.6     82.6
X×32, IDM (gradients, 5 × 5 window, 3 × 3 context)            85.0     83.8
X×32, IDM (as above, 32×32 Euclidean 500-NN as filter)        84.1     83.5

Table 1: Results for single classifiers.

In general, however, color is the single most important discriminative feature type for stock-house
media, and the image corpus used for the interactive query task contains many photographs, color
scans of teaching material, and microscopic imaging. Therefore, a basic color feature was employed
which computes the mean, variance, and third-order moment for each of the color channels red, green,
and blue. This yields a 9-dimensional feature vector. The Euclidean distance with equal weights for
each color component is used to compute the distance between two vectors.
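A sketch of this 9-dimensional color moment feature; interpreting "third order moments" as the third central moment per channel is our reading, and the function name is ours:

```python
import numpy as np

def color_moments(img: np.ndarray) -> np.ndarray:
    """9-D feature: mean, variance, third central moment per channel.

    img: H x W x 3 array with channels R, G, B. The resulting vectors
    are compared with the plain Euclidean distance, all components
    weighted equally.
    """
    feats = []
    for c in range(3):
        ch = img[..., c].astype(np.float64).ravel()
        mu = ch.mean()
        feats += [mu, ((ch - mu) ** 2).mean(), ((ch - mu) ** 3).mean()]
    return np.array(feats)
```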
3.2 Summation scheme for queries consisting of multiple images
Some of the queries do not consist of a single example image, but use several images as a query
pool Q: positive and negative examples. For such queries, a simple summation scheme is used to
obtain an overall distance:
$$d(Q, r) = \sum_{i=1}^{|Q|} w_i \cdot d'(q_i, r), \quad Q = \{(q_i, w_i)\}, \quad w_i = \begin{cases} +1 & \text{if } q_i \text{ is a positive example} \\ -1 & \text{if } q_i \text{ is a negative example} \end{cases} \qquad (11)$$
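Eq. (11) in code is essentially a one-liner; a sketch with our own naming:

```python
def pooled_distance(query_pool, r, dist):
    """Eq. (11): signed sum of normalized distances over the query pool.

    query_pool: list of (q_i, w_i) pairs with w_i = +1 for positive and
    -1 for negative example images; dist: a normalized distance d'.
    References are ranked by ascending pooled distance.
    """
    return sum(w * dist(q, r) for q, w in query_pool)
```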
4 Results
All results were obtained non-interactively, i.e. without relevance feedback by a human user, and
without using textual information for the interactive query task.
4.1 Annotation task
Table 1 shows the categorization results obtained for the 1,000 unknown images using single
classifiers. As IDM is very expensive (see the running times below), a serial combination with a faster,
but less accurate classifier as a filter was also tested. For this, the Euclidean distance on 32 × 32
scaled representations was used and only the 500 closest references were passed on to IDM. This cuts the
computation time down to 1/18, as the cost of the filtering step is negligible.
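A sketch of this serial combination, reusing the hypothetical idm() from the earlier sketch; the thumbnail prefilter and the 500-candidate cutoff follow the text, everything else is our assumption:

```python
import numpy as np
from collections import Counter

def filtered_idm_knn(q_small, q_grad, refs_small, refs_grad, labels,
                     k=1, top=500):
    """Serial combination: cheap Euclidean prefilter, then IDM.

    refs_small: 32x32 thumbnails for the filter stage;
    refs_grad: gradient images for the expensive IDM stage.
    Only the `top` closest references reach the IDM computation.
    """
    d_euclid = np.array([((r - q_small) ** 2).sum() for r in refs_small])
    candidates = np.argsort(d_euclid)[:top]        # 500-NN filter stage
    d_idm = {int(i): idm(q_grad, refs_grad[i]) for i in candidates}
    ranked = sorted(d_idm, key=d_idm.get)[:k]      # k-NN on the survivors
    return Counter(labels[i] for i in ranked).most_common(1)[0][0]
```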
Obtaining the optimal empirical weighting coefficients λ for the parallel combination would have
required an exhaustive search. Instead, a more time-efficient two-step process was
employed: first, a combination of the two spatial representations was considered; afterwards, a
combination of the combined spatial representations with the Tamura texture feature was investigated.
Both runs tested values for λ increasing at a step size of 0.1. The results of these two steps
on the testing set, i.e. 1,000 images of the 9,000 original reference images, are shown in Tab. 2.
This results in the parameter configuration shown in Tab. 3. When using this parameter set
for the classification of the 1,000 images to be categorized, a categorization rate of 86.7% for the
1-NN is obtained.
Using the 57 prototypes obtained via (7) as a representation set, a dissimilarity-space representation
of the reference set and the unknown set was computed. The dissimilarity representations
were then compared using the Euclidean distance. In addition, not only the elements with the minimum
sum of distances were used, but also the best n, n = 2 . . . 5, elements per
category. This yields 114, 171, 228, and 285 components for the representation vectors. For comparison,
experiments were also run with a random pick of 1 . . . 5 elements per category, resulting
in representation vectors of the same sizes. The results are shown in Fig. 1.
Combination of IDM and CCF:

λIDM   λCCF   k=1    k=5
0.0    1.0    82.5   80.9
0.1    0.9    84.0   82.0
0.2    0.8    84.8   83.5
0.3    0.7    85.6   84.1
0.4    0.6    85.6   84.3
0.5    0.5    85.5   84.5
0.6    0.4    86.2   84.2
0.7    0.3    86.6   84.5
0.8    0.2    85.9   84.0
0.9    0.1    85.5   82.8
1.0    0.0    84.7   82.6

Combination of (IDM, CCF) and Tamura texture:

λIDM,CCF   λTamura   k=1    k=5
0.0        1.0       70.8   69.0
0.1        0.9       78.1   76.9
0.2        0.8       81.4   80.7
0.3        0.7       83.5   83.1
0.4        0.6       84.6   84.0
0.5        0.5       85.5   84.6
0.6        0.4       86.7   85.2
0.7        0.3       86.7   85.1
0.8        0.2       86.6   85.1
0.9        0.1       86.6   84.9
1.0        0.0       86.6   84.5

Table 2: Categorization rates [%] on the testing subset for the combination of IDM and CCF (top)
and the combination of IDM, CCF, and Tamura texture (bottom).
λIDM λCCF λTamura k
0.42 0.18 0.4 1
Table 3: Empirical weighting parameters for the annotation task.
[Figure 1: Results for single classifiers using the dissimilarity representation: categorization rate
(roughly 55–85%) over the number of prototypes per category (1–5), for IDM, CCF, and Tamura
with normal, center-based, and random prototype selection (IDM with 1-NN, CCF and Tamura with 5-NN).]
λIDM   λTamura   λCCF   λRGB   MAP
0.4    0.4       0.2    0.0    0.0659
0.36   0.36      0.18   0.1    0.0751

Table 4: Results for the interactive query task.
Content representation                                        Extraction [s]    Query [s]
                                                              (per reference)   (per sample)
Tamura texture histogram, Jensen-Shannon divergence           5                 <1
32×32, CCF (9 × 9 translation window)                         3                 6
X×32, IDM (gradients, 5 × 5 window, 3 × 3 context)            3                 190
X×32, IDM (as above, 32×32 Euclidean 500-NN as filter)        6                 9

Table 5: Running times for the annotation task.
4.2 Retrieval task
Since no ground truth for an automatic optimization of the parameters is available, only a short
visual inspection was done and two runs were submitted. The results are listed in Tab. 4. The
result quality is measured by the mean average precision (MAP).
These results rank 19th and 14th among the 28 submitted runs in the “visual only, automatic”
category of this task, reaching half the MAP of the leading competitor in this category.
4.3 Running times
Table 5 lists the computation times of the algorithms for the annotation task on a standard
Pentium 4 PC running at 2.6 GHz. For the retrieval task, the extraction times per image are identical,
and the query time is 5.5 times higher, as there are 50,000 references compared to 9,000.
5 Discussion
Concerning running times, the texture features by Tamura and the CCF are fit for interactive use. By
pre-filtering with a computationally inexpensive distance measure, the computation time can be severely
reduced without sacrificing too much accuracy. In the experiments, pre-filtering clearly outperformed
the dissimilarity-space approaches for both random prototype selection and KCentres with K = 1. However,
a further evaluation of algorithms for prototype selection is necessary. The parallel combination of
single classifiers proved very useful, as it improves the categorization results considerably and can
be performed as a simple post-processing step on the distance matrices.
5.1 Annotation task
The results obtained for the annotation task confirm the results obtained on a smaller corpus using
leave-one-out evaluation [2].
Note that the rather high weight $\lambda_{\mathrm{Tamura}}$ overemphasizes the role of the texture features in
the experiments, as the actual improvement of the categorization rate is statistically insignificant
for the 1-NN. However, it marginally improves the quality of the next nearest neighbors, as seen in the
results for the 5-NN, which produces slightly better results for interactive queries that list a set
of nearest neighbors.
Query Semantic constraint
2 fracture of the femur
10 emphysema in lung CT
12 enlarged heart in PA chest radiograph
15 gross pathologies of myocardial infarction
16 osteoarthritis in hand
17 micro nodules in lung CT
18 tuberculosis in chest radiograph
19 Alzheimer’s disease in microscopic pathologies
20 chronic myelogenous leukemia in microscopic pathologies
21 bone fracture(s) in radiograph
23 differentiate between malignant and benign melanoma
24 right middle lobe pneumonia
Table 6: Queries in the interactive retrieval task which directly refer to diagnoses.
5.2 Interactive query task
While the results were satisfactory for queries based on grayscale radiographs, other queries, especially
those based on photographic imaging, gave rather poor results. It should also be noted that several queries
demand a high level of image content understanding, as they aim at diagnosis-related information,
which is often derived from local details in the image (see Tab. 6).
The methods used in this work to describe the image content either preserve no spatial information
at all (texture features by Tamura) or capture it only at a very coarse scale, omitting local details
important for diagnosis-related questions. Using only the image information, such queries cannot
be processed with satisfactory result quality by a one-level approach. Referring to the
concepts described in [8], the methods employed in this paper work on the categorization layer of
the content abstraction chain. For a better query completion, subsequent image abstraction steps
are required.
6 Acknowledgment
This work is part of the IRMA project, which is funded by the German Research Foundation,
grant Le 1108/4.
References
[1] Smeulders AWM, Worring M, Santini S, Gupta A, Jain R. Content-based image retrieval at
the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence
22(12): 1349-1380, 2000.
[2] Güld MO, Keysers D, Deselaers T, Leisten M, Schubert H, Ney H, Lehmann TM. Comparison
of global features for categorization of medical images. Proceedings of SPIE 2004; 5371: 211-222.
[3] Tamura H, Mori S, Yamawaki T. Textural features corresponding to visual perception. IEEE
Transactions on Systems, Man, and Cybernetics; SMC-8(6), 460-472, 1978.
[4] Puzicha J, Rubner Y, Tomasi C, Buhmann J. Empirical evaluation of dissimilarity measures
for color and texture. Proceedings International Conference on Computer Vision, 2, 1165-1173,
1999.
[5] Keysers D, Gollan C, Ney H. Classification of medical images using non-linear distortion mod-
els. Proceedings BVM 2004, Bildverarbeitung für die Medizin 2004, Springer-Verlag, Berlin,
366-370, 2004.
[6] Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(1): 4-36, 2000.
[7] Pekalska E, Duin RPW, Paclik P. Prototype selection for dissimilarity-based classification.
Pattern Recognition, to appear, 2005.
[8] Lehmann TM, Güld MO, Thies C, Fischer B, Spitzer K, Keysers D, Ney H, Kohnen M, Schubert H,
Wein BB. Content-based image retrieval in medical applications. Methods of Information
in Medicine 2004; 43(4): 354-361.