FIRE – Flexible Image Retrieval Engine: ImageCLEF 2004 Evaluation

Thomas Deselaers, Daniel Keysers, Hermann Ney
Lehrstuhl für Informatik VI – Computer Science Department
RWTH Aachen University – D-52056 Aachen, Germany
{deselaers, keysers, ney}@cs.rwth-aachen.de

Abstract

In this paper we present FIRE, a content-based image retrieval system, and the methods we used in the ImageCLEF 2004 evaluation. In FIRE, different features are available to represent images. This diversity of available features allows the user to adapt the system to task-specific characteristics. A weighted combination of these features admits very flexible query formulations and helps in processing specific queries. For the ImageCLEF 2004 evaluation we used content-based methods only, and the experimental results compare favorably with those of other systems that make use of the textual information in addition to the images.

1 Introduction

Content-based image retrieval is an area of active research in the field of pattern analysis and image processing. The need for content-based techniques becomes obvious when considering the enormous amounts of digital images produced every day, e.g. by digital cameras or by digital imaging methods in medicine. The alternative of annotating large amounts of images manually is a very time-consuming task. Another very important aspect is that images can contain information that cannot be expressed precisely in textual annotation [17]. Thus, even the most complete annotation is useless if it does not contain the details that might be of importance to the actual users. The only way to solve these problems is to use fully automatic, content-based methods.

Several content-based image retrieval systems have been proposed. One of the first systems was the QBIC system [5]. Other popular research systems are BlobWorld [1], VIPER/GIFT [18], SIMBA [16], and SIMPLIcity [20].

In this work we present FIRE, a content-based image retrieval system, and the methods we used in the ImageCLEF 2004 evaluation. FIRE is easily extensible and offers a wide repertory of features and distance functions; this variety allows for assessing the performance of different features on different tasks. FIRE is freely available under the terms of the GNU General Public License (http://www-i6.informatik.rwth-aachen.de/~deselaers/fire.html).

2 Retrieval Techniques

In content-based image retrieval, images are searched by their appearance and not by textual annotations. Thus, images have to be represented by features and these features are compared to search for images similar to a given query image. In FIRE, each image is represented by a set of features. To find images similar to a given query image, the features from the images in the database are compared to those of the query image.

2.1 Query by Example

Query by example means that the system is given a query image Q and the goal is to find images from the database which are similar to the given query image. In FIRE, images are represented by features and compared using feature-specific distance measures. These distances are combined in a weighted sum:

D(Q, X) := \sum_{m=1}^{M} w_m \cdot d_m(Q_m, X_m)

where Q is the query image, X ∈ B is an image from the database B, Q_m and X_m are the m-th features of the images Q and X, respectively, d_m is the corresponding distance measure, and w_m is a weighting coefficient. For each d_m, \sum_{X \in B} d_m(Q_m, X_m) = 1 is enforced by re-normalization. The K database images with the lowest D(Q, X) are returned.
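The following minimal sketch illustrates this combination rule. It is an illustration only, not the actual FIRE implementation; the feature representations, the individual distance functions d_m, and the weights w_m are assumed to be given.

```python
import numpy as np

def combined_distance(query_feats, db_feats, dist_fns, weights):
    """Weighted combination of per-feature distances, D(Q, X).

    query_feats : list of M feature representations of the query image Q
    db_feats    : list of N database images, each a list of M feature representations
    dist_fns    : list of M distance functions d_m(q_m, x_m)
    weights     : list of M weighting coefficients w_m
    Returns an array with the N combined distances D(Q, X).
    """
    n_images = len(db_feats)
    D = np.zeros(n_images)
    for m in range(len(query_feats)):
        # distances of the m-th feature to every database image
        d_m = np.array([dist_fns[m](query_feats[m], x[m]) for x in db_feats],
                       dtype=float)
        # re-normalization: the distances of each feature sum to 1 over the database
        total = d_m.sum()
        if total > 0:
            d_m /= total
        D += weights[m] * d_m
    return D

# The K best matches are the K smallest entries of D, e.g.:
# ranking = np.argsort(combined_distance(q, db, dist_fns, w))[:K]
```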
2.2 Relevance Feedback

Relevance feedback is a widely used technique [13] that allows for good user interaction and easy query refinement. After a query has been processed, the user is presented a reasonably large set of results. From these, the user can select some images as relevant results Q^+ and some images as irrelevant results Q^- and requery the system with these sets. To process this query, we calculate scores

S(Q, X) = e^{-\gamma D(Q, X)}, with \gamma = 1.0,

for the images and combine these into one score

S(Q^+, Q^-, X) = \sum_{q \in Q^+} S(q, X) + \sum_{q \in Q^-} (1 - S(q, X)).

The set of the K images with the highest scores is returned. The interface used for relevance feedback is shown in Figure 1.

Figure 1: Interface for relevance feedback. The user is presented with the best matches from the database (top left is the query image) and can select for each image whether it is relevant, irrelevant or neutral.

2.3 Query Expansion

A frequent method for enhancing the query results is query expansion. In FIRE, query expansion is implemented as an "automatic relevance feedback" [13]. The user specifies a number of images G that he expects to be relevant after the first query. Then a query is processed in two steps: First the query is evaluated and the first G images are returned. These G images are automatically used as the set of relevant images Q^+ to requery the database, and the K best matches are returned.

3 Features and Associated Distance Measures

This section gives a short description of each of the features used in the FIRE image retrieval system for the ImageCLEF 2004 evaluation. Table 1 gives an overview of the features and associated distance measures.

3.1 Appearance-based Image Features

The most straightforward approach is to directly use the pixel values of the images as features. For example, the images might be scaled to a common size and compared using the Euclidean distance. In optical character recognition and for medical data, improved methods based on these image features usually obtain excellent results [9, 10, 11]. In this work, we used 32 × 32 and 32 × X (keeping the aspect ratio) versions of the images. The 32 × 32 images are compared using the Euclidean distance and the 32 × X images are compared using the image distortion model distance (IDM) [9]. IDM is a zero-order image comparison measure that allows local pixel displacements; local dependencies in the displacement grid are neglected. We consider a deformation grid (x_{11}^{IJ}, y_{11}^{IJ}) explaining an I × J image A = {a_{ij}}, i = 1, ..., I, j = 1, ..., J with an X × Y image B = {b_{xy}}, x = 1, ..., X, y = 1, ..., Y. Given this deformation grid, the distance between the aligned images is computed as

C(A, B, (x_{11}^{IJ}, y_{11}^{IJ})) = \sum_{i,j} \| a_{ij} - b_{x_{ij}, y_{ij}} \|^2

and the IDM distance is calculated as

D(A, B) = \min_{(x_{11}^{IJ}, y_{11}^{IJ})} C(A, B, (x_{11}^{IJ}, y_{11}^{IJ})),

taking into account a global warp range limiting the displacement range for the pixels. Instead of using only single pixels, here a local context of 3 × 3 pixels of the horizontal and vertical Sobel derivatives of the images is used [9].

3.2 Color Histograms

Color histograms are widely used in image retrieval [2, 5, 15, 17]. They are one of the most basic approaches; to demonstrate performance improvements, image retrieval systems are often compared to a system using only color histograms. The color space is partitioned and for each partition the pixels with a color within its range are counted, resulting in a representation of the relative frequencies of the occurring colors.
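A minimal sketch of this partitioning and counting is given below. The RGB color space and the uniform binning with a fixed number of bins per channel are illustrative assumptions and not necessarily the partitioning used in FIRE.

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Relative color frequencies over a uniformly partitioned RGB space.

    image: H x W x 3 array with channel values in 0..255.
    Returns a flat histogram with bins_per_channel**3 entries summing to 1.
    """
    pixels = image.reshape(-1, 3).astype(np.int64)
    # map each channel value to a bin index in 0..bins_per_channel-1
    idx = pixels * bins_per_channel // 256
    # combine the three channel indices into a single bin number per pixel
    flat = (idx[:, 0] * bins_per_channel + idx[:, 1]) * bins_per_channel + idx[:, 2]
    hist = np.bincount(flat, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()
```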
In accordance with [15], we use the Jeffrey divergence to compare histograms.

3.3 Invariant Feature Histograms

A feature is called invariant with respect to certain transformations if it does not change when these transformations are applied to the image. The transformations considered here are translation, rotation, and scaling. In this work, invariant feature histograms as presented in [16] are used. These features are based on the idea of constructing features invariant with respect to certain transformations by integrating over all considered transformations. The resulting histograms are compared using the Jeffrey divergence [15]. Previous experiments have shown that the characteristics of invariant feature histograms and color histograms are very similar and that invariant feature histograms often outperform color histograms [4]. Thus, color histograms are not used in this work.

3.4 Tamura Features

In [19] the authors propose six texture features corresponding to human visual perception: coarseness, contrast, directionality, line-likeness, regularity, and roughness. From experiments testing the significance of these features with respect to human perception, it was concluded that the first three features are very important. Thus, in our experiments we use coarseness, contrast, and directionality to create a histogram describing the texture [2] and compare these histograms using the Jeffrey divergence [15]. In the QBIC system [5], histograms of these features are used as well.

3.5 Global Texture Descriptor

In [2] a texture feature consisting of several parts is described:

• Fractal dimension measures the roughness or the crinkliness of a surface. In this work the fractal dimension is calculated using the reticular cell counting method [7].
• Coarseness characterizes the grain size of an image. Here it is calculated depending on the variance of the image.
• Entropy is used as a measure of unorderedness or information content in an image.
• The spatial gray-level difference statistics (SGLD) describe the brightness relationship of pixels within neighborhoods. This is also known as co-occurrence matrix analysis [8].
• The circular Moran autocorrelation function measures the roughness of the texture. For its calculation a set of autocorrelation functions is used [6].

3.6 Region-based Features

Another approach to representing images is based on finding image regions which roughly correspond to objects or parts of objects in the images. For this purpose, the image is segmented into regions. The task of segmentation has been thoroughly studied [14], but most of the algorithms are limited to special tasks because image segmentation is closely connected to understanding arbitrary images, a yet unsolved problem. Nevertheless, some image retrieval systems successfully use image segmentation techniques [1, 20]. We use the approach presented in [20] to compare region descriptions of images.

3.7 Color/Gray Binary Feature

Since the databases contain both color images and gray value images, an obvious feature is whether the image is a color or a gray valued image. This can be extracted easily by examining a reasonably large number of pixels in the image. If all of these pixels are gray valued, the image is considered to be a gray valued image, otherwise it is considered to be a color image. This feature can easily be tested for equality.
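A minimal sketch of such a test is shown below; the sample size and the tolerance parameter are illustrative assumptions, and the exact sampling strategy used in FIRE may differ.

```python
import numpy as np

def is_gray_image(image, n_samples=1000, tolerance=0, rng=None):
    """Binary color/gray feature: True if all sampled pixels are gray valued,
    i.e. their R, G, and B values coincide (up to the given tolerance)."""
    rng = np.random.default_rng() if rng is None else rng
    pixels = image.reshape(-1, 3).astype(int)
    # examine a reasonably large number of randomly chosen pixels
    sample = pixels[rng.integers(0, len(pixels), size=n_samples)]
    spread = sample.max(axis=1) - sample.min(axis=1)
    return bool((spread <= tolerance).all())
```

The resulting binary features of two images are then simply compared for equality.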
4 Submissions to the ImageCLEF 2004 Evaluation

The ImageCLEF 2004 evaluation covered three tasks: 1. the bilingual ad-hoc task using the St. Andrews database of historic photographs, 2. the medical retrieval task using the Casimage database of medical images, and 3. the interactive retrieval task using the St. Andrews database.

We participated in the bilingual ad-hoc task and the medical retrieval task. First, a set of features was extracted from the images of both databases and from the given query images. Table 1 gives an overview of the features extracted from the databases and the distance measures used to compare these features. The features extracted were chosen based on previous experiments with other databases [3, 4].

Table 1: Features extracted for the ImageCLEF 2004 evaluation and their associated distance measures.

number   feature                                                           associated distance measure
0        32 × 32 down-scaled version of the image                          Euclidean
1        32 × X down-scaled version of the image (keeping aspect ratio)    IDM
2        global texture descriptor                                         Euclidean
3        Tamura texture histogram                                          Jeffrey divergence
4        invariant feature histogram with monomial kernel                  Jeffrey divergence
5        invariant feature histogram with relational kernel                Jeffrey divergence
6        binary feature: color/gray                                        equal/not equal

4.1 Medical Retrieval Task

The medical retrieval task consisted of 26 query images for which similar images had to be retrieved from the Casimage database, a database of 8725 medical images from various medical domains. Along with the images, a set of 2078 text documents describing the medical cases is available; these were, however, not used in our system. Each of the images belongs to one of the cases, thus several images may belong to one case. In ImageCLEF 2004 it was possible to submit results to the medical retrieval task under different conditions:

• only visual retrieval
• query expansion, textual/visual
• manual feedback from the first 20 result images, visual
• manual feedback from the first 20 result images, visual/textual

We submitted results to the first three categories using visual information only. We did not make use of the textual data at all.

4.1.1 Fully Automatic Queries / Only Visual Retrieval

Fully automatic query means that the system is given the query image and has to return a list of the most similar images without any further user interaction. For this task we submitted three runs differing in the feature weightings used. The precise feature weightings are given in Table 2. The table clearly shows that the parameters optimized for this task outperformed the other parameter settings, and thus that optimizing the feature weightings in image retrieval for a given task improves the results. The feature weightings were chosen on the following basis:

• Use all available features equally weighted. This run can be seen as a baseline and is labelled with the run-tag i6-111111.
• Use the features in the combination that produces the best results on the IRMA database [12], labelled i6-020500.
• Use the features in a combination which was optimized towards the given task. See Section 4.1.4 on how we optimized the parameters towards this task. This run is labelled with the run-tag i6-025501.

Three example queries are given in Figure 2.

Figure 2: Three example queries with results from the fully automatic medical retrieval task.
Table 2: Different feature weightings and the mean average precision (MAP) from the ImageCLEF 2004 evaluation used for the medical retrieval task for the fully automatic runs.

              feature number
run-tag       0  1  2  3  4  5  6    MAP
i6-111111     1  1  1  1  1  1  1    0.2857
i6-020500     0  5  0  2  0  0  0    0.2665
i6-025501     5  5  0  2  1  0  0    0.3407

4.1.2 Fully Automatic Queries with Query Expansion

This task was similar to the fully automatic task. The system was given the query image only and could perform the query in two steps, but without any user interaction. This method is described in Section 2.3:

1. normal query
2. query expansion, i.e. use the query image and its first nearest neighbor to requery the database.

We decided to use this method after we observed that for most query images the best match is a relevant one. In our opinion, this method slightly enhances the retrieval results; in the ImageCLEF 2004 evaluation, however, the results are worse than those of the single-pass runs in two of three cases. In Table 3 the results for these runs are given in comparison to the fully automatic runs without query expansion described in Section 4.1.1. For these experiments the same three settings were used with and without query expansion. The fact that the results deteriorate (against our expectation) might be explained by missing medical relevance of the first query result. Another reason might be that we only looked at the first 20 to 30 results, but for the evaluation the first 1000 results were assessed.

Table 3: Results from the ImageCLEF evaluation for the experiments with query expansion in comparison to the fully automatic runs.

run-tag          fully automatic    with query expansion
i6(qe)-020500    0.2665             0.3115
i6(qe)-025501    0.3407             0.3323
i6(qe)-111111    0.2857             0.2495

4.1.3 Queries with Relevance Feedback

In the runs described in the following, relevance feedback was used. The system was queried with the given query image and the user was presented the 20 most similar images from the database. Then the user marked one or more of the images presented (including the query image) as relevant, irrelevant or neutral. The sets of relevant and irrelevant images were then used to requery the system as described in Section 2.2. Although in some scenarios several steps of relevance feedback might be useful, here only one step of query refinement was used. As user interaction was involved, a fast system was desirable. To allow for faster retrieval, the image distortion model was not used for the comparison of images. The feature weighting used is given in Table 4. The mean average precision of 0.3437 reached here is slightly better than in the best of the fully automatic runs (0.3407).

Table 4: Feature weighting used for the experiments with relevance feedback in the medical retrieval task.

           feature number
run-tag    0   1  2  3  4  5  6    MAP
i6-rfb1    10  0  0  2  1  0  0    0.3437

4.1.4 Manual Selection

To find a good set of parameters for this task, we performed some manual experiments. To be able to compare different parameter sets, we manually created relevance estimates for some of the images. These relevance estimates were submitted as "human visual system (of a computer scientist)". The experiments were carried out as follows:

1. Start with an initial feature weighting.
2. Query the database with all query images using this weighting.
3. Present the first 30 results for each query image to the user.
4. The user marks all images as either relevant or irrelevant. The system calculates the total number of relevant results.
5. Slightly change the weighting and go back to 2.

We performed experiments to assess the quality of particular features, i.e. we used only one feature at a time (cf. Table 5).
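The quantity reported in Tables 5 and 6 is the precision among the first 30 results. A small sketch of how such a score could be computed from the manual relevance judgments is given below; the helper names are hypothetical and do not correspond to functions of the FIRE system.

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of relevant images among the first k results of one query."""
    top = ranked_ids[:k]
    return sum(1 for image_id in top if image_id in relevant_ids) / float(len(top))

def mean_precision_at_k(results_per_query, relevance_per_query, k=30):
    """Average precision@k over all query images for one feature weighting."""
    values = [precision_at_k(results_per_query[q], relevance_per_query[q], k)
              for q in results_per_query]
    return sum(values) / len(values)
```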
With this information in mind, we started to combine different features. First we tried using all features with identical weights as well as the setting which had proven best on the IRMA task. Then we modified these settings to improve the results. In this way we could approximately assess the quality of the results for different settings. We tried 11 different settings in total. The complete results of these experiments are given in Table 6. The mean average precision for this manually assessed run is very low (0.279) because not enough images were viewed and assessed, and thus the number of returned images was far too small. On average, only 53 results were returned, with a minimum of 6 and a maximum of 142 images for the 26 queries.

Table 5: The subjective performance of particular features on the medical retrieval task.

feature number                      0     1     2     3     4     5     6
precision of the first 30 images    0.55  0.44  0.31  0.54  0.40  0.36  0.03

Table 6: Effect of various feature combinations on precision.

      feature number           precision of the
0   1   2  3  4    5  6        first 30 results
1   1   1  1  1    1  1        0.60
0   5   0  2  0    0  0        0.65
0   5   0  2  2    0  0        0.61
0   10  0  2  2    0  0        0.63
0   5   0  2  0    2  0        0.59
10  0   0  2  2    0  0        0.65
0   10  0  2  0.5  0  0        0.63
5   0   0  2  0    0  0        0.65
0   10  0  2  1    0  0        0.65
5   5   0  2  1    0  0        0.67
10  0   0  2  0.5  0  0        0.65

Table 7: Different feature weightings used for the bilingual retrieval task for the fully automatic runs and the run with relevance feedback.

             feature number
run-tag      0  1  2  3  4  5  6    MAP
i6-111111    1  1  1  1  1  1  0    0.0859
i6-010012    0  0  0  1  2  1  0    0.0773
i6-010101    0  1  0  1  1  0  0    0.0859
i6-rfb1      0  0  0  1  1  0  0    0.0839

4.2 Bilingual Retrieval Task

The bilingual retrieval task consisted of 25 queries given as a short textual description in several languages, a slightly longer textual description in English, and an example image fitting the query. In our system we only used the 25 example images to query the database. The database is the St. Andrews Image Collection consisting of approximately 30 000 images. In the experiments described in the following, the provided example images were used to query the database. Unfortunately, no other group participated in this track.

4.2.1 Fully Automatic Queries

Here, the given example images were used to query the database. Different feature weightings were used:

1. equal weight for each feature (run-tag i6-111111),
2. two weightings which have proven to work well for general-purpose photographs [2] (run-tags i6-010012 and i6-010101).

The exact weightings are given in Table 7 together with the results from the ImageCLEF 2004 evaluation. A look at the query topics clearly showed that pure content-based image retrieval would not be able to deliver satisfactory results, as queries like "Portrait pictures of church ministers by Thomas Rodger" cannot be processed by image content alone (church ministers do not differ in their appearance from any other person, and it is usually not possible to see from an image who took it). The mean average precision values clearly show that visual information alone is not sufficient to obtain good results, although the query results are visually quite promising, as shown in Figure 3. Since this task was rather futile for purely visual retrieval, we did not focus on it.

Figure 3: Query results for the bilingual retrieval task for three different queries using only visual information.

4.2.2 Queries with Relevance Feedback

Using the feature weighting given in Table 7, i.e. the row labelled i6-rfb1, we submitted one run using relevance feedback for this task.
No improvement can be seen: a mean average precision of 0.0839 was measured. This is even worse than the best of the fully automatic runs.

5 Summary of the Evaluation

In this section, our results are compared to the results of other groups in the ImageCLEF 2004 evaluation. For the medical retrieval task, the results of the evaluation (cf. Table 8) show that the methods presented here compare favorably with the other systems. There are three better systems; however, the differences are very small and it is not yet clear to what extent the other systems used the textual information in addition to the images. For the bilingual retrieval task, the comparison with the other systems seems to show that the textual information is very important. Note that all results presented in this paper were obtained using visual information only. Furthermore, the results in the medical retrieval task show that suitable selection and weighting of the features used improves the results strongly. The optimization here is not to be seen as "training on the testing data" as only a few different settings were compared. In the future, it is planned to extend the FIRE system to be able to use textual features in the retrieval process.

Table 8: Exemplary results (mean average precision, MAP) from the ImageCLEF 2004 evaluation for the fully automatic runs in the medical retrieval task.

run-tag                        i6-025501   i6-qe025501
MAP      0.35   0.35   0.35    0.34        0.33          0.32   ...   0.27   ...   0.18

References

[1] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A System for Region-Based Image Indexing and Retrieval. In International Conference on Visual Information Systems, Springer Verlag, Amsterdam, The Netherlands, pages 509–516, June 1999.

[2] T. Deselaers. Features for Image Retrieval. Diploma thesis, Lehrstuhl für Informatik VI, RWTH Aachen University, Aachen, Germany, December 2003.

[3] T. Deselaers, D. Keysers, and H. Ney. Classification Error Rate for Quantitative Evaluation of Content-based Image Retrieval Systems. In International Conference on Pattern Recognition, Cambridge, UK, August 2004. In press.

[4] T. Deselaers, D. Keysers, and H. Ney. Features for Image Retrieval – A Quantitative Comparison. In DAGM 2004, Pattern Recognition, 26th DAGM Symposium, LNCS, Tübingen, Germany, September 2004. In press.

[5] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and Effective Querying by Image Content. Journal of Intelligent Information Systems, 3(3/4):231–262, July 1994.

[6] Z. Q. Gu, C. N. Duncan, E. Renshaw, M. A. Mugglestone, C. F. N. Cowan, and P. M. Grant. Comparison of Techniques for Measuring Cloud Texture in Remotely Sensed Satellite Meteorological Image Data. Radar and Signal Processing, 136(5):236–248, October 1989.

[7] P. Haberäcker. Praxis der Digitalen Bildverarbeitung und Mustererkennung. Carl Hanser Verlag, München, Wien, 1995.

[8] R. M. Haralick, K. Shanmugam, and I. Dinstein. Texture Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics, 3(6):610–621, November 1973.

[9] D. Keysers, C. Gollan, and H. Ney. Classification of Medical Images using Non-linear Distortion Models. In Bildverarbeitung für die Medizin, Berlin, Germany, pages 366–370, March 2004.

[10] D. Keysers, C. Gollan, and H. Ney. Local Context in Non-linear Deformation Models for Handwritten Character Recognition.
In International Conference on Pattern Recognition, Cambridge, UK, August 2004. In press.

[11] D. Keysers, W. Macherey, H. Ney, and J. Dahmen. Adaptation in Statistical Pattern Recognition using Tangent Vectors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):269–274, February 2004.

[12] T. Lehmann, M. Güld, C. Thies, B. Fischer, K. Spitzer, D. Keysers, H. Ney, M. Kohnen, H. Schubert, and B. Wein. The IRMA Project – A State of the Art Report on Content-Based Image Retrieval in Medical Applications. In Korea-Germany Joint Workshop on Advanced Medical Image Processing, Seoul, Korea, pages 161–171, October 2003.

[13] H. Müller, W. Müller, S. Marchand-Maillet, and D. McG. Squire. Strategies for Positive and Negative Relevance Feedback in Image Retrieval. In International Conference on Pattern Recognition, volume 1, Barcelona, Spain, pages 1043–1046, September 2000.

[14] N. R. Pal and S. K. Pal. A Review on Image Segmentation Techniques. Pattern Recognition, 26(9):1277–1294, November 1993.

[15] J. Puzicha, Y. Rubner, C. Tomasi, and J. Buhmann. Empirical Evaluation of Dissimilarity Measures for Color and Texture. In International Conference on Computer Vision, volume 2, Corfu, Greece, pages 1165–1173, September 1999.

[16] S. Siggelkow. Feature Histograms for Content-Based Image Retrieval. PhD thesis, University of Freiburg, Institute for Computer Science, Freiburg, Germany, 2002.

[17] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-Based Image Retrieval: The End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, December 2000.

[18] D. M. Squire, W. Müller, H. Müller, and J. Raki. Content-Based Query of Image Databases, Inspirations from Text Retrieval: Inverted Files, Frequency-Based Weights and Relevance Feedback. In Scandinavian Conference on Image Analysis, Kangerlussuaq, Greenland, pages 143–149, June 1999.

[19] H. Tamura, S. Mori, and T. Yamawaki. Textural Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man, and Cybernetics, 8(6):460–472, June 1978.

[20] J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, September 2001.