=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-ImageCLEF-GuldEt2007
|storemode=property
|title=Baseline Results for the CLEF 2007 Medical Automatic Annotation Task
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-ImageCLEF-GuldEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/GuldD07a
}}
==Baseline Results for the CLEF 2007 Medical Automatic Annotation Task==
Mark O. Güld, Thomas M. Deserno

Department of Medical Informatics, RWTH Aachen University, Aachen, Germany

mgueld@mi.rwth-aachen.de, deserno@ieee.org

==Abstract==

This paper provides baseline results for the medical automatic annotation task of CLEF 2007. To this end, the algorithms initially used for the corresponding tasks in 2005 and 2006 are applied with the same parameterization. Three classifiers based on global image features are used and combined within a nearest-neighbor approach. In 2007, a hierarchical code is introduced to describe the image contents, and the evaluation scheme allows a finer granularity in measuring classification accuracy. We therefore evaluate several techniques for estimating the confidence of the classifier decision, which stop or alter the reported code at levels where the decision is uncertain.

==Categories and Subject Descriptors==

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages, Query Languages

==General Terms==

Measurement, Performance, Experimentation

==Keywords==

Content-based image retrieval, pattern recognition, classifier combination

==1 Introduction==

This paper provides baseline results for the medical automatic annotation challenge of CLEF 2007. By using the same methods with the same parameterization, the obtained results allow a rough estimate of the complexity of the 2005, 2006, and 2007 annotation tasks relative to each other. In 2007, the evaluation scheme addresses the hierarchical structure of the IRMA code [1] by allowing the classifier to decide "don't know" at any level of the code, independently for each of the four axes [2]. To effectively support this scheme, models which estimate the classifier's confidence in its decision are required.

==2 Methods==

===2.1 Features, classifiers, and their combination===

The image content is described by global features, i.e. the intensity information of the images is drastically reduced to a few numerical values [3].

Tamura et al. proposed histograms of coarseness, directionality, and contrast to capture texture properties [4]. The histograms have 6 × 8 × 8 = 384 bins and are compared using the Jensen-Shannon divergence as a distance measure. This approach is denoted as Tamura texture measures (TTM).

By using down-scaled representations of the original images, a priori knowledge about common variabilities can be integrated into the distance measure. Here, the cross-correlation function (CCF) is used to measure similarity between 32 × 32 representations. It is robust to global translations (by using a 9 × 9 translation window) and varying radiation dose (by normalizing the intensity values).

The image distortion model (IDM) captures local deformations by allowing pixel warping within a neighborhood [5]. It uses X × 32 representations, a 5 × 5 search window, 3 × 3 contexts, gradients (instead of intensities), and a distance threshold.

The three classifiers are combined within a nearest-neighbor classifier. The overall distance between a sample q and a reference r is computed as

d_c(q, r) = λ_IDM · d'_IDM(q, r) + λ_CCF · d'_CCF(q, r) + λ_TTM · d'_TTM(q, r)

where d'(q, r) denotes a normalized distance between q and r. Normalization is done by dividing the individual distance by the sum of distances between the sample q and all references. For CCF, distances are obtained by transforming the similarity s into d = 1 − s. The nearest-neighbor classifier then decides based on the class information of the k best references.
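The combination rule above maps directly to a few lines of code. The following Python sketch is ours, not the authors': the per-feature distances from the sample to all N references are assumed to be precomputed, and all function names are hypothetical; only the λ weights (reported in Section 3), the normalization by the sum over all references, and the CCF transform d = 1 − s follow the description above.

<syntaxhighlight lang="python">
import numpy as np

# lambda weights as reported in Section 3 of the paper
WEIGHTS = {"idm": 0.42, "ccf": 0.18, "ttm": 0.40}

def combined_distance(d_idm, s_ccf, d_ttm):
    """Combine per-feature distances into d_c(q, r) for all references.

    d_idm, d_ttm -- arrays of IDM/TTM distances from the sample q to the
                    N references.
    s_ccf        -- array of CCF similarities, converted via d = 1 - s.
    """
    d_ccf = 1.0 - np.asarray(s_ccf, dtype=float)
    parts = {"idm": np.asarray(d_idm, dtype=float),
             "ccf": d_ccf,
             "ttm": np.asarray(d_ttm, dtype=float)}
    total = np.zeros_like(parts["idm"])
    for name, d in parts.items():
        # normalize each distance by its sum over all references
        total += WEIGHTS[name] * d / d.sum()
    return total

def majority_vote(d_c, labels, k=5):
    """Decide by majority vote over the k nearest references."""
    nearest = np.argsort(d_c)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
</syntaxhighlight>

Normalizing each feature's distances over the full reference set puts the three measures on comparable scales, so the λ weights express the relative importance of the features rather than their raw distance ranges.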
===2.2 Code hierarchy and confidence===

To address the modified evaluation scheme of the 2007 challenge, the nearest-neighbor decision rule is modified. Three options are implemented (illustrative sketches follow after the list):

1. From the k neighbors, a common code is generated by setting differing parts (and their subparts) to don't know, e.g. two neighbors with codes 1121-120-434-700 and 1121-12f-466-700 result in a common code of 1121-12X-4XX-700.

2. For the k neighbors, a threshold t_d is applied to the majority vote decision. If the distance of the best neighbor from the decided class is greater than t_d, the decision is rejected, i.e. the reported code is XXXX-XXX-XXX-XXX.

3. A threshold t_n is applied to the k neighbors. A neighbor is excluded from the decision if its distance is greater than t_n. For the remaining neighbors, majority vote is used to obtain the decision. If all neighbors are excluded, XXXX-XXX-XXX-XXX is reported.

To keep the number of combinations at bay, only the combinations 1+2 and 1+3 are evaluated.
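Option 1 can be read as a character-wise merge over the four hyphen-separated code axes. The sketch below is our interpretation of the rule, not the authors' code; it reproduces the example given in the list above.

<syntaxhighlight lang="python">
def common_code(codes):
    """Merge neighbor codes into one common IRMA code (option 1).

    Within each of the four hyphen-separated axes, the first position
    at which the neighbors disagree and all following positions (the
    subparts) are set to the don't-know symbol 'X'.
    """
    merged = []
    for axis in zip(*(c.split("-") for c in codes)):
        chars = []
        for column in zip(*axis):
            if len(set(column)) == 1:
                chars.append(column[0])
            else:
                # differing part: this position and its subparts -> 'X'
                chars += ["X"] * (len(axis[0]) - len(chars))
                break
        merged.append("".join(chars))
    return "-".join(merged)

# reproduces the example from the text
assert common_code(["1121-120-434-700", "1121-12f-466-700"]) == "1121-12X-4XX-700"
</syntaxhighlight>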
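Options 2 and 3 amount to two rejection tests around the majority vote. Again a hedged sketch under our own naming: d_c is the combined distance from Section 2.1, codes holds the IRMA codes of the references, and the thresholds t_d and t_n would in practice be taken from the sorted development-set distances described in Section 3.

<syntaxhighlight lang="python">
import numpy as np

REJECT = "XXXX-XXX-XXX-XXX"  # full don't-know code

def decide_with_thresholds(d_c, codes, k=5, t_d=None, t_n=None):
    """Nearest-neighbor decision with the rejection rules (options 2 and 3).

    t_n -- neighbors with distance > t_n are excluded before the vote;
           if none remain, the decision is rejected (option 3).
    t_d -- if the best neighbor of the winning class is farther away
           than t_d, the decision is rejected (option 2).
    """
    order = list(np.argsort(d_c)[:k])
    if t_n is not None:
        order = [i for i in order if d_c[i] <= t_n]
        if not order:
            return REJECT
    votes = [codes[i] for i in order]
    winner = max(set(votes), key=votes.count)
    if t_d is not None:
        best = min(d_c[i] for i in order if codes[i] == winner)
        if best > t_d:
            return REJECT
    return winner
</syntaxhighlight>

For the combinations 1+2 and 1+3 evaluated below, the winning votes would additionally be merged via common_code() instead of reporting the plain majority class.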
==3 Results==

The classifier weights are the same as in 2005 and 2006: λ_IDM = 0.42, λ_CCF = 0.18, λ_TTM = 0.40. To obtain estimates for t_d and t_n, the development set is used: the classification results for this set are inspected, and the samples are sorted by the best distance to a neighbor from the decided class. Based on this sorted list, the thresholds are chosen from the 1st, 5th, 10th, 25th, and 50th worst distances encountered. Both thresholds are evaluated first with the normal majority vote decision rule and afterwards with the policy of reporting only the common code parts.

The evaluation is done using the scheme described in [2]. For each image from the test set, an error value e ∈ [0, 1] is obtained, based on the position of the classification errors in the hierarchy. The overall value is obtained by summation over all 1,000 test images. Constantly answering don't know yields a value of 500.0; the worst possible value is 1,000.0.

Results for the application of the decision threshold t_d are summarized in Tab. 1.

{| class="wikitable"
|+ Table 1: Results for decision threshold t_d. The results in the column "no t_d" are obtained without the threshold.
! decision !! k !! no t_d !! t_d (1st) !! t_d (5th) !! t_d (10th) !! t_d (25th) !! t_d (50th)
|-
| majority vote || 1 || 51.29 || 51.34 || 52.06 || 56.32 || 59.26 || 71.46
|-
| majority vote || 5 || 52.54 || 52.82 || 56.90 || 62.94 || 72.69 || 101.35
|-
| common code || 5 || 80.47 || 80.77 || 81.45 || 84.49 || 86.59 || 97.19
|}

The neighbor threshold t_n is used in combination with k ∈ {1, 5, 10, 25, 50, 100}, because otherwise the number of considered neighbors would be so high that small classes are never reported. Tab. 2 contains the results for the application of t_n.

{| class="wikitable"
|+ Table 2: Results for neighbor threshold t_n. The results for k = 1 can be found in Tab. 1 as well.
! rowspan="2" | k !! colspan="5" | t_n (index), majority vote !! colspan="5" | t_n (index), common code
|-
! 1st !! 5th !! 10th !! 25th !! 50th !! 1st !! 5th !! 10th !! 25th !! 50th
|-
| 1 || 51.34 || 52.06 || 56.32 || 59.26 || 71.46 || 51.34 || 52.06 || 56.32 || 59.26 || 71.46
|-
| 5 || 52.25 || 53.32 || 56.10 || 57.88 || 70.89 || 80.51 || 80.24 || 83.79 || 83.06 || 93.20
|-
| 10 || 54.45 || 55.38 || 59.21 || 61.24 || 72.51 || 110.65 || 109.90 || 111.76 || 109.03 || 115.11
|-
| 25 || 62.78 || 62.56 || 66.91 || 68.82 || 79.72 || 161.10 || 156.95 || 154.27 || 147.81 || 147.69
|-
| 50 || 87.50 || 82.81 || 83.84 || 80.60 || 86.59 || 201.71 || 193.63 || 186.34 || 176.36 || 166.44
|-
| 100 || 114.85 || 104.59 || 101.79 || 94.77 || 94.88 || 236.91 || 225.93 || 213.50 || 197.61 || 179.73
|}

For comparison with the medical automatic annotation tasks of the previous years, Tab. 3 contains the baseline error rates.

{| class="wikitable"
|+ Table 3: Error rates for the medical automatic annotation task.
! year !! references !! classes !! error rate (k = 1) !! error rate (k = 5)
|-
| 2005 || 9,000 || 57 || 13.3% || 14.8%
|-
| 2006 || 10,000 || 116 || 21.7% || 22.0%
|-
| 2007 || 11,000 || 116 || 20.0% || 18.0%
|}

==4 Discussion==

The proposed mechanisms for estimating the confidence of the classifier decision and for modifying the reported codes do not improve on the baseline results of 51.29 for 1-NN and 52.54 (rank 18) for 5-NN. This seems to have been observed by the other groups as well [2]. In our case, the common code policy generally performs worse than the majority vote decision. The results become drastically worse for bad parameter sets, especially when the number of considered neighbors is too high. Comparing the baseline error rates to those of the previous years, the medical automatic annotation task of 2007 is somewhat easier than that of 2006. This can be taken into account when comparing methods of groups which participated in only one of the past years.

==References==

[1] Lehmann TM, Schubert H, Keysers D, Kohnen M, Wein BB. The IRMA code for unique classification of medical images. Procs SPIE 2003; 5033: 109-117.

[2] Müller H, Deselaers T, Kim E, Kalpathy-Cramer J, Deserno TM. Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks. Working Notes of the 2007 CLEF Workshop (in this book).

[3] Lehmann TM, Güld MO, Thies C, Fischer B, Spitzer K, Keysers D, Ney H, Kohnen M, Schubert H, Wein BB. Content-based image retrieval in medical applications. Methods of Information in Medicine 2004; 43(4): 354-361.

[4] Tamura H, Mori S, Yamawaki T. Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics 1978; 8(6): 460-473.

[5] Keysers D, Dahmen J, Ney H, Wein BB, Lehmann TM. A statistical framework for model-based image retrieval in medical applications. Journal of Electronic Imaging 2003; 12(1): 59-68.