
Overview of the CLEF 2009 medical image annotation track

Tatiana Tommasi (1), Barbara Caputo (1), Petra Welter (2), Mark Oliver Güld (2), Thomas M. Deserno (2)

(1) Idiap Research Institute, Martigny, Switzerland
(2) RWTH Aachen University, Dept. of Medical Informatics, Aachen, Germany

bcaputo@idiap.ch, tdeserno@mi.rwth-aachen.de

General Terms: Measurement, Performance, Experimentation

This paper describes the last round of the medical image annotation task in ImageCLEF 2009. After four years, we defined the task as a survey of all the past experience. Seven groups participated in the challenge, submitting 19 runs. They were asked to train their algorithms on 12,677 images, labeled according to four different settings representing the yearly annotation tasks, and to classify 1,733 images in the four annotation frameworks. The aim is to understand how each strategy responds to the increasing number of classes and to class imbalance. A plain classification scheme using support vector machines and local descriptors outperformed the other methods.

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

Introduction

In 2005, the medical image annotation task was introduced in the ImageCLEF challenge. Its main contribution was to provide a resource for benchmarking content-based image classification systems focusing on medical images. Hospitals collect hundreds of images every day, and automatic image annotation can be an important step when searching for images in huge databases [8, 4]. Automatic techniques able to identify the acquisition modality, body orientation, body region, and biological system examined could be used for multilingual image annotation as well as for DICOM header correction in routine medical image acquisition.

Over the last four years, the medical annotation task evolved in terms of the number of images, the number of classes, and the class structure provided. It started as a plain 60-class problem [2], grew into a 120-class problem [7], and became a complex hierarchical classification task in 2007 [6]. In 2008, class imbalance was added to foster the use of prior knowledge encoded in the class hierarchy [1].

This year we celebrate the fifth anniversary of the medical image annotation task, and we decided to organize its concluding round as a survey of the past years' experience. The idea is to compare the scalability of different image classification techniques as the number of classes grows, the hierarchical structure becomes more complex, and sparsely populated classes appear.

Database and Task Description

As in past editions of the challenge, the annotation task was defined on the basis of the IRMA project. This year, a database of 12,677 fully classified radiographs, taken randomly from medical routine, was made available as the training set. Images are labelled according to four classification label sets:
- 57 classes as in 2005 (12,631 images) + a "clutter" class C (46 images);
- 116 classes as in 2006 (12,334 images) + a "clutter" class C (343 images);
- 116 IRMA codes as in 2007 (12,334 images) + a "clutter" class C (343 images);
- 193 IRMA codes as in 2008 (12,677 images).

For the first two label settings, images are associated with plain numeric labels, while in the last two label settings, images are identified by their complete IRMA code (see the IRMA Code section). The labels 1-57 used for the first setting are derived from a high-level identification of images in terms of the IRMA code. By annotating images in more detail and introducing some new classes, we pass to 116 and then to 193 classes. The "clutter" class for a specific setting contains all the images that belong to new classes, or that are described with a higher level of detail, in the final 2008 setting.

The test data consisted of 1,733 images. Not all the training classes have examples in this set:
- 2005 labelling: 55 classes (out of 57) with 1,639 images + class C with 94 images;
- 2006 labelling: 109 classes (out of 116) with 1,353 images + class C with 380 images;
- 2007 labelling: 109 IRMA codes (out of 116) with 1,353 images + class C with 380 images;
- 2008 labelling: 169 IRMA codes (out of 193) with 1,733 images.

Note the distribution of the images over the classes of the training set: for 2005, 2006, and 2007, classes have more than six images, while in 2008 there are classes with only one to five images. Concerning the 2008 labels, 20% of the test images belong to classes that are poorly represented in the training data (classes with fewer than ten images).

Participants in the medical annotation task were asked to classify the test images according to all four label settings. Each group was allowed to submit several runs, but each run had to be based on a single algorithm, optimized to face the four different classification problems. The aim is to understand how each algorithm responds to the increasing number of classes and to class imbalance. The classification results are considered per year, and the error scores are summed to obtain a single final value for ranking the submitted runs.

IRMA Code

Standardized nomenclatures for medical imaging are generally coarsely structured, ambiguous, and often rely on optional tags. For content-based image retrieval (CBIR) and annotation in the medical field, a detailed, unambiguous coding scheme is required. Valid relations between code and sub-code elements could be "is-a" and "part-of", defining a strict hierarchical order. Causality is also important for grouping processing strategies. Therefore, a mono-hierarchical scheme is required, where each sub-code element is connected to only one code element. Since the categorization of medical images must cover all aspects influencing the image content and structure, a multi-axial scheme is needed [5].

The IRMA project website is http://ganymed.imib.rwth-aachen.de/irma/index_en.php.

The IRMA code strictly relies on these rules. It is composed of four axes having three to four positions, each in {0, ..., 9, a, ..., z}, where "0" denotes "unspecified" and marks the end of a path along an axis:
- the technical code (T) describes the image modality;
- the directional code (D) models body orientations;
- the anatomical code (A) refers to the body region examined;
- the biological code (B) describes the biological system examined.

This results in a string of 13 characters (IRMA: TTTT-DDD-AAA-BBB). A small exemplary excerpt from the anatomy axis of the IRMA code is given in Table 1. The IRMA code can easily be extended by introducing new characters at a certain code position, e.g., if new image modalities are introduced. Based on the hierarchy, the more code positions differ from "0", the more detailed the description is.
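The code layout described above can be illustrated with a short parsing sketch. This is a minimal example rather than part of the IRMA tooling: the function names, the `IrmaCode` type, and the example string are assumptions made for illustration, and the depth computation assumes that the first "0" on an axis ends the specified path.

```python
from typing import NamedTuple

# Allowed characters per position: 0-9 and a-z (as stated in the text).
ALPHABET = set("0123456789abcdefghijklmnopqrstuvwxyz")
# Axis name and number of positions, following TTTT-DDD-AAA-BBB.
AXIS_LENGTHS = (("T", 4), ("D", 3), ("A", 3), ("B", 3))


class IrmaCode(NamedTuple):
    """Hypothetical container for the four axes of an IRMA code."""
    technical: str
    directional: str
    anatomical: str
    biological: str


def parse_irma_code(code: str) -> IrmaCode:
    """Split a TTTT-DDD-AAA-BBB string into its four axes and validate it."""
    parts = code.strip().lower().split("-")
    if len(parts) != len(AXIS_LENGTHS):
        raise ValueError(f"expected 4 axes, got {len(parts)}: {code!r}")
    for (name, length), part in zip(AXIS_LENGTHS, parts):
        if len(part) != length or not set(part) <= ALPHABET:
            raise ValueError(f"invalid {name} axis: {part!r}")
    return IrmaCode(*parts)


def specified_depth(axis: str) -> int:
    """Count the leading positions before the first '0' (assumed terminator)."""
    depth = 0
    for ch in axis:
        if ch == "0":
            break
        depth += 1
    return depth


# Illustrative code string, not taken from the released database.
example = parse_irma_code("1121-127-700-500")
```

Here `specified_depth(example.anatomical)` evaluates to 1, reflecting that only the first position of the anatomical axis is specified in this illustrative code.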

Error Evaluation

We now define the error score for the medical image annotation challenge. On the basis of the image labeling, we defined two different evaluation strategies.

2005 and 2006. For these two years, the error is evaluated just on the capability of the algorithm to make the correct decision. There is also the possibility to say "don't know", which is encoded by "*". An example is given in Table 2.

2007 and 2008. For these two years, the error is evaluated on the basis of the hierarchical IRMA code.

Let an image be coded by the technical, directional, anatomical, and biological axes. These axes are independent, and therefore the errors for the individual axes are simply summed up. For one axis:
- let $l_1^I = l_1, l_2, \ldots, l_i, \ldots, l_I$ be the correct code of an image;
- let $\hat{l}_1^I = \hat{l}_1, \hat{l}_2, \ldots, \hat{l}_i, \ldots, \hat{l}_I$ be the classified code of an image;

where $l_i$ is specified precisely for every position, while $\hat{l}_i$ is allowed to be "don't know", which is encoded by "*". Note that $I$ (the depth of the tree to which the classification is specified) may be different for different axes and different images.

Given an incorrect classification at position $\hat{l}_i$, we consider all succeeding decisions to be wrong, and given an unspecified position, we consider all succeeding decisions to be unspecified. Furthermore, we do not count any error if the correct code is unspecified and the predicted code is a wildcard. In that case, we consider all remaining positions to be unspecified.

We want to penalize wrong decisions that are easy (fewer possible choices at that node) more than wrong decisions that are difficult (many possible choices at that node). A decision at position $l_i$ is correct by chance with probability $1/b_i$, where $b_i$ is the number of possible labels for position $i$. This assumes equal priors for each class at each position.

Furthermore, we want to penalize wrong decisions at an early stage in the code (higher up in the hierarchy) more than wrong decisions at a later stage in the code (lower down in the hierarchy), i.e., $l_i$ is more important than $l_{i+1}$.

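Putting these together:

$$\sum_{i=1}^{I} \underbrace{\frac{1}{b_i}}_{(a)} \, \underbrace{\frac{1}{i}}_{(b)} \, \underbrace{\delta(l_i, \hat{l}_i)}_{(c)} \tag{1}$$

with

$$\delta(l_i, \hat{l}_i) = \begin{cases} 0 & \text{if } l_j = \hat{l}_j \;\; \forall j \le i \\ 0.5 & \text{if } \hat{l}_j = * \;\; \exists j \le i \\ 1 & \text{if } l_j \ne \hat{l}_j \;\; \exists j \le i \end{cases} \tag{2}$$

where the parts of the equation: (a) accounts for the difficulty of the decision at position $i$ (branching factor); (b) accounts for the level in the hierarchy (position in the string); (c) is 0, 0.5, or 1 for correct, unspecified, or wrong, respectively.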

In addition, for every axis, the maximal possible error is calculated and the errors are normalized such that a completely wrong decision (i.e., all positions for that axis wrong) gets an error count of 0.25 and a completely correctly predicted axis has an error of 0.00. Thus, an image where all positions in all axes are wrong has an error count of 1.00, and an image where all positions in all axes are correct has an error count of 0.00. Finally, setting a wildcard "*" instead of a "0" is not considered a mistake (see Table 3).
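The per-axis error described in this section can be sketched in a few lines. This is an illustrative reading of the rules, not the official evaluation script: the function names are hypothetical, the branching factors $b_i$ must be supplied by the caller (the values used in any example are made up), and the first wildcard or mismatch encountered is assumed to determine the penalty of all later positions.

```python
from typing import List, Sequence

WILDCARD = "*"      # "don't know"
UNSPECIFIED = "0"   # end of the path along an axis


def position_penalties(correct: str, predicted: str) -> List[float]:
    """Per-position penalty: 0 correct, 0.5 not specified, 1 wrong."""
    if len(correct) != len(predicted):
        raise ValueError("codes must have the same length")
    penalties = []
    state = 0.0  # 0.0 = all correct so far; 0.5 or 1.0 once set, it stays
    for c, p in zip(correct, predicted):
        if state == 0.0:
            if p == WILDCARD:
                state = 0.5
            elif p != c:
                state = 1.0
        if state == 0.5 and c == UNSPECIFIED:
            # A wildcard in place of "0" is not counted as a mistake.
            penalties.append(0.0)
        else:
            penalties.append(state)
    return penalties


def axis_error(correct: str, predicted: str, branching: Sequence[int]) -> float:
    """Normalized error for one axis, in [0, 0.25].

    branching[i] is the assumed number of possible labels at position i+1.
    """
    if len(branching) != len(correct):
        raise ValueError("one branching factor per code position is required")
    penalties = position_penalties(correct, predicted)
    # Weight 1 / (b_i * i): easy and early decisions cost more.
    weights = [1.0 / (b * i) for i, b in enumerate(branching, start=1)]
    raw = sum(w * d for w, d in zip(weights, penalties))
    worst = sum(weights)  # error when every position is wrong
    return 0.25 * raw / worst
```

Under these assumptions, the error of an image is the sum of `axis_error` over its four axes, so an image with all axes completely wrong scores 1.0 and a fully correct image scores 0.0.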

Clutter in 2005, 2006 and 2007. For these three years, we introduced a class called "clutter" (C). Even though the test set contains images belonging to this class, their classification does not influence the error score for the challenge (see Table 4).
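How the clutter rule interacts with the final ranking score can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and data layout are hypothetical, the true label of a clutter image is assumed to be the string "C", and the score of a run is assumed to be the sum of per-image errors over the four label settings.

```python
from typing import Iterable, Mapping, Tuple

CLUTTER = "C"  # assumed label of the clutter class

# One entry per test image: (true label, error assigned to the prediction).
ImageResult = Tuple[str, float]


def setting_score(results: Iterable[ImageResult]) -> float:
    """Sum per-image errors for one label setting, ignoring clutter images."""
    return sum(error for true_label, error in results if true_label != CLUTTER)


def run_score(per_setting: Mapping[str, Iterable[ImageResult]]) -> float:
    """Sum the scores of all label settings into a single ranking value."""
    return sum(setting_score(results) for results in per_setting.values())


# Made-up numbers for illustration only.
toy_run = {
    "2005": [("12", 0.0), ("C", 1.0), ("7", 1.0)],
    "2006": [("C", 0.5), ("3", 0.5)],
}
```

With the toy data above, `run_score(toy_run)` adds 1.0 for the 2005 setting and 0.5 for the 2006 setting, while the errors recorded for the two clutter images are dropped.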

An example of the complete labeling of the released database is given in Table 5.

classified (2005-06)   error count
18                     0.0
21                     0.0
*                      0.0
C                      0.0

classified (2007)      error count
111                    0.000000
11*                    0.000000
1**                    0.000000
***                    0.000000
*C*                    0.000000

Participation

In 2009, seven groups from five nations on two continents participated in the medical annotation task, and 19 runs were submitted in total. In the following, we briefly describe the methods applied by the participating groups.

TAUbiomed. The Medical Image Processing Lab from Tel Aviv University in Israel submitted one run using a multiple-resolution, patch-based bag-of-visual-words approach. Classification is performed with support vector machines. The code hierarchy is completely neglected and no wildcards "*" were used.

Idiap. The Idiap Research Institute from Switzerland submitted four runs, reusing the strategies it had used in 2008. They consisted of different classification schemes for support vector machines, coupling two different image descriptors.

FEITIJS. The Faculty of Electrical Engineering and Information Technologies from the University of Skopje in Macedonia submitted one run. It is based on global and local image descriptors, which are classified using bagging and random forests.

VPA. The Computer Vision and Pattern Analysis Laboratory from Sabanci University in Turkey submitted five runs. They used local binary patterns as features and a support vector machine as classifier. They adopted a hierarchical approach considering, when applicable, the four IRMA code axes separately.

medGIFT. The medGIFT group from the University Hospitals of Geneva in Switzerland submitted three runs using different descriptors and voting schemes in the medGIFT image retrieval system.

DEU. The Dokuz Eylul University in Turkey participated with four runs. Different global and local features are extracted from the images, and classification is performed with a k-nearest neighbor algorithm.

IRMA. As in previous years, the Image Retrieval in Medical Applications (IRMA) group at RWTH Aachen University, Germany, provided a baseline run. It was defined using Tamura texture measures, cross-correlation features, and the image distortion model. The parametrization was left unchanged over all the years to provide a general reference. Therefore, the IRMA code hierarchy is disregarded.

Results and Discussion

The results of the challenge evaluation are given in Table 6, sorted by the sum of the error scores over the four yearly label settings. Considering the error score per year, the group ranking does not change, except for the first and second positions, which are exchanged between the Idiap and TAU groups in 2006.

In general, the results show that the top-performing runs do not consider the hierarchical structure of the given task (2007 and 2008 labels), but rather treat each individual code as one class and train a plain classifier. To assess the semantics captured in the code hierarchy, local rather than global image features are required to narrow the semantic gap [3]. In addition, the local features should be associated with segmented image objects rather than square patches.

Comparing the 2005 and 2006 results, we see a general decrease in the error score. A possible explanation is that in 2005 the 57 classes are broad, each one containing different sub-levels in terms of IRMA codes. This makes them difficult for a classifier to model in the training phase. On the other hand, comparing the 2007 and 2008 results, there is a general increase in the error score. This effect was expected: passing from 2007 to 2008, new classes are added at the same level of detail with respect to the IRMA code. Moreover, some of the new classes are poorly populated in the training set.

As a final remark, we notice that methods using patch-based local image descriptors and discriminative SVM classification outperform the other approaches.

We have presented the ImageCLEF 2009 medical image annotation task. This is its concluding round, and we organized it as a survey of the last four years' experience. We wanted to compare the scalability of different image classification techniques as the number of classes grows, the hierarchical structure becomes more complex, and sparsely populated classes appear. A plain classification scheme using support vector machines and local descriptors outperformed the other methods. The obtained scores are 852.8 for the best run, 1994.84 for the baseline, and 3979.8 for the worst run.

Acknowledgements

We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. The authors T. Tommasi and B. Caputo were supported by the EMMA project thanks to the Hasler foundation (www.haslerstiftung.ch). The IRMA project has been funded by the German Research Foundation (DFG), grants Le 1108/4, Le 1108/6, and De 1563/9.

[1] Thomas Deselaers and Thomas M. Deserno. Medical image annotation in ImageCLEF 2008. In Carol Peters, Danilo Giampiccolo, Nicola Ferro, Vivien Petras, Julio Gonzalo, Anselmo Peñas, Thomas Deselaers, Thomas Mandl, Gareth Jones, and Mikko Kurimo, editors, Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Aarhus, Denmark, September 2009, to appear.

[2] Thomas Deselaers, Henning Müller, Paul Clough, Hermann Ney, and Thomas Lehmann. The CLEF 2005 automatic medical image annotation task. International Journal of Computer Vision, 74(1):51-58, 2007.

[3] Thomas M. Deserno, Sameer Antani, and L. Rodney Long. Ontology of gaps in content-based image retrieval. J Digit Imaging, 22(2):202-15, 2008.

[4] Thomas M. Lehmann, Mark Oliver Güld, Thomas Deselaers, Daniel Keysers, Henning Schubert, Klaus Spitzer, Hermann Ney, and Berthold B. Wein. Automatic categorization of medical images for content-based retrieval and data mining. Comput Med Imaging Graph, 29(2):143-55, 2005.

[5] Thomas Lehmann, Henning Schubert, Daniel Keysers, Michael Kohnen, and Berthold Wein. The IRMA code for unique classification of medical images. In H. K. Huang and O. M. Ratib, editors, Medical Imaging 2003: PACS and Integrated Medical Information Systems: Design and Evaluation, volume 5033 of SPIE Proceedings, pages 440-451, San Diego, California, USA, May 2003.

[6] Henning Müller, Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer, Thomas M. Deserno, Paul Clough, and William Hersh. Overview of the ImageCLEFmed 2007 medical retrieval and annotation tasks. In Working Notes of the 2007 CLEF Workshop, Budapest, Hungary, September 2007.

[7] Henning Müller, Thomas Deselaers, Thomas Lehmann, Paul Clough, Eugene Kim, and William Hersh. Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks. In Working Notes of the 2006 CLEF Workshop, Alicante, Spain, September 2006.

[8] Henning Müller, Michoux, Bandon, and Geissbuhler. A review of content-based image retrieval systems in medical applications: clinical benefits and future directions. Int J Med Inform, 73(1):1-23, 2004.