
Overview of the ImageCLEFmed 2008 medical image retrieval task

Henning Müller (1, 2), henning.mueller@sim.hcuge.ch
Jayashree Kalpathy-Cramer (3)
Charles E. Kahn Jr. (4)
William Hatt (3)
Steven Bedrick (3)
William Hersh (3)

1 Medical Informatics, University Hospitals and University of Geneva, Switzerland
2 University of Applied Sciences Western Switzerland, Sierre, Switzerland
3 Oregon Health and Science University (OHSU), Portland, OR, USA
4 Department of Radiology, Medical College of Wisconsin, Milwaukee, WI, USA

General Terms: Measurement, Performance, Experimentation

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages - Query Languages

2008 was the fifth year for the medical image retrieval task of ImageCLEF, one of the most popular tracks within CLEF. Participation continued to increase in 2008. A total of 15 groups submitted 111 valid runs. Several requests for data access were also received after the registration deadline.

The most significant change in 2008 was the use of a new database containing images from the medical literature. These images, part of the Goldminer collection, were from the RSNA journals Radiology and Radiographics. Besides the images, the figure captions and the part of the caption referring to a particular subfigure were supplied to the participants. Access to the full-text articles in HTML was also provided, as was each article's Medline PMID (PubMed Identifier). An article's PMID could be used to obtain the officially assigned MeSH (Medical Subject Headings) terms. Unlike previous years, this year's collection was entirely in English, as it was obtained from English-language medical literature. However, the topics were, as in previous years, supplied in German, French, and English. The topics used in 2008 were a subset of the 85 topics used in 2005-2007. Thirty topics were made available, ten in each of three categories: visual, mixed, and semantic.

As in previous years, most groups concentrated on fully automatic retrieval. However, three groups submitted a total of seven manual or interactive runs; these runs did not show a substantial increase in performance over the automatic approaches. In previous years, multi-modal combinations were the most frequent submissions. However, in 2008 only half as many mixed runs as purely textual runs were submitted. Very few fully visual runs were submitted, and the ones submitted performed poorly. This may be explained in part by the heavily semantic nature of the 2008 topics.

The best MAP scores were very similar for textual and multi-modal approaches, whereas early precision performance was clearly better for the multi-modal approaches.


1 Introduction

ImageCLEF¹ [1, 2, 5] started within CLEF² (Cross Language Evaluation Forum [6]) in 2003. A medical image retrieval task was added in 2004 to explore domain-specific multilingual visual information retrieval and also multi-modal retrieval by combining visual and textual features for retrieval. A medical retrieval task and a medical image annotation task have been part of ImageCLEFmed since 2005 [5].

This paper reports on the medical retrieval task, whereas additional papers describe the four other tasks of ImageCLEF. More detailed information can also be found on the task web pages for ImageCLEFmed. A detailed analysis of a previous medical image retrieval task is available in [3].

2 The medical retrieval task in 2008

The main change in the medical retrieval task in 2008 was the use of a new database. The search tasks remained essentially the same as in the previous years. The collection distributed to the participants included the images and the captions, as published in the medical journals. URLs to access the full text of the journal article were also made available to the participants.

2.1 Registration and participation

As in previous years, registration for the medical retrieval task increased in 2008, albeit slowly. Several of the groups registered solely to obtain the test collection in order to use it as training data for their algorithms, rather than actually participating in the competition. In the end, 15 research groups submitted a total of 130 runs. Groups were asked not to submit more than ten runs in 2008 (different from previous years) so as not to bias the pools too much towards any single group.

There were significant problems with many of the 130 initial runs: some were submitted in incorrect formats; several runs were duplicated; and there were runs that provided search results for only a subset of the thirty topics. These problems were corrected in collaboration with the authors as much as was possible, resulting in 111 valid runs that were used to generate the pools that were finally judged for relevance. The following groups submitted valid runs:

Hungarian Academy of Sciences, Budapest, Hungary;
National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA;
Banja Luka University, Bosnia-Herzegovina;
MedGIFT group, University of Geneva, Switzerland;
Natural Language Processing group, University Hospitals of Geneva, Switzerland;
GPLSI group, University of Alicante, Spain;
Multimedia Modelling Group, LIG, Grenoble, France;
Natural Language Processing at UNED, Madrid, Spain;
Miracle group, Spain;
Oregon Health and Science University (OHSU), Portland, OR, USA;
IRIT Toulouse, France;
University of Jaen, Spain;
Tel Aviv University, Israel;
National University of Bogota, Colombia;
TextMess group, University of Alicante, Spain.

Thus, a total of 15 groups from eight countries and four continents submitted results, which are presented in the following sections.

¹ http://www.imageclef.org/
² http://www.clef-campaign.org/

2.2 Database

The database used for the task in 2008 was made available by the Radiological Society of North America (RSNA). The database contains in total slightly more than 66,000 images taken from the radiological journals Radiology and Radiographics. The images are original figures used in published articles. The collection is a subset of a larger database that is available via the Goldminer³ image search engine. For each image, the text of the figure caption was supplied as free text. However, this caption was sometimes associated with a multi-part image. For over 90% of the images, the part of the caption actually referring to this sub-image was also provided. Additionally, links to HTML versions of the full-text articles were provided along with the relevant PubMed accession ID numbers. Both the full-size images and thumbnails were available to the participants. All text was in English.

The contents of this database represent a broad and significant body of medical knowledge, which makes this year's competition a potentially realistic scenario for how clinicians might use image retrieval systems in the future.

2.3 Query topics

The query topics in 2008 were a selection of 30 topics from the previous three years of ImageCLEFmed [4]. Training data in the form of the 2005-2007 database with images, annotations, topics, sample query images and qrel files was made available to participants. All topics were supposed to cover at least two of the following axes:

Anatomic region shown in the image;

Image modality (x-ray, CT, MRI, gross pathology, ...);

View (frontal, sagittal, ...);

Pathology or disease shown in the image;

Abnormal visual observation (e.g., enlarged heart).

From the 85 possible topics of past years, similar topics were removed to cover a wide range of different modalities and anatomic regions. A visual and textual check was then performed to make sure that at least a few relevant images exist in the dataset. Since the databases of 2008 and 2007 were very different, we wanted to ensure that more than one relevant image existed for each topic.

Each query topic consists of the information need in three languages (English, French, German) and at least two example images. Groups could decide which language and media to use for the query processing and also which part of the text to use.

2.4 Relevance judgments

A new system for relevance judgments was introduced in 2008, built on the Ruby on Rails framework and allowing simple judgments via a web interface for all judges. The first 35 images of every run were combined into "pools" with an average size of around 900 images. Such pooling is necessary to reduce the amount of data to judge, and the bias can be regarded as very limited [7]. Medical doctors who are also students of biomedical informatics at OHSU were hired for the judgment process and paid by the hour for the judgments.
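The pooling step described above can be sketched as follows. This is a minimal illustration, not the code used by the organizers: the in-memory layout of the runs (a dictionary from run identifier to per-topic rankings) and the name `build_pools` are assumptions made for the example; only the pool depth of 35 comes from the text.

```python
from collections import defaultdict


def build_pools(runs, depth=35):
    """Union the top `depth` images of every run, separately for each topic.

    `runs` maps a run id to a dict {topic id: ranked list of image ids}
    (a hypothetical layout chosen for this sketch).
    Returns a dict {topic id: set of image ids to be judged}.
    """
    pools = defaultdict(set)
    for rankings_by_topic in runs.values():
        for topic, ranked_images in rankings_by_topic.items():
            pools[topic].update(ranked_images[:depth])
    return dict(pools)


# Toy example with two runs and a pool depth of 2 instead of 35.
runs = {
    "run_a": {1: ["img1", "img2", "img3"]},
    "run_b": {1: ["img2", "img4", "img5"]},
}
pools = build_pools(runs, depth=2)
print(sorted(pools[1]))  # ['img1', 'img2', 'img4']
```

Because images retrieved by several runs enter the pool only once, the pool for a topic is usually much smaller than the number of runs multiplied by the pool depth.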

A ternary judgment scheme was used, wherein each image in each pool was judged to be "relevant", "partly relevant", or "non-relevant". Images clearly corresponding to all criteria were judged as "relevant", images whose relevance could not be safely confirmed but could still be possible were marked as "partly relevant", and images for which one or more criteria of the topic were not met were marked as "non-relevant". Judges were instructed in these criteria and results were manually checked during the judgment process.

During the judging, the new system exhibited a minor problem that resulted in certain images losing their judgments. This resulted in a short delay in the judging process, after which the affected images were re-judged by the same persons.

3 Submissions and results

This section details the submissions for the tasks and a first brief evaluation. A more detailed evaluation of the techniques will follow in the final proceedings when more details on the techniques used for the submissions will be known. Unfortunately, information on the techniques used in the submissions is not always made available by the participants well ahead of time and in great detail.

trec_eval was used for the evaluation process, with most of its performance measures.

3.1 Submissions

A total of 130 runs were submitted via the electronic submission system. Scripts to check the validity of the runs were made available to participants ahead of the submission phase, but even so, almost half of the submitted runs contained errors in either content or format and required changes. Common mistakes included a wrong trec_eval format, use of only a subset of the topics and incorrect image identifiers. In collaboration with the authors a large number of runs were repaired, resulting in 111 valid runs taken into account for the pools.
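A check for the three common mistakes listed above might look like the following sketch. It is not the validation script distributed to participants; it assumes the usual six-column trec_eval run format (topic, the literal `Q0`, document identifier, rank, score, run tag), and the function name `validate_run` and the toy identifiers are hypothetical.

```python
def validate_run(lines, topics, known_images):
    """Return a list of error messages for a run given as text lines.

    `topics` is the set of expected topic ids (as strings) and
    `known_images` the set of valid image identifiers.
    """
    errors = []
    seen_pairs = set()
    topics_found = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            errors.append(f"line {number}: expected 6 fields, got {len(fields)}")
            continue
        topic, _q0, image, rank, score, _tag = fields
        if topic not in topics:
            errors.append(f"line {number}: unknown topic {topic}")
        if image not in known_images:
            errors.append(f"line {number}: unknown image identifier {image}")
        try:
            int(rank)
            float(score)
        except ValueError:
            errors.append(f"line {number}: rank or score is not numeric")
        if (topic, image) in seen_pairs:
            errors.append(f"line {number}: image {image} repeated for topic {topic}")
        seen_pairs.add((topic, image))
        topics_found.add(topic)
    missing = sorted(set(topics) - topics_found)
    if missing:
        errors.append("no results for topics: " + ", ".join(missing))
    return errors


# Toy data: two topics and two known images.
topics = {"1", "2"}
known_images = {"img1", "img2"}
good_run = ["1 Q0 img1 1 0.9 demo", "2 Q0 img2 1 0.8 demo"]
bad_run = ["1 Q0 img1 1 0.9 demo", "1 Q0 imgX 2 0.5 demo"]

print(validate_run(good_run, topics, known_images))  # []
for message in validate_run(bad_run, topics, known_images):
    print(message)
```

For the second toy run, the function reports the unknown image identifier and the missing results for topic 2.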

In total, only seven runs were "manual" or "interactive". There were also fewer "visual-only" runs than in all previous years, with only 8 such runs being submitted. The large majority were text-only runs, with 65 submissions. Mixed automatic runs had 31 submissions.

Groups subsequently had the chance to evaluate additional runs themselves, as the qrels were made available to participants two weeks ahead of the submission deadline for these working notes.

3.2 Visual retrieval

The number of visual runs in 2008 was much lower than in previous years, and visual techniques are not evolving as fast as textual retrieval techniques. Five groups submitted a total of eight runs in 2008. Performance as measured in MAP is very low for all these runs, reaching a maximum of 0.04 for the best run. Early precision averaged over all topics reaches around 0.2, which is acceptable. When taking into account only the visual topics these results are much better, whereas the purely semantic topics obtained extremely poor results.
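The two measures contrasted in this paragraph can be made concrete with a small sketch. The official figures were computed with trec_eval; the simplified functions below are only an illustration, they assume binary relevance and rankings without duplicate images, and their names and the toy data are made up for the example.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, image in enumerate(ranked, start=1):
        if image in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant images that are never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


def precision_at(ranked, relevant, k):
    """Early precision: fraction of the first k results that are relevant."""
    return sum(1 for image in ranked[:k] if image in relevant) / k


def mean_average_precision(run, qrels):
    """MAP over all topics in `qrels` ({topic: set of relevant ids})."""
    return sum(
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ) / len(qrels)


# Toy topic: three relevant images, one of which ("e") is never retrieved.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(round(average_precision(ranked, relevant), 4))  # 0.5556
print(precision_at(ranked, relevant, 2))              # 0.5
```

Average precision is divided by the total number of relevant images, so a run that finds only a few of them can still have reasonable early precision while its MAP stays low.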

Table 1 shows the results and particularly the large differences between the runs. Some runs managed to retrieve a larger part of the relevant images (809) but with a fairly low MAP, whereas some runs with a higher MAP only found a very small number of relevant images in the first 1000 results. A higher bpref in this context can mean that a larger number of images from these runs were not judged for relevance. This might also be due to the fact that only very few visual runs were submitted and thus only few visually retrieved documents were finally judged.
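The remark about bpref and unjudged images can be illustrated with a short sketch of the measure. It follows the commonly used formulation in which each retrieved relevant image is penalized by the number of judged non-relevant images ranked above it and unjudged images are skipped; it is a simplified illustration rather than the trec_eval implementation, and the function name and toy data are assumptions.

```python
def bpref(ranked, relevant, nonrelevant):
    """bpref for one topic.

    `relevant` and `nonrelevant` are the sets of judged images; anything
    in `ranked` that is in neither set is unjudged and is ignored.
    """
    num_rel = len(relevant)
    if num_rel == 0:
        return 0.0
    denominator = min(num_rel, len(nonrelevant))
    nonrel_above = 0
    total = 0.0
    for image in ranked:
        if image in relevant:
            if nonrel_above == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrel_above, num_rel) / denominator
        elif image in nonrelevant:
            nonrel_above += 1
        # Unjudged images neither count as hits nor as penalties.
    return total / num_rel


relevant = {"a", "b"}
nonrelevant = {"x", "y"}

# "u1" and "u2" were never judged, so they do not lower the score.
print(bpref(["u1", "a", "u2", "x", "b"], relevant, nonrelevant))  # 0.75
# The same relevant images behind judged non-relevant ones score lower.
print(bpref(["x", "a", "y", "b"], relevant, nonrelevant))         # 0.25
```

A run whose top ranks are filled with unjudged images is therefore treated more leniently by bpref than by MAP, which counts every unjudged image as non-relevant.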

[Table 1: Results of the visual runs. Only the run identifiers are recoverable: TAU MIPLAB-TAU norm UNAL-W+QE+JS GE GIFT8 MIPLAB-TAU orig etfbl-max11111 etfbl-sum11111 GE GIFT16 LSI UNED CEB Image]

Results of GIFT were available to all participants for combinations of visual and textual runs.

3.3 Textual retrieval

Purely automatic textual retrieval had by far the largest number of runs in 2008 with 65, more than half of all submitted runs. Table 2 shows the results for all submitted automatic text runs, ordered by MAP. Most performance measures, such as bpref and early precision, produce a similar ordering; only early precision sometimes differs significantly from the MAP ranking.

Runs from the University of Alicante (TextMess), University of Jaen (SINAI), and LIG Grenoble teams obtained the best results, mainly by using ontologies such as MeSH (Medical Subject Headings) to code the documents. A MAP of 0.29 could be obtained and several systems have a high score very close to this. A more detailed analysis is required with the exact techniques applied for each of the runs.

3.3.1 Using various languages for the retrieval

Unfortunately, very little information was available on which languages the groups used for the retrieval. It can be assumed that most groups used English as this promises the best results. It was also possible to use all three query languages together, for example, for extracting MeSH terms. While this multi-lingual approach is not necessarily a realistic scenario, it can lead to interesting results.

The HUG group used the same techniques with several languages and showed that English obtained by far the best results, better than either French or German. The technique they applied was to map MeSH terms from the text and queries in various languages. Through the PMIDs, the officially (manually) assigned MeSH terms of the articles were also available. The MeSH terms extracted from the article and query text performed worse for retrieval than the officially assigned terms.

3.3.2 Additional resources used for the retrieval

Groups could also state which additional resources were used for retrieval. The goal of this was to assemble a collection of available resources that could potentially be shared among participants to improve performance in future challenges. A large variety of resources were used, in large part for the combination of visual and textual runs, but also for purely textual runs. Many of the best runs used the ImageCLEFmed 2005-2007 data for training. Official MeSH terms manually assigned by the National Library of Medicine could be used through the PMIDs of the articles.

The most commonly used resources were the training data sets of ImageCLEF 2005-2007. There were numerous challenges with this approach, as the database used from 2005-2007 differed greatly from the 2008 database. The annotations in the 2005-2007 database were of much poorer quality than in the 2008 database, and the two databases were made up of very different types of images. Nonetheless, the 2008 topics were a subset of those from previous years' competitions, and so the scenario was somewhat realistic with respect to the training data.

3.4 Mixed retrieval

The promotion of mixed-media retrieval has always been one of the main goals of ImageCLEF. In past years, mixed-media retrieval had the highest submission rate. In 2008, however, only half as many mixed runs were submitted as purely textual runs.
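As a purely illustrative example of what a mixed run can look like, the sketch below combines a textual and a visual score list by weighted linear late fusion after min-max normalization. This is one common technique, chosen here as an assumption; the working notes do not state which fusion methods the participants used, and the function names, weights and scores are made up.

```python
def min_max_normalize(scores):
    """Rescale a {image id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {image: 1.0 for image in scores}
    return {image: (value - low) / (high - low) for image, value in scores.items()}


def late_fusion(text_scores, visual_scores, text_weight=0.8):
    """Rank images by a weighted sum of normalized textual and visual scores.

    Images missing from one modality get a score of 0 for that modality.
    """
    text = min_max_normalize(text_scores)
    visual = min_max_normalize(visual_scores)
    fused = {
        image: text_weight * text.get(image, 0.0)
        + (1.0 - text_weight) * visual.get(image, 0.0)
        for image in set(text) | set(visual)
    }
    # Sort by descending fused score, breaking ties by image id.
    return sorted(fused, key=lambda image: (-fused[image], image))


text_scores = {"a": 2.0, "b": 1.0, "c": 0.0}
visual_scores = {"c": 0.9, "b": 0.1, "d": 0.5}

print(late_fusion(text_scores, visual_scores, text_weight=0.8))  # ['a', 'b', 'c', 'd']
print(late_fusion(text_scores, visual_scores, text_weight=0.2))  # ['c', 'd', 'a', 'b']
```

With a high text weight the textual ranking is preserved and the visual scores mainly reorder close candidates; with a low text weight the visual component overturns the textual ranking, which shows how sensitive such combinations are to the quality of the visual scores.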

Table 3 shows the results for all submitted runs. It is clear that, for a large number of the runs, the MAP results for the mixed retrieval submissions were very similar to those from the purely textual retrieval systems. An interesting observation is that the mixed-media submissions often have higher early precision than the purely textual retrieval submissions. This confirms what has been previously observed.

The text-only runs exhibited relatively high correlation between MAP and bpref. This was not the case among the mixed-media runs. One possible explanation for this difference could be that the mixed-media runs used a wider variety of techniques than the text-only runs. Another possible explanation is that more of the mixed-media runs were submitted after the deadline for pool inclusion. If the mixed-media runs retrieved a higher proportion of non-judged images than the text-only runs, the result would be a larger MAP/bpref variance.
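The kind of correlation analysis referred to here can be sketched as follows, assuming the Pearson correlation coefficient is computed over per-run scores. The score lists below are invented for illustration and are not values from the result tables; the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Made-up per-run scores (one entry per run), for illustration only.
map_scores = [0.10, 0.20, 0.30]
bpref_scores = [0.15, 0.25, 0.35]

r = pearson_r(map_scores, bpref_scores)
print(round(r, 4), round(r * r, 4))  # 1.0 1.0
```

Squaring the coefficient gives the R² value, that is, the share of the variance in one measure that a linear relationship with the other measure explains.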

When comparing these mixed-media results with those from the text-only runs, it becomes clear that mixed retrieval can obtain very low results. From examining mixed-media runs which had corresponding text-only runs, it is particularly clear that combining good textual retrieval techniques with questionable visual retrieval techniques can negatively affect system performance. This demonstrates the difficulty of usefully integrating both textual and visual information, and the fragility that such combinations can introduce into retrieval systems. As seen in Figure 1, the distribution of MAP for the textual runs was higher than that for the mixed runs. A significant mode exists around a MAP of 0.05 for the mixed runs, while the modes for the textual runs are at 0.15 and 0.28.

[Figure 1: Distribution of MAP for the textual and the mixed runs.]

This year, as in previous years, interactive retrieval was only used by a very small number of participants. Interactive retrieval is extremely important, and it is a pity so few groups chose to attempt anything other than purely automated systems.

[Table 4: Results of the manual and interactive runs. Only the run identifiers are recoverable: ohsu int 2 ohsu sdb full interactive ohsu sdb lsa CEB ITD ALL CEB IBaseM CEB TD ALL CEB TD3]

Table 4 shows the results of all manual and interactive runs submitted. Two runs from OHSU had fairly good results; the other runs were competitive in neither MAP nor early precision when compared to the fully automatic runs. In general, MAP and early precision were well correlated (R² = 0.82 for textual runs, 0.68 for mixed runs); these two runs, however, had higher early precision than their MAP would predict.

3.6 Topic Analysis

Overall, most groups performed significantly better on the semantic topics than on the mixed or visual topics, as can be seen in the table below. Topics 6 and 11-18 were quite difficult for many participants. Table 5 gives an overview of the best and average performance per topic. Some topics with a small number of relevant images have a particularly low performance.

The fact that many of the visual topics obtained poorer performance than the semantic topics also shows that groups have much more experience working on semantic topics, and that visual retrieval currently has much more difficulty obtaining good results. That said, visual retrieval can have an important positive influence, and it seems necessary to promote it further by having potentially a larger number of visual topics to push groups towards using visual techniques.

Four topics were each judged by two judges. We performed tests of inter-rater agreement using kappa statistics, as seen in Table 6. In three of the four cases, the inter-rater agreement was quite good. In the last case, one judge interpreted the query more strictly than the other.
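An agreement test of this kind can be sketched with the unweighted Cohen's kappa for two judges. The text does not say which kappa variant was used, so this choice is an assumption, and the judgments in the example are invented rather than taken from the actual pools.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two judges labelling the same images."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty set of images")
    n = len(labels_a)
    # Observed agreement: fraction of images given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented judgments for four images of one topic.
judge_a = ["relevant", "relevant", "non-relevant", "non-relevant"]
judge_b = ["relevant", "non-relevant", "non-relevant", "non-relevant"]
print(cohens_kappa(judge_a, judge_b))  # 0.5
```

In this toy case the second judge is stricter than the first: the raw agreement is 75%, but kappa is only 0.5 once agreement expected by chance is removed.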

4 Conclusions

The focus of many participants in ImageCLEF 2008 has been text-based retrieval. The increasingly semantic topics combined with a database containing high-quality annotations in 2008 may have resulted in less impact of using visual techniques as compared to previous years. This tendency is also seen when looking at the performance by topic, where visual topics had significantly lower results than the semantic topics. Our goal in the upcoming ImageCLEF medical retrieval task is to increase the number of visual runs submitted. We hope to modify the task to favor more integrated approaches. Another important aspect is that interactive retrieval has always had poor participation and definitely needs to be given more attention. Relevance feedback and query modifications have the potential to significantly improve results, but of course research favors laboratory-style evaluations.

Visual runs were rare, and no single visual run had a very convincing performance, in contrast to 2007, when the best visual runs performed extremely well. Mixed-media runs were very similar in performance to textual runs when looking at MAP. The only difference was that mixed-media runs obtained better early precision in general. Several mixed-media runs were also broken, resulting in a very poor performance. This highlights that the combination is still not very stable.

A per-topic analysis shows that visual topics obtained lower average results than semantic topics. The analysis also shows that several topics with very few relevant images have a very low average performance, whereas topics with a larger number of relevant images seem to perform better.

Acknowledgements

We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. The images for the 2008 ImageCLEFmed challenge were contributed by the Radiological Society of North America (RSNA). This work was partially funded by the Swiss National Science Foundation (FNS) under contract 205321-109304/1, the American National Science Foundation (NSF) with grant ITR-0325160, and by the University of Applied Sciences Western Switzerland (HES-SO) in the context of the BeMeVIS project.

References

[1] Paul Clough, Michael Grubinger, Thomas Deselaers, Allan Hanbury, and Henning Müller. Overview of the ImageCLEF 2006 photo retrieval and object annotation tasks. In CLEF 2006 Proceedings, volume 4730 of Springer Lecture Notes in Computer Science, pages 579-594, 2007.

[2] Paul Clough, Henning Müller, and Mark Sanderson. The CLEF cross-language image retrieval track (ImageCLEF) 2004. In Carol Peters, Paul Clough, Julio Gonzalo, Gareth J. F. Jones, Michael Kluck, and Bernardo Magnini, editors, Multilingual Information Access for Text, Speech and Images: Result of the fifth CLEF evaluation campaign, volume 3491 of Lecture Notes in Computer Science (LNCS), pages 597-613, Bath, UK, 2005. Springer.

[3] William Hersh, Henning Müller, Jeffery Jensen, Jianji Yang, Paul Gorman, and Patrick Ruch. Advancing biomedical image retrieval: Development and analysis of a test collection. Journal of the American Medical Informatics Association, September/October: 488-496, 2006.

[4] William Hersh, Henning Müller, and Jayashree Kalpathy-Cramer. The ImageCLEFmed medical image retrieval task test collection. Journal of Digital Imaging, 2008.

[5] Henning Müller, Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer, Thomas M. Deserno, Paul Clough, and William Hersh. Overview of the ImageCLEFmed 2007 medical retrieval and annotation tasks. In CLEF 2007 Proceedings, volume 5152 of Lecture Notes in Computer Science (LNCS), Budapest, Hungary, 2008. Springer.

[6] Jacques Savoy. Report on CLEF-2001 experiments. In Report on the CLEF Conference 2001 (Cross Language Evaluation Forum), pages 27-43, Darmstadt, Germany, 2002. Springer LNCS 2406.

[7] Justin Zobel. How reliable are the results of large-scale information retrieval experiments? In W. Bruce Croft, Alistair Moffat, C. J. van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307-314, Melbourne, Australia, August 1998. ACM Press, New York.