<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The CLEF 2005 Cross-Language Image Retrieval Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Clough</string-name>
          <email>p.d.clough@sheffield.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henning Mueller</string-name>
          <email>henning.mueller@sim.hcuge.ch</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Deselaers</string-name>
          <email>deselaers@cs.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Grubinger</string-name>
          <email>michael.grubinger@research.vu.edu.au</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Lehmann</string-name>
          <email>lehmann@computer.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeffery Jensen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Hersh</string-name>
          <email>hersh@ohsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biomedical Informatics, Oregon Health and Science University</institution>
          ,
          <addr-line>Portland, Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Studies, Sheffield University</institution>
          ,
          <addr-line>Sheffield</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Medical Informatics, Medical Faculty, Aachen University of Technology (RWTH)</institution>
          ,
          <addr-line>Pauwelsstr. 30, Aachen D-52057</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen University</institution>
          ,
          <addr-line>D-52056 Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Medical Informatics Service, Geneva University and Hospitals</institution>
          ,
          <addr-line>Geneva</addr-line>
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>School of Computer Science and Mathematics, Victoria University</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>University of Amsterdam, Informatics department</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The purpose of this paper is to outline efforts from the 2005 CLEF cross-language image retrieval campaign (ImageCLEF). The aim of this CLEF track is to explore the use of both text- and content-based retrieval methods for cross-language image retrieval. Four tasks were offered in the ImageCLEF track: ad-hoc retrieval from an historic photographic collection, ad-hoc retrieval from a medical collection, an automatic image annotation task, and a user-centered (interactive) evaluation task that is explained in the iCLEF summary. 24 research groups from a variety of backgrounds and nationalities (14 countries) participated in ImageCLEF. In this paper we describe the ImageCLEF tasks, submissions from participating groups and summarise the main findings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        ImageCLEF7 conducts evaluation of cross-language image retrieval and is run as part of the
Cross Language Evaluation Forum (CLEF) campaign. The ImageCLEF retrieval benchmark was
established in 2003 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and run again in 2004 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with the aim of evaluating image retrieval from
multilingual document collections. Images by their very nature are language independent, but
they are often accompanied by texts semantically related to the image (e.g. textual captions or
metadata). Images can then be retrieved using primitive features based on the pixels which form the
contents of an image (e.g. using a visual exemplar), abstracted features expressed through text,
or a combination of both. The language used to express the associated texts or textual queries
should not affect retrieval, i.e. an image with a caption written in English should be searchable in
languages other than English.
      </p>
      <p>ImageCLEF provides tasks for both system-centered and user-centered retrieval evaluation
within two main areas: retrieval of images from photographic collections and retrieval of images
from medical collections. These domains offer realistic scenarios in which to test the performance
of image retrieval systems, presenting different challenges and problems to participating research
groups. A major goal of ImageCLEF is to investigate the effectiveness of combining text and
image for retrieval and to promote the exchange of ideas which may help improve the performance
of future image retrieval systems.</p>
      <p>ImageCLEF has already seen participation from both academic and commercial research groups
worldwide from communities including: Cross-Language Information Retrieval (CLIR), Content-Based
Image Retrieval (CBIR), medical information retrieval and user interaction. We provide
participants with the following: image collections, representative search requests (expressed by
both image and text) and relevance judgements indicating which images are relevant to each search
request. Campaigns such as CLEF and TREC have proven invaluable in providing standardised
resources for comparative evaluation for a range of retrieval tasks, and ImageCLEF aims to provide
the research community with similar resources for image retrieval. In the following sections of this
paper we describe each search task separately: section 2 describes ad-hoc retrieval from historic
photographs, section 3 ad-hoc retrieval from medical images, and section 4 the automatic
annotation of medical images. For each we briefly describe the test collections, the search tasks,
participating research groups, results and a summary of the main findings.
7 See http://ir.shef.ac.uk/imageclef/</p>
    </sec>
    <sec id="sec-2">
      <title>Ad-hoc Retrieval from Historic Photographs</title>
      <sec id="sec-2-1">
        <title>Aims and Objectives</title>
        <p>
          This is a bilingual ad-hoc retrieval task in which a system is expected to match a user's one-time
query against a more or less static collection (i.e. the set of documents to be searched is known
prior to retrieval, but the search requests are not). Similar to the task run in previous years (see,
e.g. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]), the goal of this task is, given multilingual text queries, to retrieve as many relevant images as
possible from the provided image collection (the St. Andrews collection of historic photographs).
Queries for images based on abstract concepts rather than visual features are predominant in this
task. This limits the effectiveness of using visual retrieval methods alone, as either these concepts
cannot be extracted using visual features and require extra external semantic knowledge (e.g. the
name of the photographer), or images with different visual properties may be relevant to a search
request (e.g. different views of Rome). However, based on feedback from participants in 2004, the
search tasks for 2005 were designed to reflect more visually-based queries.
        </p>
        <p>Fig. 1. An example caption from the St. Andrews collection. Short title: Rev William Swan. Long title: Rev William Swan. Location: Fife, Scotland. Description: Seated, 3/4 face studio portrait of a man. Date: ca.1850. Photographer: Thomas Rodger. Categories: [ ministers ][ identified male ][ dress - clerical ]. Notes: ALB6-85-2 jf/pc BIOG: Rev William Swan ( ) ADD: Former owners of album: A Govan then J J? Lowson. Individuals and other subjects indicative of St Andrews provenance. By T. R. as identified by Karen A. Johnstone "Thomas Rodger 1832-1883. A biography and catalogue of selected works".</p>
        <p>The St. Andrews collection consists of 28,133 images, all of which have associated textual captions
written in British English (the target language). The captions consist of 8 fields including title,
photographer, location, date and one or more pre-defined categories (all manually assigned by
domain experts). For example, see Fig. 1. Further examples can be found in [?] and the St.
Andrews University Library8. We provided participants with 28 topics (titles shown in Table 11
and an example image shown in Fig. 5), the main themes based on analysis of log files from
a web server at St. Andrews university, knowledge of the image collection and discussions with
maintainers of the image collection. After identifying these main themes, we modified queries
to test various aspects of cross-language and visual search and used a custom-built IR system
to identify suitable topics (in particular those topics with an estimated 20 or more relevant</p>
        <sec id="sec-2-1-1">
          <p>8 http://www-library.st-andrews.ac.uk/</p>
          <p>
            images). A complexity score was developed by the authors to categorise topics with respect to
linguistic complexity [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
          <p>Each topic consists of a title (a short sentence or phrase describing the search request in a few
words) and a narrative (a description of what constitutes a relevant or non-relevant image for that
search request). In addition to the text description for each topic, we also provided two example
images which we envisage could be used for relevance feedback (both manual and automatic)
and query-by-example searches9. Both topic titles and narratives have been translated into the
following languages: German, French, Italian, Spanish (European), Spanish (Latin American),
Chinese (Simplified), Chinese (Traditional) and Japanese. Translations have also been produced
for the titles only and these are available in 25 languages including: Russian, Croatian, Bulgarian,
Hebrew and Norwegian. All translations have been provided by native speakers and verified by at
least one other native speaker.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Creating Relevance Assessments</title>
        <p>Relevance assessments were performed by staff at the University of Sheffield (the majority
unfamiliar with the St. Andrews collection but given training and access to the collection through
our IR system). The top 50 results from all submitted runs (349) were used to create image pools,
giving an average of 1,376 (max: 2,193 and min: 760) images to judge per topic. The authors
judged all topics to create a "gold standard" and at least two further assessments were obtained
for each topic. Assessors used a custom-built tool to make judgements accessible on-line, enabling
them to log in when and where convenient. We asked assessors to judge every image in the topic
pool, but also to use interactive search and judge: searching the collection using their own queries
to supplement the image pools with further relevant images.</p>
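        <p>The pooling step described above can be sketched as follows. This is an illustrative sketch only, assuming in-memory ranked lists per run; the actual assessment tooling and run-file format are not shown in the paper.</p>

```python
# Sketch of top-k pooling for one topic (data structures are illustrative):
# the pool is the union of the top k images over all submitted runs,
# so duplicates across runs are judged only once.

def build_pool(runs, k=50):
    """runs: list of ranked image-id lists (one list per submitted run)."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool

runs = [["img1", "img2", "img3"], ["img2", "img4"]]
print(sorted(build_pool(runs, k=2)))  # ['img1', 'img2', 'img4']
```

        <p>With 349 runs and k = 50, the %max figure reported later is simply the pool size divided by the theoretical maximum of 349 x 50 unique images.</p>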
        <p>The assessment of images in this ImageCLEF task is based on a ternary classification
scheme: (1) relevant, (2) partially relevant and (3) not relevant. The aim of the ternary scheme is
to help assessors make their relevance judgements more accurate (e.g. an image is definitely
relevant in some way, but perhaps the query object is not directly in the foreground: it is therefore
considered partially relevant). Relevance assessments for the more general topics are based entirely
on the visual content of images (e.g. "aircraft on the ground"). However, certain topics also require
the use of the caption to make a confident decision (e.g. "pictures of North Street St Andrews").
What constitutes a relevant image is a subjective decision, but typically a relevant image will have
the subject of the topic in the foreground, the image will not be too dark in contrast, and perhaps
the caption confirms the judge's decision.</p>
        <p>Based on these judgements, various combinations are used to create the set of relevant images
and, as in previous years, we used the pisec-total set: those images judged as relevant or partially
relevant by the topic creator and at least one other assessor. These are then used to evaluate system
performance and compare submissions. The size of pools and number of relevant images is shown
in Table 11 (the %max indicating the pool size compared to the maximum possible pool size, i.e.
if all top 50 images from each submission were unique).</p>
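        <p>The pisec-total combination rule can be made concrete with a small sketch (the function name and judgement encoding are our own illustration, not the actual tooling): an image enters the relevant set if the topic creator judged it relevant or partially relevant and at least one other assessor agreed.</p>

```python
# Hypothetical sketch of the "pisec-total" relevant set. Judgement codes
# follow the ternary scheme: 1 = relevant, 2 = partially relevant, 3 = not relevant.

RELEVANTISH = {1, 2}

def pisec_total(creator_judgements, other_judgements):
    """creator_judgements: dict mapping image id to the topic creator's code;
    other_judgements: list of such dicts, one per additional assessor."""
    result = set()
    for image, code in creator_judgements.items():
        if code in RELEVANTISH:
            # Require agreement from at least one other assessor.
            for judgements in other_judgements:
                if judgements.get(image, 3) in RELEVANTISH:
                    result.add(image)
                    break
    return result

creator = {"a": 1, "b": 2, "c": 3, "d": 1}
others = [{"a": 3, "b": 1, "d": 3}, {"d": 2}]
print(sorted(pisec_total(creator, others)))  # ['b', 'd']
```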
      </sec>
      <sec id="sec-2-3">
        <title>Participating Groups</title>
        <p>In total, 19 groups registered for this task and 11 ended up submitting (including 5 new groups
compared to last year) a total of 349 runs (all of which were evaluated). Participants were given
queries and relevance judgements from 2004 as training data and access to a default CBIR system
(GIFT/Viper). Submissions from participants are briefly described in the following.
CEA: CEA from France, submitted 9 runs. Experimented with 4 languages, title and title+narrative,
and merging between modalities (text and image). The merging is simply based on normalised scores obtained by
each search and is conservative (results obtained using visual topics and the CBIR system are used only to
reorder results obtained using textual topics).</p>
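        <p>A conservative score merge of the kind CEA describes might look like the following sketch. The function names, the max-based normalisation and the weight alpha are our assumptions for illustration; the key property from the description is that visual scores only reorder images already retrieved by the text search.</p>

```python
# Illustrative conservative text/visual merge: text scores decide which
# images appear in the result list, normalised visual scores only reorder them.

def normalise(scores):
    """Scale scores into [0, 1] by the maximum (assumed normalisation)."""
    top = max(scores.values())
    return {doc: s / top for doc, s in scores.items()} if top > 0 else scores

def conservative_merge(text_scores, visual_scores, alpha=0.8):
    text_n = normalise(text_scores)
    visual_n = normalise(visual_scores)
    # Only images retrieved by the text search are kept; visual-only hits
    # (e.g. "z" below) never enter the ranking.
    merged = {doc: alpha * s + (1 - alpha) * visual_n.get(doc, 0.0)
              for doc, s in text_n.items()}
    return sorted(merged, key=merged.get, reverse=True)

text = {"a": 10.0, "b": 8.0, "c": 1.0}
visual = {"b": 5.0, "z": 9.0}
print(conservative_merge(text, visual))  # ['a', 'b', 'c']
```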
        <sec id="sec-2-3-1">
          <p>9 See http://ir.shef.ac.uk/imageclef2005/adhoc.htm for an example</p>
          <p>NII: National Institute of Informatics from Japan, submitted 16 runs with 3 languages. These experiments
aimed to see whether the inclusion of a learned word-association model - a model which represents how
words are related - can help find relevant images in an ad-hoc CLIR setting. To do this, basic unigram
language models were combined with differently estimated word-association models that perform soft
word expansion. In addition, simple keyword-matching language models were combined with the above soft
word-expansion language models at the model-output level. All runs were text only.</p>
          <p>Alicante: University of Alicante (Computer Science) from Spain, submitted 62 runs (including 10 joint
runs with UNED and Jaen). They experimented with 13 languages using title, automatic query expansion
and text only. Their system combines probabilistic information with ontological information and a feedback
technique. Several information streams are created using different sources: stems, words and stem bigrams,
with the final result obtained by combining them. An ontology was created automatically from the St.
Andrews collection to relate a query with several image categories. Four experiments were carried out
to analyse how different features contribute to retrieval results. Moreover, a voting-based strategy was
developed joining three different systems of participating universities: University of Alicante, University
of Jaen and UNED.</p>
          <p>CUHK: Chinese University of Hong Kong, submitted 36 runs for English and Chinese (simplified). CUHK
experimented with title, title+narrative and using visual methods to rerank search results (visual features
are composed of two parts: DCT coefficients and colour moments, with a dimension of 9). Various IR
models were used for retrieval (trained on 2004 data), together with query expansion. The LDC Chinese segmentor
is used to extract words from Chinese queries, which are then translated into English using a dictionary.
DCU: Dublin City University (Computer Science) from Ireland, submitted 33 runs for 11 languages.
All runs were automatic using title only. Standard OKAPI was used, incorporating stop-word removal, suffix
stripping and query expansion using pseudo-relevance feedback. Their main focus of participation was to
explore an alternative approach to combining text and image retrieval in an attempt to make use of the
information provided by the query image. Separate ranked lists, returned using text retrieval without feedback
and image retrieval based on standard low-level colour, edge and texture features, were examined to
find documents returned by both methods. These documents were then assumed to be relevant and used
for text-based pseudo-relevance feedback and retrieval as in their standard method.</p>
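          <p>The intersection idea DCU describes can be sketched briefly (the function name and the pool depth are our assumptions): documents appearing near the top of both the text run and the visual run are taken as the assumed-relevant set that seeds pseudo-relevance feedback.</p>

```python
# Hedged sketch: documents returned by both the text and visual rankings
# (within some depth) are assumed relevant for feedback purposes.

def assumed_relevant(text_ranking, visual_ranking, depth=100):
    text_top = set(text_ranking[:depth])
    visual_top = set(visual_ranking[:depth])
    return text_top.intersection(visual_top)

text_run = ["d1", "d2", "d3", "d4"]
visual_run = ["d3", "d9", "d2"]
print(sorted(assumed_relevant(text_run, visual_run)))  # ['d2', 'd3']
```

          <p>The resulting set would then feed the usual query-expansion step in place of the top-ranked text results.</p>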
          <p>Geneva: University Hospitals Geneva from Switzerland, submitted 2 runs based on visual retrieval only
(automatic and no feedback).</p>
          <p>Indonesia: University of Indonesia (Computer Science), submitted 9 runs using Indonesian queries only.
They experimented with using title and title+narrative, with and without query expansion and combining
text and image retrieval (all runs automatic).</p>
          <p>MIRACLE: Daedalus and Madrid University from Spain, submitted 106 runs for 23 languages. All runs
were automatic, using title only, no feedback and text-based only.</p>
          <p>NTU: National Taiwan University from Taiwan, submitted 7 runs for Chinese (traditional) and
English (also included a visual-only run). All runs are automatic and NTU experimented with using query
expansion, using title and title+narrative and combining visual and text retrieval.</p>
          <p>Jaen: University of Jaen (Intelligent Systems) from Spain, submitted 64 runs in 9 languages (all
automatic). Jaen experimented with title and title+narrative, with and without feedback, and with
combining text and visual retrieval, as well as with term weighting and the use of pseudo-relevance
feedback.</p>
          <p>UNED: UNED from Spain, submitted 5 runs for Spanish (both Latin American and European) and
English. All runs were automatic, title, text only and with feedback. UNED experimented with three different
approaches: i) a naive baseline using a word-by-word translation of the title topics; ii) a strong baseline
based on Pirkola's work; and iii) a structured query using the named entities with field search operators
and Pirkola's approach.</p>
          <p>Participants were asked to categorise their submissions by the following dimensions: query
language, type (automatic or manual), use of feedback (typically relevance feedback is used for
automatic query expansion), modality (text only, image only or combined) and the initial query
(visual only, title only, narrative only or a combination). A summary of submissions by these
dimensions is shown in Table 1. No manual runs were submitted this year, and a large
proportion are text only using just the title. Together with 41% of submissions using query expansion,
this coincides with the large number of query languages offered this year and the focus on query
translation by participating groups (although 6 groups submitted runs involving CBIR). An
interesting submission this year was the combined effort of Jaen, UNED and Alicante to create
an approach based on voting for images. Table 2 provides a summary of submissions by query
language. At least one group submitted for each language, the most popular (non-English) being
French, German and Spanish (European).
Results for submitted runs were computed using the latest version of trec_eval10 from NIST
(v7.3). From the scores output, the four chosen to evaluate submissions are Mean Average Precision
(MAP), precision at result 10 (P10), precision at result 100 (P100) and the number of relevant
images retrieved (RelRet), from which we compute recall (the proportion of relevant images retrieved).
Table 3 summarises the top performing systems in the ad-hoc task based on MAP. Whether MAP
is the best score to rank image retrieval systems is debatable, hence our inclusion of the P10 and
P100 scores. The highest English (monolingual) retrieval score is 0.4135, with a P10 of 0.5500 and
P100 of 0.3197. On average recall is high (0.8434), but MAP and P10 are low, indicating that relevant
images are likely retrieved at lower rank positions. The highest monolingual score is obtained using
combined visual and text retrieval and relevance feedback.</p>
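          <p>For readers unfamiliar with the reported measures, the per-topic versions can be sketched as follows (trec_eval computes these and averages them over all topics; the data here is illustrative).</p>

```python
# Minimal sketch of the evaluation measures for a single topic, given a
# ranked result list and the set of relevant images from the qrels.

def average_precision(ranking, relevant):
    """Precision at each relevant hit, averaged over all relevant images
    (unretrieved relevant images contribute zero)."""
    hits, precision_sum = 0, 0.0
    for i, image in enumerate(ranking, start=1):
        if image in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def precision_at(ranking, relevant, k):
    """Fraction of the top k results that are relevant (P10, P100, ...)."""
    return sum(1 for image in ranking[:k] if image in relevant) / k

def recall(ranking, relevant):
    """Proportion of relevant images retrieved (RelRet / total relevant)."""
    return len(relevant.intersection(ranking)) / len(relevant)

ranking = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c", "z"}
print(average_precision(ranking, relevant))  # (1/1 + 2/3 + 3/5) / 4
```

          <p>MAP is then the mean of the per-topic average precision values over the topic set.</p>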
          <p>The highest cross-language MAP is for Chinese (traditional) in the NTU submission, which is
97% of the highest monolingual score. Retrieval performance is variable across languages, with some
performing poorly, e.g. Romanian, Bulgarian, Czech, Croatian, Finnish and Hungarian. Although
these languages did not have translated narratives available for retrieval, it is more likely that low
performance results from the limited availability of translation and language processing resources and
difficult language structure (e.g. results from CLEF 2004 showed Finnish to be a very challenging
language due to its complex morphology). Hungarian performs the worst at 23% of monolingual.
However, it is encouraging to see participation at CLEF for these languages. On average, MAP
for English is 0.2084 (0.3933 P10; 0.6454 recall) and across all languages is 0.2009 (0.2985 P10;
0.5737 recall) - see Table 4.
The variety of submissions in the ad-hoc task this year has been pleasing, with a number of
groups experimenting with both visual and text-based retrieval methods and combining the two
(although the number of runs submitted as combined is much lower than in 2004). As in 2004, the
combination of text and visual retrieval appears to give the highest retrieval effectiveness (based on
MAP), indicating this is still an area for research. We aimed to offer a wider range of languages,
of which 13 have submissions from at least two groups (compared to 10 in 2004). It would seem
that the focus for many groups in 2005 has been translation, with more use made of both title
and narrative than in 2004. However, it is interesting to see languages such as Chinese (traditional)
and Spanish (Latin American) perform above European languages such as French, German and
Spanish (European), which performed best in 2004.
10 http://trec.nist.gov/trec_eval/trec_eval.7.3.tar.gz</p>
          <p>Although topics were designed to be more suited to visual retrieval methods (based on
comments from participants in 2004), the topics are still dominated by semantics and background
knowledge; pure visual similarity still plays a less significant role. The current ad-hoc task is not
well suited to purely visual retrieval because colour information, which typically plays an important
role in CBIR, is ineffective due to the nature of the St. Andrews collection (historic photographs).
Also, unlike typical CBIR benchmarks, the images in the St. Andrews collection are very complex,
containing objects in both the foreground and background which prove indistinguishable to CBIR
methods. Finally, the relevant image set is visually diverse for some queries (e.g. different views of
a city), making visual retrieval methods ineffective. This highlights the importance of using
text-based IR methods, either on associated metadata alone or combined with visual features. Relevance
feedback (in the form of automatic query expansion) still plays an important role in retrieval, as
also demonstrated by submissions in 2004: a 17% increase in 2005 and 48% in 2004.</p>
          <p>We are aware that research in the ad-hoc task using the St. Andrews collection has probably
reached a plateau. There are obvious limitations with the existing collection: mainly black and
white images, domain-specific vocabulary used in associated captions, a restricted retrieval scenario
(i.e. searches for historic photographs), and the fact that only experiments with a single target language
(English) are possible (i.e. further bilingual pairs cannot be tested). To address these and widen the image
collections available to ImageCLEF participants, we have been provided with access to a new
collection of images from a personal photographic collection with associated textual descriptions
in German and Spanish (as well as English). This is planned for use in the ImageCLEF 2006
ad-hoc task.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Ad-hoc Retrieval from Medical Image Collections</title>
      <sec id="sec-3-1">
        <title>Goals and objectives</title>
        <p>
          Domain-specific information retrieval is becoming increasingly important, and this holds especially
true for the medical field, where patients as well as clinicians and researchers have their particular
information needs [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Whereas information needs and retrieval methods for textual documents
have been well researched, only a small amount of information is available on the need to
search for images [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and even less so for the use of images in the medical domain. ImageCLEFmed
is creating resources to evaluate information retrieval tasks on medical image collections. This
process includes the creation of image collections and query tasks, and the definition of correct
retrieval results for these tasks for system evaluation. Some of the tasks have been based on surveys
of medical professionals and how they use images [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Much of the basic structure is similar to the non-medical ad-hoc task, such as the general
outline, the evaluation procedure and the relevance assessment tool used. These similarities will
not be described in any detail in this section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Data sets used and query topics</title>
        <p>
          In 2004, only the Casimage11 dataset was made available to participants [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], containing almost
9,000 images from 2,000 cases [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and 26 query topics with relevance judgements from three medical
experts. It is also part of the 2005 collection. Images present in the dataset mostly come from radiology
modalities, but photographs, PowerPoint slides and illustrations are also included. Cases are mainly in French,
with around 20% in English. We were also allowed to use the PEIR12 (Pathology Education
Instructional Resource) database with annotations from the HEAL13 project (Health Education
Assets Library, mainly pathology images [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]). This dataset contains over 33,000 images with
English annotation, the annotation being in XML per image rather than per case as in Casimage.
The nuclear medicine database of MIR, the Mallinckrodt Institute of Radiology14 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], was also
made available to us for ImageCLEF. This dataset contains over 2,000 images, mainly from nuclear
medicine, with annotations per case and in English.
11 http://www.casimage.com/
12 http://peir.path.uab.edu/
13 http://www.healcentral.com/
14 http://gamma.wustl.edu/home.html
15 http://alf3.urz.unibas.ch/pathopic/intro.htm
Finally, the PathoPic15 collection (Pathology
images [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) was included in our dataset. It contains 9,000 images with extensive
annotations per image in German. Part of the German annotation has been translated into English, but the
translation is still incomplete. This means that a total of more than 50,000 images was made available, with
annotations in three different languages. Two collections have case-based annotations, whereas the other
two have image-based annotations. Only because the copyright holders granted access to the data
were we able to distribute these images to the participating research groups.
        </p>
        <p>The image topics were based on a small survey at OHSU. Based on this survey, the topics were
developed along the following main axes:
- Anatomic region shown in the image;
- Image modality (x-ray, CT, MRI, gross pathology, ...);
- Pathology or disease shown in the image;
- Abnormal visual observation (e.g. enlarged heart).
As the goal was clearly to accommodate both visual and textual research groups, we developed
a set of 25 topics containing three different groups of queries: queries that are expected to be
solvable with a visual retrieval system (topics 1-12), topics where both text and visual features are
expected to perform well (topics 13-23), and semantic topics, where visual features are not expected
to improve results. All query topics were of a higher semantic level than the 2004 topics because
the automatic annotation task provides a testbed for purely visual retrieval/classification. All 25
topics contain one to three images, and one query also includes an image as negative feedback. The query text
was given out with the images in the three languages present in the collections: English, German,
and French. An example of a visual query of the first category can be seen in Figure 2.</p>
        <p>Fig. 2. Show me chest CT images with emphysema. / Zeige mir Lungen CTs mit einem Emphysem. / Montre-moi des CTs pulmonaires avec un emphysème.</p>
        <p>A query topic that will require more than purely visual features can be seen in Figure 3.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Relevance judgements</title>
        <p>The relevance assessments were performed at OHSU in Portland, Oregon. A simple interface,
reused from previous ImageCLEF relevance assessments, was employed. Nine judges, mainly medical doctors and one
image-processing specialist, performed the relevance judgements. Due to a lack of resources, only
some of the topics could be judged by more than one person.</p>
        <p>Fig. 3. Show me all x-ray images showing fractures. / Zeige mir Röntgenbilder mit Brüchen. / Montre-moi des radiographies avec des fractures.</p>
        <p>To create the image pools for the judgements, the first 40 images of each submitted run were
taken into account, creating pools with an average size of 892 images. The largest pool size was
1167 and the smallest 470. It took the judges an average of roughly three hours to judge
the images for a single topic. Compared to the purely visual topics from 2004 (around one hour of
judgement per topic, containing an average of 950 images), the judgement process took much longer
per image, as the semantic queries required verifying the text and often an enlarged version of the
images. The longer time might also be due to the fact that in 2004 all images were pre-marked as
irrelevant, and only relevant images required a change, whereas this year nothing was
pre-marked. Still, this process is significantly faster than most text-based relevance judgements, as a large
number of irrelevant images could be sorted out very quickly.</p>
        <p>We used a ternary judgement scheme comprising relevant, partially relevant, and non-relevant. For the official qrels, we only used images marked as relevant. We also had several topics judged by two persons, but still took only the first judgements for the evaluations. Further analysis will follow in the final conference proceedings, when more is known about the techniques used.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Participants</title>
        <p>The number of registered ImageCLEF participants has multiplied over the last three years: ImageCLEF started with 4 participants in 2003, 18 groups participated in 2004, and in 2005 there were 36 registered groups. The medical retrieval task had 12 participants in 2004, when it was purely visual, and 13 in 2005 as a mixture of visual and non-visual retrieval. A surprisingly small number of groups (13 of 28 registered) finally submitted results, which may be due to the short time span between delivery of the images and the deadline for result submission. Another factor was that several groups registered very late, as they had not heard about ImageCLEF beforehand but were still interested in the datasets for future participation; since registration for the task is free, they could simply register to get this access.</p>
        <p>The following groups registered but were finally not able to submit results for a variety of reasons:
- UNED, LSI, Valencia, Spain
- Central University, Caracas, Venezuela
- Temple University, Computer Science, USA
- Imperial College, Computing Lab, UK
- Dublin City University, Computer Science, Ireland
- CLIPS Grenoble, France
- University of Sheffield, UK
- Chinese University of Hong Kong, China</p>
        <p>Finally, 13 groups (two of them from the same laboratory in Singapore, but different groups) submitted results for the medical retrieval task, a total of 134 runs. Only 6 manual runs were submitted. A short list of the participants, with a short description of the submitted runs:
- National Chiao Tung University, Taiwan: submitted 16 runs in total, all automatic; 6 runs were visual only and 10 were mixed. They use simple visual features (colour histogram, coherence matrix, layout features) as well as text retrieval using a vector-space model with word expansion based on WordNet.
- State University of New York (SUNY), Buffalo, USA: submitted a total of 6 runs, one visual and five mixed. GIFT was used as the visual retrieval system and SMART as the textual retrieval system, mapping the text to UMLS.
- University and Hospitals of Geneva, Switzerland: submitted a total of 19 runs, all automatic, comprising two textual and two visual runs plus 15 mixed runs. The retrieval relied mainly on the GIFT and easyIR retrieval systems.
- RWTH Aachen, Computer Science, Germany: submitted 10 runs: two manual mixed retrieval, two automatic textual retrieval, three automatic visual retrieval, and three automatic mixed retrieval. The FIRE retrieval engine was used with varied visual features, together with a text search engine using English and mixed-language retrieval.
- Daedalus and Madrid University, Spain: submitted 14 runs, all automatic; 4 runs were visual only and 10 were mixed. They mainly used semantic word expansion with EuroWordNet.
- Oregon Health and Science University, Portland, OR, USA: submitted three runs in total: two manual runs, one visual and one textual, and one automatic textual run. GIFT and Lucene were used as retrieval engines.
- University of Jaen, Spain: submitted a total of 42 runs, all automatic; 6 runs were textual only and 36 were mixed. GIFT is used as the visual query system, and the LEMUR system is used for text in a variety of configurations to achieve multilingual retrieval.
- Institute for Infocomm Research, Singapore: submitted 7 runs, all automatic visual runs. For their runs they first manually selected visually similar images to train the features, so these should rather be classified as manual runs; they then use a two-step approach for visual retrieval.
- Institute for Infocomm Research (second group), Singapore: submitted a total of 3 visual runs, one automatic and two manual. The main technique applied is the connection of medical terms and concepts to visual appearances.
- RWTH Aachen, Medical Informatics, Germany: submitted two visual-only runs using several visual features and classification methods of the IRMA project.
- CEA, France: submitted five runs, all automatic, two visual only and three mixed. The techniques used include the PIRIA visual retrieval system and a simple frequency-based text retrieval system.
- IPAL CNRS/I2R, France/Singapore: submitted a total of 6 automatic runs, two text only and the others a combination of textual and visual features. For textual retrieval they map the text onto single axes of the MeSH ontology. They also use negative-weight query expansion and mix visual and textual results for optimal results.
- University of Concordia, Canada: submitted one visual run querying only with the first image of every topic, using only visual features. The technique applied is an association model between low-level visual features and high-level concepts, relying mainly on texture, edge, and shape features.</p>
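<p>Several of the textual runs above are based on the vector-space model. The following toy sketch illustrates tf-idf weighting with cosine similarity; the corpus, tokenisation, and weighting details are illustrative assumptions, not any group's actual system.</p>

```python
import math
from collections import Counter

# Minimal tf-idf vector-space retrieval: weight terms by term frequency
# times inverse document frequency, rank documents by cosine similarity.

def tfidf_vectors(docs):
    tfs = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()                      # document frequency per term
    for tf in tfs:
        df.update(tf.keys())
    n = len(docs)
    vecs = [{t: tf[t] * math.log(n / df[t]) for t in tf} for tf in tfs]
    return vecs, df, n

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["chest ct emphysema", "hand x ray fracture", "chest x ray"]
vecs, df, n = tfidf_vectors(docs)
query = Counter("ct of the chest".split())
qvec = {t: query[t] * math.log(n / df[t]) for t in query if t in df}
best = max(range(len(docs)), key=lambda i: cosine(qvec, vecs[i]))
print(docs[best])  # chest ct emphysema
```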
        <p>Table 5 gives an overview of the submitted runs, including the query dimensions. This section gives an overview of the best results in the various categories and also provides some more in-depth analysis on a per-topic basis. More will follow once the participants have submitted their papers.</p>
        <p>Table 6 shows all the manual runs that were submitted, classified by the technique used for the retrieval.</p>
        <p>Table 7 shows the best five results for textual retrieval only and the best ten results for visual and for mixed retrieval.</p>
        <p>Looking at single topics, it becomes clear that system performance varies extremely across topics. If we calculated the average over the best system for each query, we would be much closer to 0.5 than to what the best system actually achieved, 0.2821. So far, none of the systems optimised the feature selection based on the query input.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Discussion</title>
        <p>The results show a few clear trends. Very few groups made manual submissions using relevance judgements, most likely due to the resources needed for such evaluations. Still, relevance feedback has been shown to be extremely useful in many retrieval tasks, and its evaluation seems very necessary as well. Surprisingly, in the submitted results relevance feedback does not appear to perform much better than the automatic runs, whereas in the 2004 tasks the relevance feedback runs were often significantly better than those without feedback.</p>
        <p>It also becomes clear that the topics developed were geared much more towards textual than visual retrieval. The best results for textual retrieval are much higher than for visual retrieval only, and a few of the poor textual runs seem simply to have had indexing problems. When analysing the topics in more detail, a clear division emerges between the visual and textual topics, but some of the topics marked as visual actually produced better results with a textual system. Some systems perform extremely well on a few topics but extremely badly on others; no system is the best system for more than two of the topics.</p>
        <p>The best results were clearly obtained when combining textual and visual features, most likely because there were queries for which either one of the feature sets would work well.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Automatic Annotation Task</title>
      <sec id="sec-4-1">
        <title>Introduction, Idea, and Objectives</title>
        <p>
          Automatic image annotation is a classification task in which an image is assigned to its corresponding class from a given set of pre-defined classes. As such, it is an important step for content-based image retrieval (CBIR) and data mining [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The aim of the Automatic Annotation Task in ImageCLEFmed 2005 was to compare state-of-the-art approaches to automatic image annotation and to quantify their improvements for image retrieval. In particular, the task aims at finding out how well current techniques for image content analysis can identify the medical image modality, body orientation, body region, and the biological system examined. Such an automatic classification can be used for multilingual image annotation as well as for annotation verification, e.g., to detect false information held in the header streams according to the Digital Imaging and Communications in Medicine (DICOM) standard [16].
        </p>
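<p>Annotation verification of the kind described above can be sketched as a comparison between a content-based prediction and the header entry. The classifier stub and the field names below are hypothetical; real DICOM parsing is omitted.</p>

```python
# Sketch of header verification: compare the modality predicted from the
# image content with the modality recorded in a (possibly wrong) header.
# `predict_modality` stands in for a trained image classifier.

def predict_modality(image):
    # hypothetical content-based classifier; here a trivial threshold stub
    return "CR" if image["mean_intensity"] > 100 else "CT"

def verify_header(image, header):
    predicted = predict_modality(image)
    recorded = header.get("Modality")
    return predicted == recorded, predicted, recorded

image = {"mean_intensity": 130}
ok, pred, rec = verify_header(image, {"Modality": "CT"})
print(ok, pred, rec)  # flags a mismatch: predicted CR vs recorded CT
```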
        <p>The database consisted of 9,000 fully classified radiographs taken randomly from medical routine at the Aachen University Hospital. 1,000 additional radiographs, for which classification labels were unavailable to the participants, had to be classified into one of the 57 classes the 9,000 database images come from. Although only 57 simple class numbers were provided for ImageCLEFmed 2005, the images are annotated with the complete IRMA code, a multi-axial code for image annotation. The code is currently available in English and German. It is planned to use the results of such automatic image annotation tasks for further, textual image retrieval tasks in the future.</p>
        <p>Example images together with their class numbers are given in Figure 4. Table 8 gives the English textual description of each class.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Participating Groups</title>
        <p>In total, 26 groups registered for participation in the automatic annotation task. All groups downloaded the data, but only 12 groups submitted runs. Each group had at least two different submissions; the maximum number of submissions per group was 7. In total, 41 runs were submitted, which are briefly described in the following.</p>
        <p>CEA: CEA from France submitted three runs. In each run, different feature vectors were used and classified using a k-nearest-neighbour classifier (k was either 3 or 9). In the run labelled cea/pj-3.txt, the images were projected along the horizontal and vertical axes to obtain a feature histogram. For cea/tlep-9.txt, histograms of local edge patterns and colour features were created, and for cea/cime-9.txt, quantised colours were used.</p>
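<p>The projection features of the cea/pj-* runs can be sketched as follows: pixel values are summed along the horizontal and vertical axes, and the resulting vector is classified with k-nearest neighbours. The toy images and labels below are assumptions for illustration only, not IRMA data.</p>

```python
# Projection features: sum pixel values along rows and columns, then
# classify the concatenated vector with a k-nearest-neighbour vote.

def projections(img):
    rows = [sum(r) for r in img]           # horizontal projection
    cols = [sum(c) for c in zip(*img)]     # vertical projection
    return rows + cols

def knn(train, query_feat, k=3):
    def dist(f):
        return sum((a - b) ** 2 for a, b in zip(f, query_feat))
    nearest = sorted(train, key=lambda ex: dist(ex[0]))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)   # majority vote

horiz = [[1, 1], [0, 0]]   # bright top row
vert = [[1, 0], [1, 0]]    # bright left column
train = [(projections(horiz), "top"), (projections([[2, 2], [0, 0]]), "top"),
         (projections(vert), "left"), (projections([[2, 0], [2, 0]]), "left")]
print(knn(train, projections([[1, 1], [0, 1]]), k=3))  # top
```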
        <p>CINDI: The CINDI group from Concordia University in Montreal, Canada used multi-class SVMs (one-vs-one) and a 170-dimensional feature vector consisting of colour moments, colour histograms, co-occurrence texture features, shape moments, and edge histograms.</p>
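<p>The one-vs-one multi-class scheme mentioned above trains one binary classifier per pair of classes and classifies by majority vote over the pairwise decisions. In this sketch the pairwise classifiers are trivial nearest-class-mean stubs rather than SVMs; only the voting mechanics follow the description.</p>

```python
from itertools import combinations

# One-vs-one multi-class decomposition: one binary decision per class
# pair, final label by majority vote. The pairwise "classifiers" here
# are nearest-class-mean stubs on a 1-D feature, not trained SVMs.

def train_class_means(examples):
    by_class = {}
    for feat, label in examples:
        by_class.setdefault(label, []).append(feat)
    return {c: sum(fs) / len(fs) for c, fs in by_class.items()}

def predict(means, x):
    votes = {}
    for a, b in combinations(sorted(means), 2):
        winner = a if abs(x - means[a]) < abs(x - means[b]) else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

examples = [(1.0, "A"), (1.2, "A"), (5.0, "B"), (5.2, "B"), (9.0, "C")]
means = train_class_means(examples)
print(predict(means, 5.4))  # B
```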
        <p>Geneva: The medGIFT group from Geneva, Switzerland used various different settings for grey levels and Gabor filters in their medGIFT image retrieval system.</p>
        <p>Infocomm: The group from the Infocomm Institute, Singapore used three kinds of 16x16 low-resolution map features: initial grey values, anisotropy, and contrast. To avoid over-fitting, a separate training set was selected for each of the 57 classes, and about 6,800 training images were chosen out of the given 9,000 images. Support vector machines with RBF (radial basis function) kernels were applied to train the classifiers, which were then employed to classify the test images.
Miracle: The Miracle group from UPM Madrid, Spain uses GIFT and a decision-table majority classifier to calculate the relevance of each individual result in miracle/mira20relp57.txt. In mira20relp58IB8.txt, a k-nearest-neighbour classifier with k = 8 and attribute normalisation is additionally used.</p>
        <p>Montreal: The group from the University of Montreal, Canada submitted 7 runs, which differ in the features used. They estimated which classes are best represented by which features and combined the appropriate features.
mtholyoke: For the submission from Mount Holyoke College, MA, USA, Gabor energy features were extracted from the images, and two different cross-media relevance models were used to classify the data.
nctu-dblab: The NCTU-DBLAB group from National Chiao Tung University, Taiwan used a support vector machine (SVM) to learn image feature characteristics. Based on the SVM model, several image features were used to predict the class of the test images.
ntu: The group from National Taiwan University used mean grey values of blocks as features and different classifiers for their submissions.
rwth-i6: The Human Language Technology and Pattern Recognition group from RWTH Aachen University, Germany had two submissions. One used a simple zero-order image distortion model taking local context into account. The other used a maximum entropy classifier and histograms of patches as features.
rwth-mi: The IRMA group from Aachen, Germany used features proposed by Tamura et al. to capture global texture properties, together with two distance measures for down-scaled representations, which preserve spatial information and are robust with respect to global transformations such as translation, intensity variations, and local deformations. The weighting parameters for combining the single classifiers were guessed for the first submission and trained on a random 8,000/1,000 partitioning of the training set for the second submission.
ulg.ac.be: The ULg method is based on random sub-windows and decision trees. During the training phase, a large number of multi-size sub-windows are randomly extracted from the training images. Then a decision tree model is automatically built (using Extra-Trees and/or tree boosting), based on size-normalised versions of the sub-windows and operating directly on their pixel values. Classification of a new image similarly entails the random extraction of sub-windows, the application of the model to these, and the aggregation of the sub-window predictions.
The error rates range between 12.6 % and 73.3 % (Table 9). Based on the training data, a system guessing the most frequent class for all 1,000 test images would achieve a 70.3 % error rate, since 297 radiographs of the test set were from class 12 (Table 10). A more realistic baseline of 36.8 % error rate is computed from a 1-nearest-neighbour classifier comparing down-scaled 32x32 versions of the images using the Euclidean distance.</p>
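<p>The two baselines above can be reproduced in a few lines: always guessing the most frequent class (297 of the 1,000 test images are from class 12, hence a 70.3 % error rate), and 1-nearest-neighbour matching with the Euclidean distance on down-scaled images. The tiny two-pixel "images" below are stand-ins for the 32x32 thumbnails.</p>

```python
import math

# Baseline 1: always guess the most frequent class.
def most_frequent_baseline(test_labels, guess):
    errors = sum(1 for lab in test_labels if lab != guess)
    return errors / len(test_labels)

# Baseline 2: 1-nearest-neighbour on flattened pixel vectors.
def one_nn(train, query):
    return min(train, key=lambda ex: math.dist(ex[0], query))[1]

# 297 of 1,000 test images are from class 12 -> 70.3 % error when
# always guessing class 12.
labels = [12] * 297 + [0] * 703
print(round(most_frequent_baseline(labels, 12), 3))  # 0.703

train = [([0.0, 0.0], "hand"), ([1.0, 1.0], "chest")]
print(one_nn(train, [0.9, 0.8]))  # chest
```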
        <p>For each class, Table 10 gives a more detailed analysis, including the number of training and test images as well as, with respect to all 41 submitted runs, the average classification accuracy, the class most frequently confused with it, and the average percentage, over all submitted runs, of images assigned to that class. Obviously, the difficulty of the 57 classes varies. The average classification accuracy ranges from 6.3 % to 90.7 %, and there is a tendency for classes with fewer training images to be more difficult. For instance, for class 32, 78 images were contained in the training data but only one image in the test data. In 23 runs, this test image was misclassified (43.9 %); five times it was labelled as class 25 (12.2 %). It can also be seen that many images of classes 7 and 8 were classified as class 6.
Similar experiments have been described in the literature, although previous experiments have been restricted to a small number of categories. For instance, several algorithms have been proposed for orientation detection of chest radiographs, where lateral and frontal orientation are distinguished by means of image content analysis [18,19]. For this two-class experiment, the error rates are below 1 % [20]. In a recent investigation, Pinhas and Greenspan report error rates below 1 % for automatic categorisation of 851 medical images into 8 classes [21]. In previous investigations of the IRMA group, error rates between 5.3 % and 15 % were reported for experiments with 1,617 images in 6 classes [22] and 6,231 images in 81 classes [23], respectively. Hence, error rates around 12 % for 10,000 images in 57 classes are plausible.</p>
        <p>As mentioned before, classes 6, 7, and 8 were frequently confused. All show parts of the arms and thus look extremely similar (Fig. 4). A reason for the common misclassification in favour of class 6 might be that there are five times more training images from class 6 than from classes 7 and 8 together.</p>
        <p>Given the confidence files from all runs, classifier combination was tested using the sum and the product rule, first combining the two best confidence files, then the three best, and so forth. Unfortunately, the best result was 12.9 %; thus, no improvement over the best single submission was possible using simple classifier combination techniques.</p>
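<p>The sum- and product-rule combination described above can be sketched as follows; the per-class confidence values are toy numbers, not the submitted confidence files.</p>

```python
# Combine per-class confidences from several classifiers with the sum
# rule (add confidences) or the product rule (multiply them), then pick
# the class with the highest fused score.

def combine(conf_lists, rule="sum"):
    classes = conf_lists[0].keys()
    if rule == "sum":
        fused = {c: sum(conf[c] for conf in conf_lists) for c in classes}
    else:  # product rule
        fused = {}
        for c in classes:
            p = 1.0
            for conf in conf_lists:
                p *= conf[c]
            fused[c] = p
    return max(fused, key=fused.get)

run1 = {"6": 0.5, "7": 0.3, "8": 0.2}
run2 = {"6": 0.1, "7": 0.6, "8": 0.3}
print(combine([run1, run2], "sum"), combine([run1, run2], "product"))  # 7 7
```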
        <p>With some results close to a 10 % error rate, classification and annotation of images might open interesting vistas for CBIR systems. Although the task considered here is more restricted than the medical retrieval task and can thus be considered easier, the techniques applied here will most probably also be usable in future CBIR applications. It is therefore planned to use the results of such automatic image annotation tasks for further, textual image retrieval tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>ImageCLEF has continued to attract researchers from a variety of global communities interested in image retrieval using both low-level image features and associated texts. This year we have improved the ad-hoc medical retrieval task by enlarging the image collection and creating more semantic
queries based on the realistic information needs of medical professionals. The ad-hoc task has continued to attract interest, and this year has seen an increase in the number of translated topics and topics with translated narratives. The addition of the IRMA annotation task has provided a further challenge on the medical side of ImageCLEF and proven a popular task for participants, covering mainly the visual retrieval community. The user-centered retrieval task, however, still has low participation, mainly due to the high level of resources required to run an interactive task. We will continue to improve the tasks for ImageCLEF 2006, mainly based on feedback from participants.</p>
      <p>A large number of participants registered but finally did not submit results. This shows that the resources are very valuable, and that access to them is already a reason to register. Still, only if participants submit results with different techniques is there really a possibility to compare retrieval systems and develop better retrieval for the future. For 2006 we therefore hope to receive plenty of feedback on the tasks, and many people who register, submit results, and participate in the CLEF workshop to discuss the presented techniques.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work has been funded in part by the EU Sixth Framework Programme (FP6) within the Bricks
project (IST contract number 507457) as well as the SemanticMining project (IST NoE 507505).
The establishment of the IRMA database was funded by the German Research Foundation (DFG) under grant Le 1108/4. We also acknowledge the generous support of National Science Foundation (NSF) grant ITR-0325160.</p>
      <p>categorisation of medical images for content-based retrieval and data mining. Computerized Medical Imaging and Graphics 2005; 29(2): 143-155.
16. Guld MO, Kohnen M, Keysers D, Schubert H, Wein B, Bredno J, Lehmann TM. Quality of DICOM header information for image categorization. Procs SPIE 2002; 4685: 280-287.
17. Lehmann TM, Schubert H, Keysers D, Kohnen M, Wein BB. The IRMA code for unique classification of medical images. Procs SPIE 2003; 5033: 440-451.
18. 185-189.
19. Boone JM, Seshagiri S, Steiner RM. Recognition of chest radiograph orientation for picture archiving and communications systems display using neural networks. Journal of Digital Imaging 1992; 5(3): 190-193.
20. Lehmann TM, Guld MO, Keysers D, Schubert H, Kohnen M, Wein BB. Determining the view position of chest radiographs. Journal of Digital Imaging 2003; 16(3): 280-291.
21. Pinhas A, Greenspan H. A continuous and probabilistic framework for medical image representation and categorization. Procs SPIE 2003; 5371: 230-238.
22. Keysers D, Gollan C, Ney H. Classification of medical images using non-linear distortion models. Procs Bildverarbeitung für die Medizin 2004: 366-370.
23. Guld MO, Keysers D, Leisten M, Schubert H, Lehmann TM. Comparison of global features for categorization of medical images. Procs SPIE 2004; 5371: 211-222.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Clough</surname>
          </string-name>
          , P.D. and
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2003</year>
          ),
          <article-title>The CLEF 2003 cross language image retrieval track</article-title>
          ,
          <source>In Proceedings of Cross Language Evaluation Forum (CLEF) 2003 Workshop</source>
          , Trondheim, Norway.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Muller, H. and
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>The CLEF 2004 Cross Language Image Retrieval Track, In Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign</article-title>
          , Eds (Peters,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            and
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          ),
          <source>Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany,
          <year>2005</year>
          , Volume
          <volume>3491</volume>
          /
          <year>2005</year>
          ,
          <fpage>597</fpage>
          -
          <lpage>613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. I. J.
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Omohundro</surname>
            , and
            <given-names>P. N.</given-names>
          </string-name>
          <string-name>
            <surname>Yianilos</surname>
          </string-name>
          . Pichunter:
          <article-title>Bayesian relevance feedback for image retrieval</article-title>
          .
          <source>Proceedings of the 13th International Conference on Pattern Recognition</source>
          ,
          <volume>3</volume>
          :
          <fpage>361</fpage>
          -
          <fpage>369</fpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Grubinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leung</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Clough</surname>
          </string-name>
          , P.D.
          <article-title>Towards a Topic Complexity Measure for Cross-Language Image Retrieval</article-title>
          ,
          <source>In Proceedings of Cross Language Evaluation Forum (CLEF) 2005 Workshop</source>
          , Vienna, Austria
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Petrelli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Clough</surname>
          </string-name>
          , P.D.
          <article-title>Concept Hierarchy across Languages in Text-Based Image Retrieval: A User Evaluation</article-title>
          ,
          <source>In Proceedings of Cross Language Evaluation Forum (CLEF) 2005 Workshop</source>
          , Vienna, Austria
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Villena-Roman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crespo-García</surname>
          </string-name>
          , R.M., and
          <string-name>
            <surname>Gonzalez-Cristobal</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          <article-title>Boolean Operators in Interactive Search</article-title>
          ,
          <source>In Proceedings of Cross Language Evaluation Forum (CLEF) 2005 Workshop</source>
          , Vienna, Austria
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Candler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Uijtdehaage</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Dennis</surname>
          </string-name>
          . Introducing HEAL
          :
          <article-title>The health education assets library</article-title>
          .
          <source>Academic Medicine</source>
          ,
          <volume>78</volume>
          (
          <issue>3</issue>
          ):
          <volume>249</volume>
          -
          <fpage>253</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>K.</given-names>
            <surname>Glatz-Krieger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Glatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gysel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dittler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Mihatsch</surname>
          </string-name>
          .
          <article-title>Webbasierte Lernwerkzeuge für die Pathologie (web-based learning tools for pathology)</article-title>
          .
          <source>Pathologe</source>
          ,
          <volume>24</volume>
          :
          <fpage>394</fpage>
          -
          <fpage>399</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>W.</given-names>
            <surname>Hersh</surname>
          </string-name>
          , H. Muller, P. Gorman, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Jensen</surname>
          </string-name>
          .
          <article-title>Task analysis for evaluating image retrieval systems in the ImageCLEF biomedical image retrieval task</article-title>
          .
          <source>In Slice of Life conference on Multimedia in Medical Education (SOL</source>
          <year>2005</year>
          ), Portland,
          <string-name>
            <surname>OR</surname>
          </string-name>
          , USA,
          <year>June 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Hickam</surname>
          </string-name>
          .
          <article-title>How well do physicians use electronic information retrieval systems?</article-title>
          <source>Journal of the American Medical Association</source>
          ,
          <volume>280</volume>
          (
          <issue>15</issue>
          ):
          <volume>1347</volume>
          -
          <fpage>1352</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Markkula</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Sormunen</surname>
          </string-name>
          .
          <article-title>Searching for photos – journalists' practices in pictorial IR</article-title>
          . In J. P. Eakins,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Harper</surname>
          </string-name>
          , and J. Jose, editors,
          <source>The Challenge of Image Retrieval, A Workshop and Symposium on Image Retrieval</source>
          , Electronic Workshops in Computing,
          <source>Newcastle upon Tyne</source>
          ,
          5–6
          <year>February 1998</year>
          .
          The British Computer Society
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. H. Muller,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Vallée</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Terrier</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Geissbuhler</surname>
          </string-name>
          .
          <article-title>A reference data set for the evaluation of medical image retrieval systems</article-title>
          .
          <source>Computerized Medical Imaging and Graphics</source>
          ,
          <volume>28</volume>
          :
          <fpage>295</fpage>
          –
          <lpage>305</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosset</surname>
          </string-name>
          , H. Muller, M. Martins,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dfouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Vallée</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Ratib</surname>
          </string-name>
          .
          <article-title>Casimage project – a digital teaching files authoring environment</article-title>
          .
          <source>Journal of Thoracic Imaging</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Vreeland</surname>
          </string-name>
          .
          <article-title>An internet-based nuclear medicine teaching file</article-title>
          .
          <source>Journal of Nuclear Medicine</source>
          ,
          <volume>36</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1520</fpage>
          –
          <lpage>1527</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lehmann</surname>
            <given-names>TM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Güld</surname>
            <given-names>MO</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deselaers</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keysers</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schubert</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spitzer</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ney</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wein</surname>
            <given-names>BB</given-names>
          </string-name>
          .
          <article-title>Automatic …</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Pietka</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>HK</given-names>
          </string-name>
          .
          <article-title>Orientation correction for chest images</article-title>
          .
          <source>Journal of Digital Imaging</source>
          <year>1992</year>
          ;
          <volume>5</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
    <sec>
      <title>Fig. 5</title>
      <p>Example images given to participants for the ad-hoc retrieval task (1 of 2 images).</p>
    </sec>
  </back>
</article>