MIRACLE at ImageCLEFmed 2007: Merging Textual and Visual Strategies to Improve Medical Image Retrieval

Julio Villena-Román (1,3), Sara Lana-Serrano (2,3), José Carlos González-Cristóbal (2,3)
(1) Universidad Carlos III de Madrid
(2) Universidad Politécnica de Madrid
(3) DAEDALUS - Data, Decisions and Language, S.A.
jvillena@it.uc3m.es, slana@diatel.upm.es, josecarlos.gonzalez@upm.es

Abstract

This paper describes the participation of the MIRACLE research consortium in the ImageCLEF Medical Image Retrieval task of ImageCLEF 2007. For this campaign, our challenge was to investigate different merging strategies, i.e. methods for combining textual and visual retrieval techniques. We focused on performing all possible combinations of well-known textual and visual techniques in order to find which ones offer the best results in terms of MAP, and on analyzing whether the combined results improve on the individual ones. Our system consists of three different modules: the textual (text-based) retrieval module, which indexes the case descriptions to find those that are most relevant to the text of the topic; the visual (content-based) retrieval component, which provides the list of case images that are most similar to the topic images; and, finally, the merging module, which offers different operators (AND, OR, LEFT, RIGHT) and metrics (max, min, avg, max-min) to combine and rerank the outputs of the two previous subsystems. These modules are built up from a set of basic components organized in four categories: (i) resources and tools for both general-domain and medical-specific vocabulary analysis, (ii) linguistic tools for text-based information retrieval, (iii) tools for image analysis and retrieval, and (iv) ad-hoc tools for result merging and reranking. We finally submitted 50 runs. The highest MAP was obtained with the baseline text-based experiment in English, where only stemming plus stopword removal is performed. Neither tagging with UMLS medical concepts nor merging of textual and visual results proved to be of value for improving precision with respect to the baseline experiment. However, the most interesting conclusion is that experiments using the OR operator obtain higher MAP values than those using the AND operator.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.2 Information Storage; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries. H.2 [Database Management]: H.2.5 Heterogeneous Databases; E.2 [Data Storage Representations].

Keywords

Image retrieval, domain-specific vocabulary, thesaurus, linguistic engineering, information retrieval, indexing.

1. Introduction

The MIRACLE team is a research consortium formed by research groups of three different universities in Madrid (Universidad Politécnica de Madrid, Universidad Autónoma de Madrid and Universidad Carlos III de Madrid) along with DAEDALUS, a small/medium size enterprise (SME) founded in 1998 as a spin-off of two of these groups and a leading company in the field of linguistic technologies in Spain. MIRACLE has taken part in CLEF since 2003 in many different tracks and tasks, including the main bilingual, monolingual and cross-lingual tasks as well as the ImageCLEF [7][8], Question Answering, WebCLEF and GeoCLEF tracks. This paper describes our participation in the ImageCLEFmed task of ImageCLEF 2007.
The goal of this task (fully described in [9]) is to improve the retrieval of medical images from heterogeneous and multilingual document collections containing images as well as text. The task organizers provide a list of topic statements (a short textual description explaining the research goal) in English, French and German, along with a set of images (from one to three) for each topic. The objective is to retrieve as many relevant images as possible for the given visual and multilingual topics. ImageCLEFmed 2007 extends the experiments of past editions with a larger database and even more complex queries.

Although this task certainly requires the use of image retrieval techniques and our areas of expertise do not include image analysis research, we take part to promote and encourage multidisciplinary participation in all aspects of information retrieval, no matter whether it is text-based or content-based. All experiments are fully automatic, with no manual intervention. We submitted runs using only text (text-based retrieval), only visual features (content-based retrieval), and mixed runs using a combination of both.

2. System Description

Our system is logically built up from three different modules: the textual (text-based) retrieval module, which indexes case descriptions in order to find the ones most relevant to the text of the topic; the visual (content-based) retrieval component, which provides the list of case images that are most similar to the topic images; and, finally, the result combination module, which uses different operators to combine the results of the two previous subsystems. Figure 1 gives an overview of the system architecture.

Figure 1. Overview of the system.

2.1. Textual Retrieval

The system consists of a set of different basic components organized in two categories:
• Resources and tools for medical-specific vocabulary analysis
• Linguistic tools for textual analysis and retrieval

Instead of using raw terms, the textual information of both topics and documents is parsed and tagged to unify all terms into concepts of medical entities. This is similar to a stemming or lemma extraction process, but the output, instead of the stem or lemma, is the medical entity to which the term relates. The consequence of this process is that concept identifiers [5] are used instead of terms in the text-based information retrieval process. For this purpose, a terminological dictionary was created from a subset of the Unified Medical Language System (UMLS) metathesaurus (US National Library of Medicine) [12], incorporating terms in English, Spanish, French and German (the four languages involved in the ImageCLEFmed task [9]). This dictionary contains 4,327,255 entries matching 1,215,749 medical concepts. Table 1 shows the language distribution of the terms (the same as in the UMLS metathesaurus).

Table 1. Language distribution of terms.
Lang | #Terms
EN | 3,207,890
ES | 1,116,086
FR | 2,556
DE | 723

For example:

Tagged topic (M7):
  Pathology [non hodgkins lymphoma] UML_C0024305

Pertinent tagged document (PathoPic/000041_en):
  Primary [Non Hodgkin's lymphoma] UML_C0024305 [lymphoma of the heart] UML_C1332850 41
  [NHL] UML_C0024305 UML_C0079745 UML_C1705385

Pertinent tagged document (PathoPic/000689_en):
  [chronic lymphatic leukemia] UML_C0023434 UML_C0023458 689
  [CLL] UML_C0023434 UML_C0023458
  [NHL] UML_C0024305 UML_C0079745 UML_C1705385
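To illustrate the tagging step, the following is a minimal sketch of a dictionary-based tagger. The greedy longest-match policy, the in-memory dictionary layout and the MAX_TERM_WORDS limit are our own illustrative assumptions (the actual module works over the UMLS-derived dictionary described above); the two entries shown are taken from the example.

# Minimal sketch of the medical-vocabulary tagging step (assumptions noted above).
TERM_TO_CONCEPTS = {
    "non hodgkins lymphoma": ["UML_C0024305"],
    "nhl": ["UML_C0024305", "UML_C0079745", "UML_C1705385"],
    # ... roughly 4.3 million entries in the real dictionary
}

MAX_TERM_WORDS = 5  # longest multi-word term considered (assumption)

def tag_concepts(text):
    """Replace recognized medical terms by their concept identifiers."""
    words = text.lower().split()
    output, i = [], 0
    while i < len(words):
        match = None
        # try the longest candidate term first (greedy longest match)
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in TERM_TO_CONCEPTS:
                match = (n, TERM_TO_CONCEPTS[candidate])
                break
        if match:
            n, concepts = match
            output.extend(concepts)   # index concept identifiers instead of the raw words
            i += n
        else:
            output.append(words[i])
            i += 1
    return output

# tag_concepts("Pathology non hodgkins lymphoma")  ->  ['pathology', 'UML_C0024305']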
The baseline approach to process the document collection consists of the following steps, executed in sequence (an illustrative sketch of steps 3 to 6 is given at the end of this subsection):

1. Text Extraction: Ad-hoc scripts are run on the files that contain information about the medical cases in order to extract the annotations and metadata enclosed between XML tags. Table 2 shows the metadata considered for each collection.

Table 2. Metadata extracted from XML annotation files.
Collection | Lang | Metadata
CASImage | FR | Description, Diagnosis, Clinical Presentation, Keywords, Anatomy, Chapter, Title, Age
Endoscopy | EN | Title, Subject, Description
myPACS | EN | Title, Abstract, Keywords, Text-Caption, Discussion, Document-Type, Pathology, Anatomy, Pt-Sex, Months, Years, Days
PathoPic | DE | Diagnose, Synonyme, Beschreibung, Zusatzbefund, Klinik, Kommentar
PathoPic | EN | Diagnosis, Synonyms, Description, AddtlFindings, ClinicalFindings, Comment
Peir | EN | Title, Description, RadiographType, DiseaseProcess, ClinicalHistory
MIR | EN | Diagnosis, Brief_History, Images, Full_History, Radiopharm, Findings, Discussion, Followup, Teaching

2. Medical-vocabulary Recognition: All case descriptions and topics are parsed and tagged using a subset of the Unified Medical Language System metathesaurus [12] to identify and disambiguate medical terms.

3. Tokenization: This process extracts basic text components, detecting and isolating punctuation symbols. Some basic entities are also treated, such as numbers, initials, abbreviations, and years. So far, compounds, proper nouns, acronyms and other entities are not specifically considered. The outcome of this process consists only of single words, years in numbers (e.g. 1995, 2004, etc.) and tagged entities.

4. Lowercasing: All document words are normalized by converting uppercase letters to lowercase.

5. Filtering: All words recognized as stopwords are filtered out. Stopwords in the target languages were initially obtained from [11] and afterwards extended using several other sources [2] as well as our own knowledge and resources [8].

6. Stemming: This process is applied to each of the words to be indexed or used for retrieval. Standard Porter stemmers [10] have been used.

7. Indexing and Retrieval: The information retrieval engine used for all textual indexing and retrieval tasks was Lucene [1]. No feedback or any other kind of expansion was used.

Because the textual retrieval module operates entirely on information about medical cases, the last step of the module is to obtain the images that correspond to each case (block labeled AnnotationToImage in Figure 2).
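The sketch below illustrates steps 3 to 6 only. The actual system uses the Snowball stemmers referenced in step 6 and Lucene (step 7) for indexing; here NLTK's Porter stemmer, a placeholder stopword list and a deliberately simple ASCII tokenization pattern stand in, so the code shows the logic of the pipeline rather than the exact implementation.

# Illustrative sketch of steps 3-6 (tokenization, lowercasing, stopword filtering, stemming).
import re
from nltk.stem.porter import PorterStemmer   # stand-in for the Snowball stemmers actually used

STOPWORDS = {"the", "of", "and", "a", "in", "with"}   # placeholder list
stemmer = PorterStemmer()

def preprocess(text):
    # step 3: tokenization (single words and 4-digit years; punctuation discarded)
    tokens = re.findall(r"[A-Za-z]+|\d{4}", text)
    # step 4: lowercasing
    tokens = [t.lower() for t in tokens]
    # step 5: stopword filtering
    tokens = [t for t in tokens if t not in STOPWORDS]
    # step 6: stemming
    return [stemmer.stem(t) for t in tokens]

# preprocess("Primary lymphoma of the heart, 1995")
#   -> roughly ['primari', 'lymphoma', 'heart', '1995']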
2.2. Visual Retrieval

For this part of the system, we resorted to two publicly and freely available content-based information retrieval systems: GIFT (GNU Image Finding Tool) [4] and FIRE (Flexible Image Retrieval Engine) [3][6]. Both are developed under the GNU license and allow query-by-example searches on images, using an image as the starting point for the search process and relying entirely on the image contents.

In the case of GIFT, the complete image database was indexed in a single collection, down-scaling each image to 32x32 pixels. For each ImageCLEFmed query, a visual query is built from all the images contained in the topic. This visual query is then submitted to GIFT to obtain the list of the most relevant images (i.e., the images most similar to those included in the visual query), along with the corresponding relevance values. Although different search algorithms could be integrated as plug-ins in GIFT, only the provided separate normalization algorithm has been used in our experiments.

On the other hand, we directly used the results of the FIRE system kindly provided by the organizers, with no further processing.

2.3. Merging

The textual and visual result lists are then merged by applying different techniques, each characterized by an operator and a metric for computing the relevance (score) of the results. Table 3 shows the defined operators: union (OR), intersection (AND), difference (AND NOT), and external join (LEFT JOIN, RIGHT JOIN). Each of these operators selects which images are part of the final result set.

Table 3. Combination operators.
Operator | Definition
OR | A ∪ B
AND | A ∩ B
LEFT | (A ∪ B) ∪ (A − B)
RIGHT | (A ∪ B) ∪ (B − A)

Results are then reranked by computing a new relevance value from the corresponding input results, using the metrics shown in Table 4 (where a and b denote the relevance values of an image in the two input lists).

Table 4. Score computing metrics.
Metric | Definition
max | score = max(a, b)
min | score = min(a, b)
avg | score = avg(a, b)
mm | score = max(a, b) + min(a, b) · min(a, b) / (max(a, b) + min(a, b))
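As an illustration, the following is a minimal sketch of the merging step, directly implementing the operators of Table 3 and the metrics of Table 4. The dictionary representation of result lists and the use of a zero score for images missing from one of the lists are our own assumptions; the paper does not specify these details.

# Sketch of the result-merging module (operators of Table 3, metrics of Table 4).
def combine_scores(x, y, metric):
    """Score-combination metrics of Table 4."""
    lo, hi = min(x, y), max(x, y)
    if metric == "max":
        return hi
    if metric == "min":
        return lo
    if metric == "avg":
        return (x + y) / 2.0
    if metric == "mm":  # 'max-min': max(a,b) + min(a,b) * min(a,b) / (max(a,b) + min(a,b))
        return hi + lo * lo / (hi + lo) if (hi + lo) > 0 else 0.0
    raise ValueError(metric)

def select_ids(a, b, operator):
    """Result-set selection, with the operator definitions as printed in Table 3."""
    A, B = set(a), set(b)
    if operator == "OR":
        return A | B
    if operator == "AND":
        return A & B
    if operator == "LEFT":   # (A ∪ B) ∪ (A − B); the prose suggests a LEFT-JOIN reading (all of A)
        return (A | B) | (A - B)
    if operator == "RIGHT":  # (A ∪ B) ∪ (B − A); the prose suggests a RIGHT-JOIN reading (all of B)
        return (A | B) | (B - A)
    raise ValueError(operator)

def merge(a, b, operator="OR", metric="mm"):
    """a, b: dicts mapping image id -> relevance score; returns a reranked (id, score) list."""
    ids = select_ids(a, b, operator)
    merged = {i: combine_scores(a.get(i, 0.0), b.get(i, 0.0), metric) for i in ids}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

For instance, merging the VisG and TxtENT result lists with operator="OR" and metric="mm" would correspond to the MixGENTORmm run listed in Table 7.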
3. Experiment Set

Experiments are defined by choosing different combinations of the previously described modules, operators and score computation metrics. A wide set of experiments was submitted: 8 text-based runs covering the 3 different topic languages (Table 5), 9 content-based runs built with combinations of results from GIFT and FIRE (Table 6), and 33 mixed runs built by combining textual and visual experiments (Table 7).

Table 5. Textual experiments.
Run Identifier | Language (1) | Method
TxtENN | EN>all | stem + stopwords
TxtENT | EN>all | stem + stopwords + tagged with UMLS thesaurus
TxtFRN | FR>all | stem + stopwords
TxtFRT | FR>all | stem + stopwords + tagged with UMLS thesaurus
TxtDEN | DE>all | stem + stopwords
TxtDET | DE>all | stem + stopwords + tagged with UMLS thesaurus
TxtXN | all>all | stem + stopwords
TxtXT | all>all | stem + stopwords + tagged with UMLS thesaurus
(1) [Query language] > [Annotation language]; "all" refers to the concatenation of text in all languages.

Table 6. Visual experiments.
Run Identifier | Method (1)
VisG | GIFT
VisGFANDavg | GIFT ANDavg FIRE
VisGFANDmax | GIFT ANDmax FIRE
VisGFANDmin | GIFT ANDmin FIRE
VisGFANDmm | GIFT ANDmm FIRE
VisGFORavg | GIFT ORavg FIRE
VisGFORmax | GIFT ORmax FIRE
VisGFORmin | GIFT ORmin FIRE
VisGFORmm | GIFT ORmm FIRE
(1) The merging strategy is defined by [Operator][Metric].

Table 7. Mixed textual and visual retrieval experiments.
Run Identifier | Method | Merging strategies
MixGENT[Merging] | VisG + TxtENT | ANDmax, ANDmin, ANDavg, ORmax, ORmin, ORavg, ORmm, LEFTmax, LEFTmin, LEFTmm, RIGHTmax, RIGHTmin, RIGHTmm
MixGFRT[Merging] | VisG + TxtFRT | ORmax, ORmm, LEFTmax, LEFTmm, ANDmin
MixGDET[Merging] | VisG + TxtDET | ORmax, ORmm, LEFTmax, LEFTmm, ANDmin
MixGFANDminENT[Merging] | VisGFANDmin + TxtENT | ORmax, ORmm, LEFTmax, LEFTmm, ANDmin
MixGFORmaxENT[Merging] | VisGFORmax + TxtENT | ORmax, ORmm, LEFTmax, LEFTmm, ANDmin

4. Results

Results are presented in the following tables. Each table shows the run identifier, the number of relevant documents retrieved (RelRet), the mean average precision (MAP), the R-precision, and the precision at the first 10, 30 and 100 results (P10, P30, P100).

Table 8 shows the results of the text-based experiments. The highest MAP is obtained by the baseline experiment in English, where only stemming plus stopword removal is performed. Surprisingly for us, tagging with the UMLS thesaurus proved to be of no use compared to the simplest strategy. This issue has to be further investigated to rule out a problem in the generation of the result sets.

Table 8. Results for textual experiments.
Run | RelRet | MAP | R-prec | P10 | P30 | P100
TxtENN | 2,294 | 0.3518 | 0.389 | 0.58 | 0.4556 | 0.36
TxtXN | 2,252 | 0.299 | 0.354 | 0.4067 | 0.3756 | 0.2943
TxtENT | 2,002 | 0.274 | 0.2876 | 0.45 | 0.3822 | 0.2697
TxtXT | 1,739 | 0.2005 | 0.2118 | 0.3267 | 0.2889 | 0.2263
TxtFRN | 898 | 0.1107 | 0.1429 | 0.2733 | 0.1989 | 0.133
TxtFRT | 970 | 0.1082 | 0.1138 | 0.2533 | 0.1911 | 0.1297
TxtDET | 694 | 0.0991 | 0.0991 | 0.23 | 0.1222 | 0.0837
TxtDEN | 724 | 0.0932 | 0.1096 | 0.18 | 0.1356 | 0.097

Experiments using French and German achieve very low precision (a decrease to 31% and 28% of the English figure, respectively). This result is similar to other experiments carried out in other CLEF tracks and may be attributed to deficient stemming modules.

The evaluation of the content-based experiments is shown in Table 9.

Table 9. Results for visual experiments (1).
Run | RelRet | MAP | R-prec | P10 | P30 | P100
VisG | 532 | 0.0186 | 0.0396 | 0.0833 | 0.0833 | 0.047
VisGFANDmm | 165 | 0.0102 | 0.0255 | 0.0667 | 0.05 | 0.0347
VisGFANDmax | 165 | 0.0099 | 0.0251 | 0.06 | 0.0511 | 0.0343
VisGFANDavg | 165 | 0.0087 | 0.0214 | 0.0467 | 0.0556 | 0.0343
VisGFANDmin | 165 | 0.0081 | 0.0225 | 0.0367 | 0.0478 | 0.0333
(1) Evaluations for some experiments with the OR operator are missing.

In general, MAP values are very low, which reflects the complexity and difficulty of visual-only retrieval for this task. The best value (5% of the top-ranked textual experiment) is obtained by the baseline visual experiment, which just uses GIFT. However, probably due to an oversight by the task organizers, the evaluations for the experiments with the OR operator (4 runs) are missing from the Excel files provided. Thus, no definitive conclusion can be drawn about the usefulness of any merging strategy, as the restrictive AND operator filters out many images (165 relevant images retrieved instead of 532).

Finally, Table 10 shows the evaluation of the mixed runs. Although the MAP of the best-ranked mixed experiment is lower than the MAP of the best textual one (77% of it), we cannot conclude that combining textual and visual results with any kind of merging strategy fails to improve precision because, as before, some experiments with the OR operator (11 runs) are missing from the table; thus, it is impossible to draw any valuable conclusion on this issue. However, observe that the best-ranked runs are those with the RIGHT operator, which implicitly includes an OR (see the definition in Table 3). In addition, the use of this operator (visual RIGHT textual) shows that textual results are preferred over visual results (RIGHT prioritizes the second result list).

Another conclusion that can be drawn from these results is that textual retrieval is the best strategy for this task. We think that this is because many queries include semantic aspects, such as medical diagnoses or specific details present in the image, which a purely visual retrieval cannot tackle. This issue will be considered for future participations.

The best experiment at ImageCLEFmed 2007 reaches a MAP value of 0.3962, about 13% higher than our best run (0.3518). Despite this difference, the MIRACLE participation is ranked 3rd out of more than 12 groups, which we consider a very good position.

Table 10. Results for mixed textual and visual retrieval experiments (1).
Run | RelRet | MAP | R-prec | P10 | P30 | P100
MixGENTRIGHTmin | 2002 | 0.274 | 0.2876 | 0.45 | 0.3822 | 0.2697
MixGENTRIGHTmax | 2045 | 0.2502 | 0.2821 | 0.3767 | 0.35 | 0.29
MixGENTRIGHTmm | 2045 | 0.2486 | 0.2817 | 0.3733 | 0.3578 | 0.289
MixGFANDminENTORmm | 1972 | 0.1427 | 0.1439 | 0.22 | 0.2 | 0.1793
MixGFANDminENTORmax | 1972 | 0.1419 | 0.1424 | 0.2067 | 0.1911 | 0.177
MixGFRTORmm | 697 | 0.0372 | 0.064 | 0.1433 | 0.1244 | 0.084
MixGFRTORmax | 693 | 0.0322 | 0.0611 | 0.14 | 0.1233 | 0.0747
MixGENTLEFTmm | 532 | 0.0279 | 0.0485 | 0.12 | 0.0944 | 0.0643
MixGDETLEFTmm | 532 | 0.024 | 0.043 | 0.1 | 0.09 | 0.0577
MixGFRTLEFTmm | 532 | 0.0236 | 0.0416 | 0.09 | 0.0889 | 0.058
MixGENTANDavg | 162 | 0.0234 | 0.0341 | 0.17 | 0.1056 | 0.047
MixGENTANDmin | 162 | 0.0229 | 0.0341 | 0.17 | 0.1056 | 0.047
MixGDETANDmin | 247 | 0.0213 | 0.0415 | 0.12 | 0.0989 | 0.0447
MixGFRTANDmin | 176 | 0.0209 | 0.037 | 0.1167 | 0.1044 | 0.0487
MixGFRTLEFTmax | 532 | 0.0191 | 0.0398 | 0.0833 | 0.0856 | 0.0487
MixGDETLEFTmax | 532 | 0.0189 | 0.0408 | 0.0867 | 0.0844 | 0.048
MixGENTLEFTmax | 532 | 0.0186 | 0.0397 | 0.0833 | 0.0833 | 0.0473
MixGENTANDmax | 162 | 0.0175 | 0.0332 | 0.1533 | 0.1044 | 0.047
MixGENTLEFTmin | 532 | 0.0155 | 0.0339 | 0.0767 | 0.0822 | 0.0433
MixGFANDminENTANDmin | 67 | 0.0114 | 0.0152 | 0.1233 | 0.0622 | 0.0207
MixGFANDminENTLEFTmm | 165 | 0.0099 | 0.024 | 0.0533 | 0.0544 | 0.0363
MixGFANDminENTLEFTmax | 165 | 0.0081 | 0.0225 | 0.0367 | 0.0478 | 0.0333
(1) Evaluations for some experiments with the OR operator are missing.

5. Conclusions and Future Work

The highest MAP is obtained with the baseline text-based experiment in English, where only stemming plus stopword removal is performed. Neither tagging with UMLS medical concepts nor merging of textual and visual results proved to be of value for improving precision with respect to the baseline experiment. However, evaluations for some of our experiments were missing, so this finding cannot be confirmed and has to be further investigated.

In addition, experiments using French and German achieve very low precision. This result is similar to other experiments carried out in other CLEF tracks and may be attributed to deficient stemming modules. We will invest more effort in these languages in future participations.

Acknowledgements

This work has been partially supported by the Spanish R+D National Plan, by means of the project RIMMEL (Multilingual and Multimedia Information Retrieval, and its Evaluation), TIN2004-07588-C03-01; and by the Madrid R+D Regional Plan, by means of the project MAVIR (Enhancing the Access and the Visibility of Networked Multilingual Information for the Community of Madrid), S-0505/TIC/000267.

References

[1] Apache Lucene project. Online: http://lucene.apache.org [Visited 10/08/2007].
[2] CLEF 2005 Multilingual Information Retrieval resources page. Online: http://www.computing.dcu.ie/~gjones/CLEF2005/Multi-8/ [Visited 10/08/2007].
[3] Deselaers, T.; Keysers, D.; Ney, H. FIRE - Flexible Image Retrieval Engine: ImageCLEF 2004 Evaluation. In CLEF 2004, LNCS 3491, Bath, UK, pp. 688-698, September 2004.
[4] GIFT: The GNU Image-Finding Tool. Online: http://www.gnu.org/software/gift/ [Visited 10/08/2007].
[5] González, José C.; Villena, Julio; Moreno, Cristina; Martínez, J.L. Semiautomatic Extraction of Thesauri and Semantic Search in a Digital Image Archive. Integrating Technology and Culture: 10th International Conference on Electronic Publishing, ELPUB 2006, Bansko, Bulgaria, 14-16 June 2006.
[6] FIRE: Flexible Image Retrieval System. Online: http://www-i6.informatik.rwth-aachen.de/~deselaers/fire.html [Visited 10/08/2007].
[7] Martínez-Fernández, J.L.; Villena-Román, Julio; García-Serrano, Ana M.; Martínez-Fernández, Paloma. MIRACLE team report for ImageCLEF IR in CLEF 2006.
Proceedings of the Cross Language Evaluation Forum 2006, Alicante, Spain, 20-22 September 2006.
[8] Martínez-Fernández, J.L.; Villena-Román, Julio; García-Serrano, Ana M.; González-Cristóbal, José Carlos. Combining Textual and Visual Features for Image Retrieval. Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, Revised Selected Papers. Carol Peters et al. (Eds.). Lecture Notes in Computer Science, Vol. 4022, 2006. ISSN: 0302-9743.
[9] Müller, Henning; Deselaers, Thomas; Kim, Eugene; Kalpathy-Cramer, Jayashree; Deserno, Thomas; Clough, Paul; Hersh, William. Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks. Working Notes of the 2007 CLEF Workshop, Budapest, Hungary, September 2007.
[10] Porter, Martin. Snowball stemmers and resources page. Online: http://www.snowball.tartarus.org [Visited 10/08/2007].
[11] University of Neuchâtel. Page of resources for CLEF (stopwords, transliteration, stemmers, ...). Online: http://www.unine.ch/info/clef [Visited 10/08/2007].
[12] U.S. National Library of Medicine, National Institutes of Health. Online: http://www.nlm.nih.gov/research/umls/ [Visited 10/08/2007].