Text- and Content-based Approaches to Image Retrieval for the ImageCLEF 2009 Medical Retrieval Track

Matthew Simpson, Md Mahmudur Rahman, Dina Demner-Fushman, Sameer Antani, George R. Thoma
Lister Hill National Center for Biomedical Communications
National Library of Medicine, NIH, Bethesda, MD, USA
{simpsonmatt, rahmanmm, ddemner, santani}@mail.nih.gov

Abstract

This article describes the participation of the Image and Text Integration (ITI) group from the United States National Library of Medicine (NLM) in the ImageCLEF 2009 medical retrieval track. Our methods encompass a variety of techniques relating to document summarization and text- and content-based image retrieval. Our text-based approach utilizes the Unified Medical Language System (UMLS) synonymy of concepts identified in information requests and image-related text to retrieve semantically relevant images. Our content-based approaches utilize similarity metrics based on computed "visual concepts" to identify visually similar images. In this article we present an overview of these approaches, discuss our experiences combining them into multimodal retrieval strategies, and describe our submitted runs and results.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.7 Digital Libraries; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Object Recognition

General Terms

Measurement, Performance, Experimentation

Keywords

Image Retrieval, CBIR, Medical Imaging, Ontologies, UMLS

1 Introduction

This article describes the participation of the Image and Text Integration (ITI) group from the United States National Library of Medicine (NLM) in the ImageCLEF 2009 medical retrieval track. This is our second year participating in ImageCLEFmed. ImageCLEFmed'09 [17] consists of two medical retrieval tasks. In the first task, a set of ad-hoc information requests is given, and the goal is to retrieve the most relevant images pertaining to each topic. In the second task, a set of case-based information requests is given, and the goal is to retrieve the most relevant articles describing case studies pertaining to the topic case.

In the following sections, we describe our text-based approach (Section 2), which is suitable for both retrieval tasks, and several content-based approaches (Section 3) to the ad-hoc retrieval task. Our text-based approach relies on mapping information requests and image-related text to concepts in the Unified Medical Language System (UMLS) [13] Metathesaurus, and our content-based approaches analogously rely on mapping medical images to "visual concepts" using machine learning techniques. In Section 4, we suggest strategies for combining our text- and content-based approaches, describe our submitted runs, and present their results. For the ad-hoc retrieval task, our best run, a multimodal feedback approach, achieved a Mean Average Precision (MAP) of 0.38, and our best automatic run, a text-based approach, achieved a MAP of 0.35. For the case-based retrieval task, our automatic text-based approach achieved a MAP of 0.34 and was ranked first among all case-based run submissions.

2 Text-based Image Retrieval

In this section we describe our text-based approach to image retrieval. Effective text-based medical image retrieval requires (1) a document representation that contains the most pertinent information describing the content of the image and potential information needs and (2) a retrieval strategy that is appropriate for the biomedical domain.

Our document representation consists of several automatically extracted search areas in addition to the image captions provided in the ImageCLEFmed'09 [17] collection. These fields include the title of the article in which the image appears, the article's abstract, a brief mention (one sentence) of the image from the article's full text, and the Medical Subject Headings (MeSH terms) assigned to the article. MeSH is a controlled vocabulary created by NLM to index biomedical articles. We also provide a summary of each caption according to a structured representation of information needs relevant to the principles of evidence-based practice [7]. This search area includes automatically extracted fields relating to anatomy, diagnosis, population group, etc.

We use the Essie [12] search engine to index this collection of image documents and retrieve relevant images. Essie was originally developed by NLM to support the online registry of clinical research studies at ClinicalTrials.gov [15], and it now serves several other information retrieval systems at NLM. Key features of Essie that make it particularly well suited to the medical retrieval track include its automatic expansion of query terms along synonymy relationships in the UMLS Metathesaurus and its ability to weight term occurrences according to the location in the document where they occur. For example, term occurrences in an image caption can be given a higher weight than occurrences in the abstract of the article in which the image appears. Instead of stemming, Essie also expands query terms to include morphological variants derived from the UMLS SPECIALIST Lexicon.

To construct queries for each information request, we map topics to the UMLS using the MetaMap [1] tool and represent terms relating to image modality, clinical findings, and anatomy with their preferred UMLS names. Thus, each query consists of the conjunction of a set of UMLS concepts that are expanded by Essie during the retrieval process. For extracted modality terms that cannot be mapped to the UMLS, we perform an automatic term expansion based on a list of image modalities (originally created by the authors using RadLex (http://radlex.org/viewer) as a starting point [6]), which we expanded using the UMLS synonymy and manually augmented with missing terms (mostly abbreviations) based on the authors' experience creating the ITI modalities hierarchy.
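As a simplified illustration of this query-construction step, the following Python sketch builds a conjunctive query from topic terms. The term-to-concept dictionary and the modality expansion list are hypothetical stand-ins for the output of MetaMap and our modality list, and the AND/OR syntax is only schematic; it does not reproduce Essie's actual query language.

```python
# Hypothetical sketch of the query-construction step described above.
# PREFERRED_NAME stands in for MetaMap's mapping of topic terms to UMLS
# preferred concept names; MODALITY_EXPANSIONS stands in for our manually
# augmented modality list. Neither reflects the real resources.
PREFERRED_NAME = {
    "mri": "Magnetic Resonance Imaging",
    "head": "Head",
    "tumor": "Neoplasms",
}

MODALITY_EXPANSIONS = {
    "us": ["ultrasound", "ultrasonography", "sonogram"],
}

def build_query(topic_terms):
    """Map each topic term to its UMLS preferred name when one is found,
    expand unmapped modality terms from the modality list, and return the
    conjunction of the resulting clauses."""
    clauses = []
    for term in topic_terms:
        key = term.lower()
        if key in PREFERRED_NAME:
            clauses.append(f'"{PREFERRED_NAME[key]}"')
        elif key in MODALITY_EXPANSIONS:
            # Unmapped modality terms become a disjunction of known synonyms.
            synonyms = [term] + MODALITY_EXPANSIONS[key]
            clauses.append("(" + " OR ".join(f'"{s}"' for s in synonyms) + ")")
        else:
            clauses.append(f'"{term}"')
    return " AND ".join(clauses)

print(build_query(["MRI", "head", "tumor"]))
# "Magnetic Resonance Imaging" AND "Head" AND "Neoplasms"
print(build_query(["US", "liver"]))
# ("US" OR "ultrasound" OR "ultrasonography" OR "sonogram") AND "liver"
```

In practice, the expanded concepts are further broadened by Essie itself through its UMLS synonymy and morphological expansion at retrieval time.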
2.1 Case-based Retrieval Task

Our retrieval strategy for the case-based retrieval task is identical to that of the ad-hoc task. However, since the retrieval unit of the case-based task is an entire article, we construct an appropriate document representation by taking a simple union of all the search areas for each image in the article. That is, a case-based document consists of a title, abstract, MeSH terms, and the caption, mention, and structured caption summary of each image contained in the article.

3 Content-based Image Retrieval

In content-based image retrieval (CBIR), access to information is performed at a perceptual level based on automatically extracted low-level features (e.g., color, texture, and shape) [19]. The performance of a CBIR system depends on the underlying image representation, usually in the form of a feature vector.
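To make the feature-vector representation concrete, the sketch below is a minimal, hypothetical illustration using only numpy: it reduces a gray-level image to a fixed-length vector (an intensity histogram concatenated with a coarse edge-magnitude histogram) and compares two images by Euclidean distance. It is not the set of descriptors used in our runs, which are described in the following sections.

```python
import numpy as np

def global_feature_vector(image, bins=16):
    """Illustrative global descriptor: a normalized intensity histogram
    concatenated with a coarse edge-magnitude histogram. `image` is assumed
    to be a 2-D array of gray-level values in [0, 255]."""
    image = image.astype(float)

    # Normalized intensity histogram.
    hist, _ = np.histogram(image, bins=bins, range=(0, 255))
    hist = hist / max(hist.sum(), 1)

    # Coarse edge map from horizontal and vertical finite differences.
    gx = np.abs(np.diff(image, axis=1))
    gy = np.abs(np.diff(image, axis=0))
    edge_hist, _ = np.histogram(
        np.concatenate([gx.ravel(), gy.ravel()]), bins=bins, range=(0, 255))
    edge_hist = edge_hist / max(edge_hist.sum(), 1)

    # The whole image is represented by one fixed-length feature vector.
    return np.concatenate([hist, edge_hist])

def distance(query_vector, candidate_vector):
    # Euclidean distance: smaller values indicate greater visual similarity.
    return float(np.linalg.norm(query_vector - candidate_vector))
```

Retrieval then amounts to ranking the collection images by their distance to the query image in this feature space.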
Due to the limitations of low-level features in CBIR, and motivated by a learning paradigm, we explore classification at both the global collection level and the local, individual-image level in our submitted runs for ImageCLEFmed'09 [17]. In addition to this off-line supervised learning approach, we incorporate users' semantic perceptions interactively in the retrieval loop based on relevance feedback (RF) information. The following sections describe our feature representation schemes and the retrieval methods applied to the various visual and multimodal submitted runs.

3.1 Image Feature Representation

To generate feature vectors at different levels of abstraction, we extract both visual concept-based features, based on a "bag of concepts" model comprising color and texture patches from local image regions [21], and various low-level global features, including color, edge, and texture.

3.1.1 Visual Concept-based Image Representation

In the ImageCLEFmed'09 collection [18], it is possible to identify specific local patches in images that are perceptually and/or semantically distinguishable, such as homogeneous texture patterns in gray-level radiological images and varying color and texture structures in microscopic pathology and dermoscopic images. The content of these local patches can be effectively modeled as "visual concepts" [21] using supervised classification techniques such as the Support Vector Machine (SVM).

For concept model generation, we utilize a voting-based multi-class SVM known as one-against-one or pairwise coupling (PWC) [11]. In developing training samples for this SVM, only local image patches that map to visual concept models are used. To segment images automatically and to label the resulting segments unambiguously and consistently, a fixed-partition approach is used to divide the entire image space into an (r × r) grid of non-overlapping regions. Manual selection then limits the patches in the training set to those with a majority of their area (80%) covered by a single concept.

To train the SVMs on the local concept categories, a set of L labels is defined as C = {c_1, ..., c_i, ..., c_L}, where each c_i ∈ C characterizes a local concept category. The training set of local patches, each represented by color and texture moment-based features, is manually annotated with these concept labels in a mutually exclusive way. Images in the data set are annotated with local concept labels by partitioning each image I_j into an equivalent r × r grid of l region vectors {x_{1j}, ..., x_{kj}, ..., x_{lj}}, where each x_{kj} ∈