Multi-Modal Interactive Approach to ImageCLEF 2007 Photographic and Medical Retrieval Tasks by CINDI

M. M. Rahman, Bipin C. Desai, Prabir Bhattacharya
Dept. of Computer Science & Software Engineering, Concordia University
1455 de Maisonneuve Blvd., Montreal, QC, H3G 1M8, Canada
mah_rahm@cs.concordia.ca

Abstract

This paper presents the contribution of the CINDI group to the ImageCLEF 2007 ad-hoc retrieval tasks. We experiment with multi-modal (i.e., image and text) interaction and fusion approaches based on relevance feedback information for image retrieval in photographic and medical image collections. For text-based image search, keywords are extracted from the annotation files and indexed using the vector space model of information retrieval. For content-based image search, various global, semi-global, region-specific, and visual concept-based features are extracted at different levels of image abstraction. Based on relevance feedback information, multiple textual and visual query refinements are performed, and the user's perceived semantics are propagated from one modality to the other through query expansion. The feedback information also dynamically adjusts the intra- and inter-modality weights in a linear combination of similarity matching functions. Finally, the top-ranked images are obtained by performing both sequential and simultaneous retrieval approaches. An analysis of the results of the different runs is reported in this paper.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.7 Digital Libraries; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Object Recognition

General Terms

Algorithms, Performance, Experimentation

Keywords

Content-based image retrieval, Vector space model, Feature extraction, Query expansion, Relevance feedback, Data fusion

1 Introduction

For the 2007 ImageCLEF competition, the CINDI research group participated in two tasks of the ImageCLEF track: ad-hoc retrieval from a photographic collection (the IAPR data set) and ad-hoc retrieval from a medical collection (the CASImage, MIR, PathoPic, Peir, endoscopic, and myPACS data sets) [1, 2]. The goal of the ad-hoc task is, given a multilingual statement describing a user's information need along with example images, to find as many relevant images as possible in the given collection. Our work exploits the advantages of both the text and image modalities by involving users in the retrieval loop for cross-modal interaction and integration. This paper presents our multi-modal retrieval methodologies, a description of the submitted runs, and an analysis of the retrieval results.

2 Text-Based Image Retrieval Approach

This section describes the text-based image retrieval approach, in which a user submits a query topic as keywords in order to retrieve the images associated with the retrieved annotation files. For a text-based search, the document collection, consisting of annotated XML and SGML files, must first be prepared in an easily accessible representation. Each annotation file in the collection is linked to one or more images in a one-to-one or one-to-many relationship. To support keyword-based search on these annotation files, we rely on the vector space model of information retrieval [3]. In this model, a document is represented as a vector of words, where each word is a dimension in a Euclidean space.

The indexing is performed by extracting keywords from selected elements of the XML and SGML documents, depending on the image collection. Let T = \{t_1, t_2, \ldots, t_N\} denote the set of keywords (terms) in the collection. A document D_j is represented as a vector in an N-dimensional space as f_{D_j} = [w_{j1}, \ldots, w_{jk}, \ldots, w_{jN}]^T. The element w_{jk} = L_{jk} \cdot G_k denotes the tf-idf weight [3] of term t_k, k \in \{1, \ldots, N\}, in document D_j. Here, the local weight is L_{jk} = \log(f_{jk}) + 1, where f_{jk} is the frequency of occurrence of keyword t_k in document D_j. The global weight G_k is the inverse document frequency, G_k = \log(M/M_k), where M_k is the number of documents in which t_k occurs and M is the total number of documents in the collection. A query D_q is likewise represented as an N-dimensional vector f_{D_q} = [w_{q1}, \ldots, w_{qk}, \ldots, w_{qN}]^T. To compare D_q and D_j, the cosine similarity measure is applied:

\mathrm{Sim}_{\mathrm{text}}(D_q, D_j) = \frac{\sum_{k=1}^{N} w_{qk} \cdot w_{jk}}{\sqrt{\sum_{k=1}^{N} w_{qk}^2} \cdot \sqrt{\sum_{k=1}^{N} w_{jk}^2}}    (1)

where w_{qk} and w_{jk} are the weights of term t_k in D_q and D_j, respectively.
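To make the weighting and matching scheme above concrete, the following is a minimal Python sketch of the tf-idf weighting and of Eq. (1); the toy tokenized collection and query are hypothetical placeholders and do not reflect the actual CINDI indexing pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors: w_jk = (log(f_jk) + 1) * log(M / M_k)."""
    M = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # M_k per term
    idf = {t: math.log(M / df[t]) for t in df}
    return [{t: (math.log(f) + 1) * idf[t] for t, f in Counter(doc).items()}
            for doc in docs], idf

def cosine_sim(fq, fj):
    """Cosine similarity between two sparse tf-idf vectors (Eq. 1)."""
    dot = sum(w * fj.get(t, 0.0) for t, w in fq.items())
    nq = math.sqrt(sum(w * w for w in fq.values()))
    nj = math.sqrt(sum(w * w for w in fj.values()))
    return dot / (nq * nj) if nq and nj else 0.0

# Hypothetical toy collection of tokenized annotation files
docs = [["chest", "xray", "fracture"], ["chest", "ct", "lung"], ["flower", "field"]]
vecs, idf = tfidf_vectors(docs)
query = Counter(["chest", "xray"])
fq = {t: (math.log(f) + 1) * idf.get(t, 0.0) for t, f in query.items()}
print(sorted(((cosine_sim(fq, fj), j) for j, fj in enumerate(vecs)), reverse=True))
```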
2.1 Textual Query Refinement by Relevance Feedback

Query reformulation is a standard technique for reducing the ambiguity caused by the word-mismatch problem in information retrieval [4]. In the present work, we investigate an interactive way to generate multiple query representations and to integrate them in a similarity matching function by applying various relevance feedback methods. Relevance feedback prompts the user for feedback on the retrieval results and then uses that information in subsequent retrievals, with the goal of improving retrieval performance [4, 5]. We generate multiple query vectors by applying several relevance feedback methods. For the first method, we use the well-known Rocchio algorithm [6]:

f^{m}_{D_q}(\mathrm{Rocchio}) = \alpha f^{o}_{D_q} + \beta \frac{1}{|R|} \sum_{f_{D_j} \in R} f_{D_j} - \gamma \frac{1}{|\hat{R}|} \sum_{\hat{f}_{D_j} \in \hat{R}} \hat{f}_{D_j}    (2)

where f^{m}_{D_q} and f^{o}_{D_q} are the modified and original query vectors, R and \hat{R} are the sets of relevant and irrelevant document vectors, and \alpha, \beta, and \gamma are weights. This algorithm generally moves the new query point toward the relevant documents and away from the irrelevant documents in the feature space [6]. For our second feedback method, we use the Ide-dec-hi formula:

f^{m}_{D_q}(\mathrm{Ide}) = \alpha f^{o}_{D_q} + \beta \sum_{f_{D_j} \in R} f_{D_j} - \gamma \max_{\hat{R}}(f_{D_j})    (3)

where \max_{\hat{R}}(f_{D_j}) is the vector of the highest-ranked non-relevant document. This is a modified version of Rocchio's formula that eliminates the normalization by the numbers of relevant and non-relevant documents and allows limited negative feedback from only the top-ranked non-relevant document. For the experiments, we set the weights to \alpha = 1, \beta = 1, and \gamma = 1.
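The following is a minimal NumPy sketch of the two update rules, assuming dense query and document vectors and user-marked relevant/non-relevant sets; the function names and the final clipping of negative term weights are our additions, not part of the original formulation.

```python
import numpy as np

def rocchio(q_orig, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio update (Eq. 2): move the query toward the centroid of the
    relevant vectors and away from the centroid of the non-relevant ones."""
    q_new = alpha * q_orig
    if len(relevant):
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0.0)  # clip negative weights (common practice)

def ide_dec_hi(q_orig, relevant, top_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide-dec-hi update (Eq. 3): no normalization, and negative feedback
    comes only from the single highest-ranked non-relevant document."""
    q_new = alpha * q_orig + beta * np.sum(relevant, axis=0) - gamma * top_nonrelevant
    return np.maximum(q_new, 0.0)
```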
We also perform two different query reformulations based on local analysis. Generally, local analysis considers the top-k most highly ranked documents for query expansion, without any assistance from the user [12, 3]. In this work, however, we consider only the user-selected relevant images for further analysis. First, a simple query expansion approach is applied, based on identifying the five most frequently occurring keywords in the user-selected relevant documents. After selecting the additional keywords, the query vector is reformulated as f^{m}_{D_q}(\mathrm{Local1}) by re-weighting its keywords with the tf-idf weighting scheme, and it is re-submitted to the system as a new query.

The other query reformulation approach expands the query with terms that are correlated to the query terms. Such correlated terms are those present in local clusters built from the documents marked as relevant by the user. There are many ways to build a local cluster before performing query expansion [12, 3]. In this work, a correlation matrix C_{(|T_l| \times |T_l|)} = [c_{u,v}] is constructed [8], in which the rows and columns are associated with the terms of a local vocabulary T_l. The element c_{u,v} of this matrix is defined as

c_{u,v} = \frac{n_{u,v}}{n_u + n_v - n_{u,v}}    (4)

where n_u is the number of local documents that contain term t_u, n_v is the number of local documents that contain term t_v, and n_{u,v} is the number of local documents that contain both t_u and t_v. Thus, c_{u,v} measures the ratio between the number of local documents in which both t_u and t_v appear and the total number of local documents in which either t_u or t_v appears. If t_u and t_v co-occur in many documents, the value of c_{u,v} increases, and the terms are considered more strongly correlated.

Given the correlation matrix C, we use it to build the local correlation cluster. For a query term t_u \in D_q, we consider the u-th row of C (i.e., the row with all the correlations for keyword t_u). From that row, we take the three largest correlation values c_{u,l}, u \neq l, and add the corresponding terms t_l to the query. The process is repeated for each query term, and finally the query vector is reformulated as f^{m}_{D_q}(\mathrm{Local2}) by re-weighting its keywords with the tf-idf weighting scheme.
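As an illustration of the Local2 expansion, here is a small NumPy sketch under simplifying assumptions (binary term incidence over the user-selected relevant documents, a toy local vocabulary); the data structures and function names are ours, not the original implementation.

```python
import numpy as np

def correlation_matrix(doc_terms, vocab):
    """Build C (Eq. 4): c_uv = n_uv / (n_u + n_v - n_uv), a Jaccard-style
    co-occurrence measure over the locally selected relevant documents."""
    idx = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(doc_terms)))  # binary term-document incidence
    for j, terms in enumerate(doc_terms):
        for t in set(terms):
            if t in idx:
                A[idx[t], j] = 1.0
    n_uv = A @ A.T                 # pairwise co-occurrence counts
    n = np.diag(n_uv)              # per-term document counts n_u
    denom = n[:, None] + n[None, :] - n_uv
    return np.divide(n_uv, denom, out=np.zeros_like(n_uv), where=denom > 0)

def expand_query(query_terms, vocab, C, k=3):
    """For each query term, add the k most correlated terms from its row of C."""
    idx = {t: i for i, t in enumerate(vocab)}
    expanded = set(query_terms)
    for t in query_terms:
        if t not in idx:
            continue
        row = C[idx[t]].copy()
        row[idx[t]] = -1.0         # exclude the term itself (u != l)
        expanded.update(vocab[i] for i in np.argsort(row)[-k:] if row[i] > 0)
    return expanded
```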
3 Content-Based Image Retrieval Approach

In content-based image retrieval (CBIR), access to information is performed at a perceptual level, based on automatically extracted low-level features (e.g., color, texture, and shape) [13]. The performance of a CBIR system depends on the underlying image representation, usually in the form of a feature vector. To generate the feature vectors, various global, semi-global, region-specific, and visual concept-based image features are extracted at different levels of abstraction.

The MPEG-7 Edge Histogram Descriptor (EHD) and Color Layout Descriptor (CLD) are extracted for image representation at the global level [14]. To represent the EHD as a vector f^{ehd}, a histogram with 16 × 5 = 80 bins is computed. The CLD represents the spatial layout of an image in a very compact form in the YCbCr color space, where Y is the luma component and Cb and Cr are the blue and red chroma components [14]. In this work, a CLD with 10 Y, 3 Cb, and 3 Cr coefficients is extracted to form a 16-dimensional feature vector f^{cld}. The global distance between the feature vectors of a query image I_q and a database image I_j is a weighted Euclidean distance defined as

\mathrm{Dis}_{\mathrm{global}}(I_q, I_j) = \omega_{cld} \mathrm{Dis}_{cld}(f^{cld}_{I_q}, f^{cld}_{I_j}) + \omega_{ehd} \mathrm{Dis}_{ehd}(f^{ehd}_{I_q}, f^{ehd}_{I_j})    (5)

where \mathrm{Dis}_{cld}(f^{cld}_{I_q}, f^{cld}_{I_j}) and \mathrm{Dis}_{ehd}(f^{ehd}_{I_q}, f^{ehd}_{I_j}) are the Euclidean distances for the CLD and EHD respectively, and \omega_{cld} and \omega_{ehd} are weights for the two feature distances, subject to 0 \leq \omega_{cld}, \omega_{ehd} \leq 1 and \omega_{cld} + \omega_{ehd} = 1; they are initially set to equal values, \omega_{cld} = \omega_{ehd} = 0.5.

For the semi-global feature vector, a simple grid-based approach is used to divide each image into five overlapping sub-images [16]. Several moment-based color and texture features are extracted from each sub-image and later combined to form a semi-global feature vector. The mean and standard deviation of each color channel in the HSV color space are extracted from each overlapping sub-region of an image I_j. Various texture moment-based features (energy, maximum probability, entropy, contrast, and inverse difference moment) are also extracted from the grey-level co-occurrence matrix (GLCM) [15]. The color and texture feature vectors are normalized and combined to form a joint feature vector f^{sg}_r for each sub-image r, and these are finally combined into the semi-global feature vector f^{sg} of the entire image. The semi-global distance between I_q and I_j is defined as

\mathrm{Dis}_{\mathrm{s\text{-}global}}(I_q, I_j) = \mathrm{Dis}_{sg}(f^{sg}_{I_q}, f^{sg}_{I_j}) = \frac{1}{5} \sum_{r=1}^{5} \omega_r \mathrm{Dis}_r(f^{sg}_{r_q}, f^{sg}_{r_j})    (6)

where \mathrm{Dis}_r(f^{sg}_{r_q}, f^{sg}_{r_j}) is the Euclidean distance between the feature vectors of region r, and the \omega_r are the region weights, set to equal values initially.

Region-based image retrieval (RBIR) aims to overcome the limitations of the global and semi-global approaches by automatically fragmenting an image into a set of homogeneous regions based on color and/or texture properties. Hence, we adopt a local, region-specific feature extraction approach in which an image is automatically fragmented into a set of homogeneous regions composed of (2 × 2)-pixel blocks, using a fast k-means clustering technique. The image-level distance between I_q and I_j is measured by integrating the properties of all regions in the two images. Suppose there are M regions in image I_q and N regions in image I_j. The image-level distance is then defined as

\mathrm{Dis}_{\mathrm{local}}(I_q, I_j) = \frac{\sum_{i=1}^{M} w_{r_{i_q}} \mathrm{Dis}_{r_{i_q}}(q, j) + \sum_{k=1}^{N} w_{r_{k_j}} \mathrm{Dis}_{r_{k_j}}(j, q)}{2}    (7)

where w_{r_{i_q}} and w_{r_{k_j}} are the weights (e.g., region size in number of image blocks) of region r_{i_q} of image I_q and region r_{k_j} of image I_j, respectively. For each region r_{i_q} \in I_q, \mathrm{Dis}_{r_{i_q}}(q, j) is defined as the minimum Bhattacharyya distance [18] between this region and any region r_{k_j} \in I_j, i.e., \mathrm{Dis}_{r_{i_q}}(q, j) = \min(\mathrm{Dis}(r_{i_q}, r_{1_j}), \ldots, \mathrm{Dis}(r_{i_q}, r_{N_j})). The Bhattacharyya distance is computed from the mean color vector and the covariance matrix of the color channels in the HSV color space of each region; a sketch of this matching scheme is given at the end of this section. The details of the segmentation, local feature extraction, and similarity matching schemes are described in our previous work [16].

We also extract visual concept-based image features, which are analogous to a keyword-based representation in the text retrieval domain. The visual concepts depict perceptually distinguishable color or texture patches in local image regions. For example, a predominant yellow color patch may be present either in an image of the sun or in an image of a sunflower. To generate a set of visual concepts analogous to a dictionary of keywords, we apply a fixed decomposition approach that partitions each image with a 16 × 16 grid. Sample images from a training set are thus partitioned into 256 equal, non-overlapping blocks. To represent each block as a feature vector, color and texture moment-based features are extracted as described above for the semi-global features. To generate a codebook of prototype concept vectors from the block features, we use a SOM-based clustering technique [17]. The basic structure of a SOM consists of two layers: an input layer and a competitive output layer. The input layer consists of a set of input node vectors X = \{x_1, \ldots, x_i, \ldots, x_n\}, where each x_i is a block feature vector presented to the map.
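To make the codebook generation step concrete, the following is a compact, generic SOM training loop written for illustration; the grid size, learning-rate and neighborhood schedules, and the 11-dimensional block features (six HSV color moments plus five GLCM texture moments) are our assumptions, not the paper's actual settings.

```python
import numpy as np

def train_som(data, grid=(8, 8), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a 2-D SOM whose weight vectors serve as the codebook of
    prototype visual-concept vectors (one prototype per output node)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h * w, data.shape[1]))
    # Grid coordinates of the nodes in the competitive output layer
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / n_steps
            lr = lr0 * (1.0 - t)                # decaying learning rate
            sigma = sigma0 * (1.0 - t) + 1e-3   # shrinking neighborhood radius
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best match
            g = np.exp(-np.sum((coords - coords[bmu]) ** 2, axis=1)
                       / (2 * sigma ** 2))
            weights += lr * g[:, None] * (x - weights)
            step += 1
    return weights  # codebook: one prototype concept vector per node

# Hypothetical input: 1000 block features (6 color + 5 texture moments)
codebook = train_som(np.random.rand(1000, 11), grid=(8, 8))
```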
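And, as referenced above, here is a NumPy sketch of the region-level matching of Eq. (7), assuming Gaussian region descriptors (mean HSV color vector and channel covariance matrix, as described in the text); normalizing the block-count weights so that they sum to one is an assumption the text leaves open.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian region descriptors;
    assumes well-conditioned (invertible) covariance matrices."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov)
                         / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def one_sided(regions_a, regions_b):
    """Weighted sum of each region's distance to its closest region in the
    other image; weights are block counts, normalized to sum to one."""
    total_blocks = sum(r["blocks"] for r in regions_a)
    dist = 0.0
    for ra in regions_a:
        d_min = min(bhattacharyya(ra["mu"], ra["cov"], rb["mu"], rb["cov"])
                    for rb in regions_b)
        dist += (ra["blocks"] / total_blocks) * d_min
    return dist

def dis_local(regions_q, regions_j):
    """Image-level distance of Eq. (7): mean of the two one-sided sums."""
    return 0.5 * (one_sided(regions_q, regions_j) + one_sided(regions_j, regions_q))
```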