<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Interactive Approach to ImageCLEF 2007 Photographic and Medical Retrieval Tasks by CINDI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. M. Rahman</string-name>
          <email>rahm@cs.concordia.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bipin C. Desai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prabir Bhattacharya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>1455 de Maisonneuve Blvd.</addr-line>
          ,
          <addr-line>Montreal, QC, H3G 1M8</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science &amp; Software Engineering, Concordia University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the contribution of the CINDI group to the ImageCLEF 2007 ad-hoc retrieval tasks. We experiment with multi-modal (i.e., image and text) interaction and fusion approaches based on relevance feedback information for the image retrieval tasks on photographic and medical image collections. For the text-based image search, keywords are extracted from the annotation files and indexed with the vector space model of information retrieval. For the content-based image search, various global, semi-global, region-specific, and visual concept-based features are extracted at different levels of image abstraction. Based on relevance feedback information, multiple textual and visual query refinements are performed, and the user's perceived semantics are propagated from one modality to the other through query expansion. The feedback information also dynamically adjusts the intra- and inter-modality weights in a linear combination of similarity matching functions. Finally, the top-ranked images are obtained by performing both sequential and simultaneous retrieval approaches. An analysis of the results of the different runs is reported in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
        <kwd>I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Object Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For the 2007 ImageCLEF competition, the CINDI research group participated in two different tasks of the ImageCLEF track: ad-hoc retrieval from a photographic collection (the IAPR data set) and ad-hoc retrieval from a medical collection (the CASImage, MIR, PathoPic, Peir, endoscopic, and myPACS data sets) [1, 2]. The goal of the ad-hoc task is, given a multilingual statement describing a user information need along with example images, to find as many relevant images as possible in the given collection. Our work exploits the advantages of both the text and the image modalities by involving users in the retrieval loop for cross-modal interaction and integration. This paper presents our multi-modal retrieval methodologies, a description of the submitted runs, and an analysis of the retrieval results.</p>
    </sec>
    <sec id="sec-2">
      <title>Text-Based Image Retrieval Approach</title>
      <p>This section describes the text-based image retrieval approach, in which a user submits a query topic using keywords to retrieve the images associated with the retrieved annotation files. For a text-based search, it is necessary to prepare the document collection, consisting of annotated XML and SGML files, in an easily accessible representation. Each annotation file in the collection is linked to its image(s) in either a one-to-one or a one-to-many relationship. To support a keyword-based search on these annotation files, we rely on the vector space model of information retrieval [3]. In this model, a document is represented as a vector of words, where each word is a dimension in a Euclidean space. Indexing is performed by extracting keywords from selected elements of the XML and SGML documents, depending on the image collection. Let T = {t_1, t_2, ..., t_N} denote the set of keywords (terms) in the collection. A document Dj is represented as a vector in an N-dimensional space as f_Dj = [w_j1, ..., w_jk, ..., w_jN]^T. The element w_jk = L_jk · G_k denotes the tf-idf weight [3] of term t_k, k ∈ {1, ..., N}, in document Dj. Here, the local weight is L_jk = log(f_jk) + 1, where f_jk is the frequency of occurrence of keyword t_k in document Dj. The global weight G_k is the inverse document frequency G_k = log(M / M_k), where M_k is the number of documents in which t_k appears and M is the total number of documents in the collection. A query Dq is likewise represented as an N-dimensional vector f_Dq = [w_q1, ..., w_qk, ..., w_qN]^T. To compare Dq and Dj, the cosine similarity measure is applied as follows:</p>
      <p>Sim_text(Dq, Dj) = (Σ_{k=1}^{N} w_qk · w_jk) / (sqrt(Σ_{k=1}^{N} (w_qk)^2) · sqrt(Σ_{k=1}^{N} (w_jk)^2)) (1)
where w_qk and w_jk are the weights of term t_k in Dq and Dj respectively.</p>
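The tf-idf weighting and the cosine measure of equation (1) can be sketched as follows (a minimal Python illustration over sparse {term: weight} dictionaries; the function names are ours, not part of the submitted system):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors: w_jk = (log(f_jk) + 1) * log(M / M_k)."""
    M = len(docs)
    # M_k: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (math.log(f) + 1) * math.log(M / df[t])
                        for t, f in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (eq. 1)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Here a document is simply a token list; a query would be weighted with the same scheme before scoring against the collection.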
      <sec id="sec-2-1">
        <title>Textual Query Refinement by Relevance Feedback</title>
        <p>Query reformulation is a standard technique for reducing the ambiguity caused by the word mismatch problem in information retrieval [4]. In the present work, we investigate an interactive way to generate multiple query representations and to integrate them in a similarity matching function by applying various relevance feedback methods. A relevance feedback technique prompts the user for feedback on the retrieval results and then uses that information in subsequent retrievals, with the goal of increasing retrieval performance [4, 5]. We generate multiple query vectors by applying various relevance feedback methods. For the first method, we use the well-known Rocchio algorithm [6] as follows:</p>
        <p>f_Dq^m(Rocchio) = α f_Dq^o + β (1/|R|) Σ_{f_Dj ∈ R} f_Dj − γ (1/|R̂|) Σ_{f_Dj ∈ R̂} f_Dj (2)
where f_Dq^m and f_Dq^o are the modified and the original query vectors, R and R̂ are the sets of relevant and irrelevant document vectors, and α, β, and γ are weights. This algorithm generally moves the new query point toward the relevant documents and away from the irrelevant documents in the feature space [6]. For our second feedback method, we use the Ide-dec-hi formula:</p>
        <p>f_Dq^m(Ide) = α f_Dq^o + β Σ_{f_Dj ∈ R} f_Dj − γ max_{R̂}(f_Dj) (3)
where max_{R̂}(f_Dj) is the vector of the highest-ranked non-relevant document. This is a modified version of Rocchio's formula that eliminates the normalization by the number of relevant and non-relevant documents and allows limited negative feedback, from only the top-ranked non-relevant document. For the experiments, we set the weights to α = 1, β = 1, and γ = 1.
        </p>
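The two feedback updates of equations (2) and (3) can be sketched over sparse {term: weight} dictionaries (a Python illustration with our own function names; the paper's system is not reproduced here):

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio update (eq. 2): move the query toward the centroid of the
    relevant vectors and away from the centroid of the irrelevant ones."""
    new_q = {t: alpha * w for t, w in q.items()}
    for docs, coef in ((relevant, beta), (irrelevant, -gamma)):
        denom = len(docs) or 1   # guard against an empty feedback set
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coef * w / denom
    return new_q

def ide_dec_hi(q, relevant, top_nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide-dec-hi (eq. 3): unnormalized sum over the relevant vectors,
    negative feedback from only the highest-ranked non-relevant one."""
    new_q = {t: alpha * w for t, w in q.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for t, w in top_nonrel.items():
        new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q
```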
        <p>We also perform two different query reformulations based on local analysis. Generally, local analysis considers the top k most highly ranked documents for query expansion, without any assistance from the user [12, 3]. In this work, however, we consider only the user-selected relevant images for further analysis. At first, a simple query expansion approach is applied, based on identifying the five most frequently occurring keywords in the user-selected relevant documents. After selecting the additional keywords, the query vector is reformulated as f_Dq^m(Local1) by re-weighting its keywords with the tf-idf weighting scheme, and it is re-submitted to the system as a new query. The other query reformulation approach expands the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the relevant documents indicated by the user. There are many ways to build a local cluster before performing any query expansion [12, 3]. For this work, a correlation matrix C(|Tl| × |Tl|) = [c_{u,v}] is constructed [8], in which the rows and columns are associated with the terms of a local vocabulary Tl. The element c_{u,v} of this matrix is defined as</p>
        <p>c_{u,v} = n_{u,v} / (n_u + n_v − n_{u,v}) (4)
where n_u is the number of local documents that contain term t_u, n_v is the number of local documents that contain term t_v, and n_{u,v} is the number of local documents that contain both t_u and t_v. Hence, c_{u,v} measures the ratio between the number of local documents in which both t_u and t_v appear and the number of local documents in which either t_u or t_v appears. If t_u and t_v co-occur in many documents, the value of c_{u,v} increases and the terms are considered more correlated. Given the correlation matrix C, we use it to build the local correlation cluster. For a query term t_u ∈ Dq, we consider the u-th row of C (i.e., the row with all the correlations for the keyword t_u). From that row, we take the three largest correlation values c_{u,l}, u ≠ l, and add the corresponding terms t_l to the query. The process is repeated for each query term, and finally the query vector is reformulated as f_Dq^m(Local2) by re-weighting its keywords with the tf-idf weighting scheme.
        </p>
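The correlation-based expansion of equation (4) can be sketched as follows (a Python illustration computed only over the user-selected relevant documents, as in the text; names and the `per_term` parameter are ours):

```python
from collections import Counter
from itertools import combinations

def expand_query(query_terms, relevant_docs, per_term=3):
    """Correlated-term expansion (eq. 4): c_uv = n_uv / (n_u + n_v - n_uv),
    computed over the user-selected relevant documents only."""
    n = Counter()          # n_u: local documents containing term u
    n_pair = Counter()     # n_uv: local documents containing both u and v
    for doc in relevant_docs:
        terms = set(doc)
        n.update(terms)
        for u, v in combinations(sorted(terms), 2):
            n_pair[(u, v)] += 1

    def corr(u, v):
        uv = n_pair[tuple(sorted((u, v)))]
        return uv / (n[u] + n[v] - uv)

    expanded = set(query_terms)
    for u in query_terms:
        if u in n:
            # take the per_term strongest correlates of u (u itself excluded)
            cands = sorted((t for t in n if t != u),
                           key=lambda t: corr(u, t), reverse=True)
            expanded.update(cands[:per_term])
    return expanded
```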
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Content-based Image Retrieval Approach</title>
      <p>In content-based image retrieval (CBIR), access to the information is performed at a perceptual level, based on automatically extracted low-level features (e.g., color, texture, and shape) [13]. The performance of a CBIR system depends on the underlying image representation, usually in the form of a feature vector. To generate the feature vectors, various global, semi-global, region-specific, and visual concept-based image features are extracted at different levels of abstraction. The MPEG-7 Edge Histogram Descriptor (EHD) and Color Layout Descriptor (CLD) are extracted for the image representation at the global level [14]. To represent the EHD as a vector f_ehd, a histogram with 16 × 5 = 80 bins is obtained. The CLD represents the spatial layout of an image in a very compact form in the YCbCr color space, where Y is the luma component and Cb and Cr are the blue and red chroma components [14]. In this work, a CLD with 10 Y, 3 Cb, and 3 Cr coefficients is extracted to form a 16-dimensional feature vector f_cld. The global distance measure between the feature vectors of a query image Iq and a database image Ij is a weighted Euclidean distance measure, defined as</p>
      <p>Dis_global(Iq, Ij) = ω_cld Dis_cld(f_Iq^cld, f_Ij^cld) + ω_ehd Dis_ehd(f_Iq^ehd, f_Ij^ehd) (5)
where Dis_cld(f_Iq^cld, f_Ij^cld) and Dis_ehd(f_Iq^ehd, f_Ij^ehd) are the Euclidean distance measures for the CLD and EHD respectively, and ω_cld and ω_ehd are the weights of the feature distance measures, subject to 0 ≤ ω_cld, ω_ehd ≤ 1 and ω_cld + ω_ehd = 1, initially set to equal values ω_cld = 0.5 and ω_ehd = 0.5. For the semi-global feature vector, a simple grid-based approach is used to divide the image into five overlapping sub-images [16]. Several moment-based color and texture features are extracted from each of the sub-images, and they are later combined to form a semi-global feature vector. The mean and standard deviation of each color channel in HSV color space are extracted from each overlapping sub-region of an image Ij. Various texture moment-based features (such as energy, maximum probability, entropy, contrast, and inverse difference moment) are also extracted from the grey-level co-occurrence matrix (GLCM) [15]. The color and texture feature vectors are normalized and combined to form a joint feature vector f_r^sg for each sub-image r, and finally they are combined into the semi-global feature vector f_sg of the entire image. The semi-global distance measure between Iq and Ij is defined as</p>
      <p>Dis_s-global(Iq, Ij) = Σ_{r=1}^{5} ω_r Dis_r(f_rq^sg, f_rj^sg) (6)
where Dis_r(f_rq^sg, f_rj^sg) is the Euclidean distance measure for the feature vectors of region r, and the ω_r are the weights of the regions, which are initially set to be equal.
      </p>
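A weighted combination of per-descriptor Euclidean distances, as in equations (5) and (6), can be sketched generically (a Python illustration; the descriptor names in the example dictionaries are placeholders, not the actual 80- and 16-dimensional MPEG-7 vectors):

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def global_distance(q_feats, j_feats, weights=None):
    """Weighted global distance (eq. 5 pattern): a convex combination of
    per-descriptor Euclidean distances, with equal weights by default."""
    names = sorted(q_feats)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(weights[name] * euclidean(q_feats[name], j_feats[name])
               for name in names)
```

The same function covers the semi-global case of equation (6) if the dictionary keys are the five sub-image regions instead of descriptors.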
      <p>Region-based image retrieval (RBIR) aims to overcome the limitations of the global and semi-global retrieval approaches by automatically fragmenting an image into a set of homogeneous regions based on color and/or texture properties. Hence, we consider a local, region-specific feature extraction approach that automatically fragments an image into a set of homogeneous regions made up of (2 × 2) pixel blocks, based on a fast k-means clustering technique. The image-level distance between Iq and Ij is measured by integrating the properties of all the regions in the two images. Suppose there are M regions in image Iq and N regions in image Ij. The image-level distance is then defined as</p>
      <p>Dis_local(Iq, Ij) = (Σ_{i=1}^{M} w_{riq} Dis_{riq}(q, j) + Σ_{k=1}^{N} w_{rkj} Dis_{rkj}(j, q)) / 2 (7)
where w_{riq} and w_{rkj} are the weights (e.g., with the number of image blocks as the unit) of region r_iq of image Iq and region r_kj of image Ij respectively. For each region r_iq ∈ Iq, Dis_{riq}(q, j) is defined as the minimum Bhattacharyya distance [18] between this region and any region r_kj ∈ Ij, i.e., Dis_{riq}(q, j) = min(Dis(r_iq, r_1j), ..., Dis(r_iq, r_Nj)). The Bhattacharyya distance is computed from the mean color vector and the covariance matrix of the color channels in HSV color space of each region. The details of the segmentation, local feature extraction, and similarity matching schemes are described in our previous work [16].
      </p>
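The region-matching scheme of equation (7) can be sketched as follows (a Python illustration; we assume, as one plausible reading, that each direction is normalized by its total region weight, and any region-to-region distance such as the paper's Bhattacharyya measure can be plugged in):

```python
def local_distance(regions_q, regions_j, dist):
    """Image-level region distance (eq. 7 pattern): each region is matched
    to its closest counterpart in the other image, per-region distances are
    weighted by region size, and the two directions are averaged.
    regions_q / regions_j are lists of (weight, feature) pairs."""
    def directed(src, dst):
        total_w = sum(w for w, _ in src)
        return sum(w * min(dist(f, g) for _, g in dst)
                   for w, f in src) / total_w
    return (directed(regions_q, regions_j)
            + directed(regions_j, regions_q)) / 2
```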
      <p>We also extract visual concept-based image features, which are analogous to a keyword-based representation in the text retrieval domain. The visual concepts depict perceptually distinguishable color or texture patches in local image regions. For example, a predominant yellow color patch can be present either in an image of the sun or in an image of a sunflower. To generate a set of visual concepts analogous to a dictionary of keywords, we consider a fixed decomposition approach that generates a 16 × 16 grid-based partition of the images. Accordingly, sample images from a training set are equally partitioned into 256 non-overlapping smaller blocks. To represent each block as a feature vector, color and texture moment-based features are extracted as described for the semi-global feature. To generate a codebook of prototype concept vectors from the block features, we use a SOM-based clustering technique [17]. The basic structure of a SOM consists of two layers: an input layer and a competitive output layer. The input layer consists of a set of input node vectors X = {x_1, ..., x_i, ..., x_n}, x_i ∈ R^d, while the output layer consists of a set of N neurons C = {c_1, ..., c_j, ..., c_N}, where each neuron c_j is associated with a weight vector c_j ∈ R^d. After the weight vectors are determined through the learning process, each output neuron c_j represents a visual concept, with its associated weight vector c_j serving as a code vector of the codebook. To encode an image, it is likewise decomposed into an even grid-based partition, and the color and texture moment-based features are extracted from each block. For the joint color and texture moment-based feature vector of each block, the nearest output node c_k, 1 ≤ k ≤ N, is identified by applying the Euclidean distance measure, and the corresponding index k of the output node c_k is stored for that particular block of the image.</p>
      <p>
        Based on this encoding scheme, an image Ij can be represented as a vector f_Ij^V-concept = [f_1j, ..., f_ij, ..., f_Nj]^T, where each dimension corresponds to a concept index in the codebook. The element f_ij represents the frequency of occurrence of concept c_i in Ij. For this work, codebooks of size 400 (i.e., 20 × 20 units) are constructed for the photographic and medical collections by manually selecting 2% of the images of each collection as the training set. Since the concept-based feature space is closely related to the keyword-based feature space of documents, we apply the cosine measure to compare images Iq and Ij, as described in equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
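The encoding step can be sketched as follows (a Python illustration of nearest-code-vector assignment and concept-frequency histogramming; training the SOM itself is out of scope here, so the codebook is taken as given):

```python
import math

def encode_image(block_features, codebook):
    """Visual-concept encoding: each image block is mapped to its nearest
    code vector (Euclidean distance), and the image becomes a frequency
    histogram over the codebook, analogous to term frequencies in text."""
    hist = [0] * len(codebook)
    for f in block_features:
        dists = [math.dist(f, c) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist
```

The resulting histograms can then be compared with the same cosine measure used for the text vectors.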
      <sec id="sec-3-1">
        <title>Visual Query Refinement by Relevance Feedback</title>
        <p>This section presents the visual query refinement approach at the different levels of image representation. The query refinement is closely related to the approach in [9]. It is assumed that all the positive feedback images at a particular iteration belong to the user's perceived visual and/or semantic category and obey a Gaussian distribution, forming a cluster in the feature space. We consider the rest of the images as irrelevant; they may belong to different semantic categories. However, we do not use the irrelevant images for query refinement. The modified query vector at a particular iteration is represented as the mean of the relevant image vectors:</p>
        <p>f_Iq^m,x = (1/|R|) Σ_{f_Il ∈ R} f_Il^x (8)
where R is the set of relevant image vectors and x ∈ {global, sg, V-concept}. Next, the covariance matrix of the positive feature vectors is estimated as</p>
        <p>C_x = (1/(|R| − 1)) Σ_{l=1}^{|R|} (f_Il^x − f_Iq^m,x)(f_Il^x − f_Iq^m,x)^T (9)
However, a singularity issue arises in the covariance matrix estimation when fewer training samples or positive images are available than the feature dimension (as will be the case for user feedback images). So, we add regularization to avoid singular matrices as follows [19]:</p>
        <p>
          Ĉ_x = α C_x + (1 − α) I (10)
for some 0 ≤ α ≤ 1, where I is the identity matrix. After generating the mean vector and the covariance matrix of a feature x ∈ {global, sg, V-concept}, we adaptively replace the distance measure functions of equations (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ) and (
          <xref ref-type="bibr" rid="ref6">6</xref>
          ) with the following Mahalanobis distance measure [18] for a query image Iq and a database image Ij:
        </p>
        <p>Dis_x(Iq, Ij) = (f_Iq^m,x − f_Ij^x)^T Ĉ_x^(−1) (f_Iq^m,x − f_Ij^x) (11)
The Mahalanobis distance differs from the Euclidean distance in that it takes the correlations of the data set into account and is scale-invariant, i.e., not dependent on the scale of the measurements [18]. We did not perform any query refinement for the region-specific feature, owing to its variable feature dimension for the variable number of regions in each image.</p>
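The chain of equations (8)-(11) can be sketched in a few lines of NumPy (an illustration only; the α value here is an arbitrary example, not the value used in the submitted runs):

```python
import numpy as np

def refine_and_compare(relevant, db_feature, alpha=0.7):
    """Visual query refinement sketch (eqs. 8-11): the refined query is the
    mean of the relevant feature vectors; the sample covariance is
    regularized toward the identity to avoid singularity; the comparison
    uses a Mahalanobis distance."""
    R = np.asarray(relevant, dtype=float)            # |R| x d matrix
    mean = R.mean(axis=0)                            # eq. 8
    if R.shape[0] > 1:
        cov = np.cov(R, rowvar=False)                # eq. 9, 1/(|R|-1) form
    else:
        cov = np.zeros((R.shape[1], R.shape[1]))     # single sample: no spread
    cov_reg = alpha * cov + (1 - alpha) * np.eye(R.shape[1])   # eq. 10
    diff = mean - np.asarray(db_feature, dtype=float)
    return float(diff @ np.linalg.inv(cov_reg) @ diff)         # eq. 11
```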
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Combination of Evidence by Dynamic Weight Update</title>
      <p>In recent years, the category of work known as data fusion, or multiple-evidence combination, has described a range of techniques in information retrieval whereby multiple pieces of information are combined to improve retrieval effectiveness [10, 11]. These pieces of information can take many forms, including different query representations, different document (image) representations, and different retrieval strategies used to obtain a measure of the relationship between a query and a document (image). Motivated by this paradigm, we described multiple textual query and image representation schemes in Sections 2 and 3. This section presents an adaptive linear combination approach based on relevance feedback information. One of the most commonly used approaches in data fusion is the linear combination of similarity scores. For our multi-modal retrieval purpose, let us consider q as a multi-modal query that has an image part Iq and a document part (annotation file) Dq. In a linear combination scheme, the similarity between q and a multi-modal item j, which also has two parts (e.g., image Ij and text Dj), is defined as</p>
      <p>Sim(q, j) = ω_I Sim_I(Iq, Ij) + ω_D Sim_D(Dq, Dj) (12)
where ω_I and ω_D are the inter-modality weights of the image and text modalities, subject to 0 ≤ ω_I, ω_D ≤ 1 and ω_I + ω_D = 1. The image-based similarity is in turn defined as a linear combination of the similarity measures at the different levels of image representation:</p>
      <p>Sim_I(Iq, Ij) = Σ_{IF} ω_IF Sim_I^IF(Iq, Ij) (13)
where IF ∈ {global, semi-global, region, V-concept} and the ω_IF are the weights of the different image representation schemes (i.e., the intra-modality weights). The text-based similarity is likewise defined as a linear combination of the similarity matching of the different query representation schemes:</p>
      <p>Sim_D(Dq, Dj) = Σ_{QF} ω_QF Sim_D^QF(Dq, Dj) (14)
where QF ∈ {Rocchio, Ide, Local1, Local2} and the ω_QF are the weights of the different query representation schemes.</p>
      <p>
        The effectiveness of the linear combination depends mainly on the choice of the different inter- and intra-modality weights. We use a dynamic weight updating method in the linear combination schemes that considers both the precision and the rank order of the top K retrieved images. Before any fusion, the distance scores of each representation are normalized and converted to similarity scores in the range [0, 1] as Sim(q, j) = 1 − (Dis(q, j) − min(Dis(q, j))) / (max(Dis(q, j)) − min(Dis(q, j))), where min(·) and max(·) are the minimum and maximum distance scores. In this approach, all the features and their similarity matching functions initially receive equal emphasis through their weights. The weights are then updated dynamically during the subsequent iterations by incorporating the feedback information of the previous round. To update the inter-modality weights (e.g., ω_I and ω_D), we first perform the multi-modal similarity matching based on equation (
        <xref ref-type="bibr" rid="ref12">12</xref>
        ). After the initial retrieval with a linear combination of equal weights (e.g., ω_I = 0.5 and ω_D = 0.5), the user provides feedback about the relevant images among the top K returned images. For each ranked list produced by an individual similarity matching, we also consider the top K images and measure the effectiveness of a query/image feature as
      </p>
      <p>
        E(D or I) = (Σ_{i=1}^{K} Rank(i)) / (K/2) · P(K) (15)
where Rank(i) = 0 if the image at rank position i is not relevant according to the user's feedback, and Rank(i) = (K − i)/(K − 1) for the relevant images. Here, P(K) = R_K / K is the precision at top K, where R_K is the number of relevant images among the top K retrieved results. Hence, equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ) is basically the product of two factors, rank order and precision. The raw performance scores obtained by this procedure are then normalized by the total score, as Ê(D) = ω̂_D = E(D) / (E(D) + E(I)) and Ê(I) = ω̂_I = E(I) / (E(D) + E(I)), to generate the updated text and image feature weights respectively. For the next retrieval iteration with the same query, these modified weights are used in the multi-modal similarity matching function as
      </p>
      <p>Sim(q, j) = ω̂_I Sim_I(Iq, Ij) + ω̂_D Sim_D(Dq, Dj) (16)
This weight updating process may be continued as long as the user provides relevance feedback, or until no changes are noticed because the system has converged.</p>
      <p>
        In a similar fashion, to update the intra-modality weights (e.g., the ω_QF and ω_IF), we consider the top K images of each individual result list. For the image-based similarity in equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ), we consider the result lists of the different image features IF ∈ {global, semi-global, region, V-concept} and measure their weights with equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ) for the next retrieval iteration. For the text-based similarity in equation (
        <xref ref-type="bibr" rid="ref14">14</xref>
        ), the top K images in the result lists of the different query features QF ∈ {Rocchio, Ide, Local1, Local2} are considered, and the text-level weights are determined in the same way by applying equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ).
      </p>
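The effectiveness measure of equation (15) and the weight normalization can be sketched as follows (a Python illustration; `relevant_flags` is our own encoding of the user's per-rank judgments):

```python
def effectiveness(relevant_flags):
    """Feature effectiveness E (eq. 15): product of a rank-order factor
    and precision at K. relevant_flags[i] is True if the image at rank
    i+1 was judged relevant by the user."""
    K = len(relevant_flags)
    # Rank(i) = (K - i)/(K - 1) for relevant items at 1-based rank i, else 0
    rank_score = sum((K - (i + 1)) / (K - 1)
                     for i, rel in enumerate(relevant_flags) if rel)
    precision = sum(relevant_flags) / K      # P(K) = R_K / K
    return rank_score / (K / 2) * precision

def update_weights(scores):
    """Normalize the raw E scores into weights that sum to 1 (the hat
    weights used in eq. 16)."""
    total = sum(scores.values()) or 1.0      # guard against all-zero scores
    return {k: v / total for k, v in scores.items()}
```

For example, `update_weights({"text": E_text, "image": E_image})` yields the ω̂_D and ω̂_I of the next iteration.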
    </sec>
    <sec id="sec-5">
      <title>Sequential approach with pre-filtering and re-ordering</title>
      <p>This section describes how to interact with both modalities sequentially, following the user's perceived semantics. Since a query can be represented by both keywords and visual features, it can be initiated either by a keyword-based search or by a visual example-image search. We consider a pre-filtering and re-ranking approach, in which the image search is performed on the filtered image set obtained beforehand by the textual search. It is more appropriate to perform the text-based search first, owing to its higher-level information content, and to use the visual-only search later to refine or re-rank the top images returned by the textual search. In this method, combining the results of the text-based and image-based retrieval amounts to re-ranking or re-ordering the images of a text-based pre-filtered result set. The steps involved in this approach are as follows:</p>
      <p>
        Step 1: Initially, for a multi-modal query q with a document part Dq, perform a textual search with the vector f_Dq and rank the images based on the ranking of the associated annotation files, by applying equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
      <p>Step 2: Obtain the user's feedback on the relevant and irrelevant images among the top K = 30 retrieved images for the textual query refinement.</p>
      <p>Step 3: Calculate the refined textual query vectors f_Dq^m(Rocchio), f_Dq^m(Ide), f_Dq^m(Local1), and f_Dq^m(Local2).</p>
      <p>
        Step 4: Re-submit the modified query vectors to the text engine and merge the results with equal weighting in the similarity matching of equation (
        <xref ref-type="bibr" rid="ref14">14</xref>
        ).
      </p>
      <p>
        Step 5: Continue steps 2 to 4 with the weights dynamically updated based on equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ), until the user switches to the visual-only search.
      </p>
      <p>Step 6: Extract the different features f_Iq^global, f_Iq^sg, f_Iq^local, and f_Iq^V-concept for the multi-modal query q with an image part Iq.</p>
      <p>
        Step 7: Perform the visual-only search in the top L = 1000 images retrieved by the text-based search and rank them based on the similarity values obtained by applying equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ) with equal feature weighting.
      </p>
      <p>Step 8: Obtain the user's feedback on the relevant images among the top K = 30 retrieved images and perform the visual query refinement f_Iq^m,x, where x ∈ {global, sg, V-concept}, at that particular iteration.</p>
      <p>
        Step 9: At the next iteration, calculate the feature weights based on equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ) and apply them in equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ) to obtain the ranked retrieval result.
      </p>
      <p>Step 10: Continue steps 8 and 9 until the user is satisfied or the system converges.</p>
      <p>The process flow diagram of the sequential search approach is shown in Figure 1. In this approach, the text-based search with query reformulation is performed first, as shown in the left portion of the figure, and the image-based search is then performed on the filtered image set, as shown in the right portion of Figure 1.</p>
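The pre-filter and re-rank skeleton of the steps above can be sketched as follows (a Python illustration of the ranking plumbing only; the actual text and image scorers are the similarity functions of equations (1) and (13)):

```python
def sequential_retrieval(text_scores, image_score, L=1000, K=30):
    """Sequential search sketch: rank by text similarity, keep the top L
    as the pre-filtered set, then re-rank that set by visual similarity
    only and return the top K. text_scores maps an image id to its text
    similarity; image_score is a function from an image id to its visual
    similarity."""
    ranked = sorted(text_scores, key=text_scores.get, reverse=True)
    filtered = ranked[:L]                               # text pre-filtering
    reranked = sorted(filtered, key=image_score, reverse=True)
    return reranked[:K]                                 # visual re-ordering
```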
    </sec>
    <sec id="sec-6">
      <title>Simultaneous approach with linear combination</title>
      <p>This section describes our simultaneous multi-modal search approach. Here, the textual and content-based searches are performed simultaneously from the beginning, and the results are combined with the adaptive linear combination scheme described in Section 4. The steps involved in this approach are as follows:</p>
      <p>Step 1: Initially, for a multi-modal query q with a document part Dq and an image part Iq, extract the textual query vector f_Dq and the different image feature vectors f_Iq^global, f_Iq^sg, f_Iq^local, and f_Iq^V-concept.</p>
      <p>
        Step 2: Perform a multi-modal search to rank the images based on equation (
        <xref ref-type="bibr" rid="ref12">12</xref>
        ), where Sim_D(Dq, Dj) is initially computed as Sim_text(Dq, Dj) of equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and Sim_I(Iq, Ij) is computed with equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ), with initially equal inter- and intra-modality weights.
      </p>
      <p>Step 3: Obtain the user's feedback on the relevant and irrelevant images among the top K = 30 retrieved images, both for the textual and visual query refinements and for dynamically updating the weights.</p>
      <p>
        Step 4: Based on the feedback information, calculate the refined textual query vectors f_Dq^m(Rocchio), f_Dq^m(Ide), f_Dq^m(Local1), and f_Dq^m(Local2) and the image query vectors f_Iq^m,x, where x ∈ {global, sg, V-concept}, and update the inter- and intra-modality weights based on equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ).
      </p>
      <p>
        Step 5: Re-submit the modified textual and image query vectors to the system and apply the multi-modal similarity matching of equation (
        <xref ref-type="bibr" rid="ref16">16</xref>
        ), where Sim_D(Dq, Dj) is computed with equation (
        <xref ref-type="bibr" rid="ref14">14</xref>
        ) and Sim_I(Iq, Ij) is computed with equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ).
      </p>
      <p>Step 6: Repeat steps 3 to 5 until the user is satisfied or the system converges.</p>
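      <p>The weight update of equation (15) is not reproduced here, but one plausible performance-based scheme consistent with the description is to give each similarity function weight in proportion to how strongly it separates the relevant from the irrelevant feedback images; this sketch is an assumption of ours, not the paper's exact formula.</p>
      <preformat>
```python
def update_weights(score_fns, query, relevant, irrelevant, eps=1e-6):
    """Hypothetical reweighting sketch: each similarity function earns
    weight proportional to its mean score gap between the relevant and
    irrelevant feedback images; weights are normalized to sum to 1."""
    raw = []
    for f in score_fns:
        pos = sum(f(query, d) for d in relevant) / max(len(relevant), 1)
        neg = sum(f(query, d) for d in irrelevant) / max(len(irrelevant), 1)
        raw.append(max(pos - neg, eps))  # floor keeps every weight positive
    total = sum(raw)
    return [r / total for r in raw]
```
      </preformat>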
      <p>The process flow diagram of the above multi-modal simultaneous search approach is shown in
Figure 2. For this approach, both text and image-based searches are performed simultaneously, as
shown in the left and right portions of Figure 2.</p>
    </sec>
    <sec id="sec-6-1">
      <title>Analysis of the submitted runs</title>
      <p>The types and performances of the different runs are shown in Table 1 and Table 2 for the ad-hoc
retrieval of the photographic and medical collections respectively. In all these runs, only English
is used as the source and target language without any translation for the text-based retrieval
approach. We submitted ¯ve di®erent runs for the ad-hoc retrieval of the photographic collection,
where ¯rst two runs are based on text only search and last three runs are based on mixed modality
search as shown in Table 1. For the ¯rst run \CINDI-TXT-ENG-PHOTO", we performed only a
manual text-based search without any query expansion as our base run. This run achieved a MAP
score of 0.1529 and ranked 140th out of 476 submissions (e.g., within the top 30%). Our second
run \CINDI-TXT-QE-PHOTO" achieved the best MAP score (0.2637) among all our submitted
runs and ranked 21st for this year competition. In this run, we performed two iterations of manual
feedback for textual query expansion and combination based on dynamic weight update schemes
for text only retrieval as described in Sections 2 and 4. The rest of the runs are based on
multimodal approach, where in the third run \CINDI-TXT-QE-IMG-RF-RERANK", we performed
the sequential approach with pre-¯ltering and re-ordering as described in subsection 5 with two
iterations of manual feedback in both text and image-based searches. However, the re-ordering
approach did not improve the result as a whole (e.g., ranked 32nd) in terms of MAP score (0.2336)
as compared to the only textual query expansion approach of our best run. The main reason might
be due to the fact that the majority of the query topics are more semantically oriented, where
visual search is not suitable or feasible at all. However, this run might perform well where queries
have both textual and distinct visual properties, such as query topic number 15 as \night shots
of cathedrals" or query topic number 24 as \snowcapped building in Europe". For the fourth run
\CINDI-TXTIMG-FUSION-PHOTO", we performed a simultaneous retrieval approach without
any feedback information with a linear combination of weights as !D = 0:7 and !I = 0:3 and
for the ¯fth run \CINDI-TXTIMG-RF-PHOTO", two iterations of manual relevance feedback are
performed as described in Section 6. However, these two runs did not perform well in terms of
MAP score as compared to the sequential approach due to early combination and nature of the
queries as described earlier.</p>
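      <p>The MAP scores quoted in this section follow the standard trec-eval definition; for reference, average precision for a single query, and its mean over queries, can be computed as:</p>
      <preformat>
```python
def average_precision(ranked_ids, relevant_ids):
    """Sum of precision@k over the ranks k at which a relevant image
    appears, divided by the total number of relevant images."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / k
    return total / max(len(relevant_ids), 1)

def mean_average_precision(runs):
    """MAP: the mean of per-query average precision.
    `runs` maps each query id to a (ranking, relevant-set) pair."""
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
```
      </preformat>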
      <p>For the image retrieval task in the medical collections, we submitted seven runs this year.
However, due to a few errors (such as duplicate entries and reference images listed as 0.jpg in the result
sets), three of our runs could not produce a performance report when evaluated with the trec-eval
program. This is mainly due to directly using reference images from the annotation
files instead of using the provided link XML file. We are currently fixing this problem and will
later analyze and report the results of these runs. Table 2 shows the official results of the four
runs out of our seven submitted runs. In the first run, "CINDI-IMG-FUSION", we performed
a visual-only search based on the various image feature representation schemes described
in Section 3, without any feedback information and with a linear combination of equal feature
weights. For the second run, "CINDI-IMG-FUSION-RF", we performed one iteration of
manual feedback for visual query refinement and combined the similarity matching functions
based on the dynamic weight updating scheme. For this run we achieved a MAP score of 0.0372,
which is slightly better than the score (0.0333) achieved by the first run without any relevance
feedback information. However, compared to the text-based approaches, the performances
are very low, as was observed in previous years of ImageCLEFmed. For the third run,
"CINDI-TXT-IMAGE-LINEAR", we performed a simultaneous retrieval approach without any feedback
information, with a linear combination of weights ω_D = 0.7 and ω_I = 0.3, and for the fourth
run, "CINDI-TXT-IMG-RF-LINEAR", two iterations of manual relevance feedback were performed,
similar to the last two runs of the photographic retrieval task. From Table 2, it is clear that combining
both modalities for the medical retrieval task is far better than using only a single modality (e.g.,
only image); we achieved our best MAP score for this task, 0.1483, among all our submissions.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>This paper presents the ad-hoc image retrieval approaches of the CINDI research group for
ImageCLEF 2007. We submitted several runs with different combinations of methods, features, and
parameters, and investigated cross-modal interaction and fusion approaches for the retrieval
of the photographic and medical image collections. The descriptions of the runs and the analysis of
the results are discussed in this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grubinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Overview of the ImageCLEF 2007 Photographic Retrieval Task</article-title>
          ,
          <source>Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary, Sep.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , T. Deselaers, E. Kim,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kalpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jayashree</surname>
          </string-name>
          , M. Thomas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          , W. Hersh,
          <article-title>Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks</article-title>
          ,
          <source>Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary, Sep.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          , Modern Information Retrieval, Addison Wesley,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>Improving retrieval performance by relevance feedback</article-title>
          ,
          <source>Journal of the American Society for Information Science</source>
          , vol.
          <volume>41</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>288</fpage>–<lpage>297</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval</article-title>
          ,
          <source>IEEE Trans. Circuits Syst. Video Technol.</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>644</fpage>–<lpage>655</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.J.</given-names>
            <surname>Rocchio</surname>
          </string-name>
          ,
          <article-title>Relevance feedback in information retrieval</article-title>
          .
          <source>In The SMART Retrieval System - Experiments in Automatic Document Processing</source>
          , pp.
          <fpage>313</fpage>–<lpage>323</lpage>
          , Englewood Cliffs, NJ, Prentice Hall, Inc.
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ide</surname>
          </string-name>
          ,
          <article-title>New experiments in relevance feedback</article-title>
          ,
          <source>In The SMART retrieval system - Experiments in Automatic Document Processing</source>
          , pp.
          <fpage>337</fpage>–<lpage>354</lpage>
          .
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ogawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Morita</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>A fuzzy document retrieval system using the keyword connection matrix and a learning method</article-title>
          ,
          <source>Fuzzy Sets and Systems</source>
          , vol.
          <volume>39</volume>
          , pp.
          <fpage>163</fpage>–<lpage>179</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Subramanya</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          ,
          <article-title>MindReader: Querying Databases Through Multiple Examples</article-title>
          ,
          <source>Proc. 24th Internat. Conf. on Very Large Databases</source>
          , New York, pp.
          <fpage>24</fpage>–<lpage>27</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.A.</given-names>
            <surname>Fox</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          , Combination of Multiple Searches,
          <source>Proc. of the 2nd Text Retrieval Conference (TREC-2)</source>
          ,
          <source>NIST Special Publication 500-215</source>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Combining Multiple Evidence from Di®erent Properties of Weighting Schemes</article-title>
          ,
          <source>Proc. of the 18th Annual ACM-SIGIR</source>
          , pp.
          <fpage>180</fpage>–<lpage>188</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Attar</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Fraenkel</surname>
          </string-name>
          ,
          <article-title>Local feedback in full-text retrieval systems</article-title>
          ,
          <source>Journal of ACM</source>
          , vol.
          <volume>24</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>397</fpage>–<lpage>417</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Content-Based Image Retrieval at the End of the Early Years</article-title>
          ,
          <source>IEEE Trans. on Pattern Anal. and Machine Intell</source>
          ., vol.
          <volume>22</volume>
          , pp.
          <fpage>1349</fpage>–<lpage>1380</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.S.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Salembier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          , (eds.),
          <source>Introduction to MPEG-7: Multimedia Content Description Interface</source>
          , John Wiley &amp; Sons Ltd., pp.
          <fpage>187</fpage>–<lpage>212</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.M.</given-names>
            <surname>Haralick</surname>
          </string-name>
          ,
          <string-name>
            <surname>Shanmugam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Dinstein</surname>
          </string-name>
          ,
          <article-title>Textural features for image classification</article-title>
          ,
          <source>IEEE Trans System, Man, Cybernetics</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>610</fpage>–<lpage>621</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.C.</given-names>
            <surname>Desai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>A Feature Level Fusion in Similarity Matching to Content-Based Image Retrieval</article-title>
          ,
          <source>Proc. 9th Internat Conf Information Fusion</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohonen</surname>
          </string-name>
          ,
          <source>Self-Organizing Maps</source>
          , Springer-Verlag, Heidelberg. 2nd ed.
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukunaga</surname>
          </string-name>
          , Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>Regularized Discriminant Analysis</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          , vol.
          <volume>84</volume>
          , pp.
          <fpage>165</fpage>–<lpage>175</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>