=Paper=
{{Paper
|id=Vol-1172/CLEF2006wn-ImageCLEF-RahmanEt2006
|storemode=property
|title=CINDI at ImageCLEF 2006: Image Retrieval & Annotation Tasks for the General Photographic and Medical Image Collections
|pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-ImageCLEF-RahmanEt2006.pdf
|volume=Vol-1172
|dblpUrl=https://dblp.org/rec/conf/clef/RahmanSDB06a
}}
==CINDI at ImageCLEF 2006: Image Retrieval & Annotation Tasks for the General Photographic and Medical Image Collections==
CINDI at ImageCLEF 2006: Image Retrieval & Annotation Tasks for the General Photographic and Medical Image Collections

M. M. Rahman, Varun Sood, Bipin C. Desai, Prabir Bhattacharya
Dept. of Computer Science & Software Engineering, Concordia University
1455 de Maisonneuve Blvd., Montreal, QC, H3G 1M8, Canada
mah_rahm@cs.concordia.ca

Abstract

This paper presents the techniques used, the runs made and the results submitted by the CINDI group for the image retrieval and automatic annotation tasks of ImageCLEF 2006. For the ad-hoc image retrieval from both the photographic and medical image collections, we experimented with cross-modal (image and text) interaction and integration approaches based on relevance feedback, in the form of textual query expansion and visual query point movement with adaptive similarity matching functions. Experimental results show that these approaches perform well compared to initial visual-only or textual-only retrieval without any user interaction or feedback. Our runs ranked first and second and achieved the highest MAP score (0.3850) for the ad-hoc retrieval on the photographic collection (IAPR) among all submissions. For the automatic annotation tasks on both the medical (IRMA) and object (LTU) collections, we experimented with a classifier combination approach, in which several probabilistic multi-class SVM classifiers, each taking features at a different level of representation as input, are fused with several combination rules to predict the final probability score of each category as the image annotation. An analysis of the results of the different runs made for both the image retrieval and annotation tasks is reported in this paper.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.7 Digital Libraries; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Object Recognition

General Terms

Algorithms, Machine learning, Performance, Experimentation

Keywords

Content-based image retrieval, Vector space model, Feature extraction, Query expansion, Relevance feedback, Classification, Support vector machine

1 Introduction

For the 2006 ImageCLEF workshop, the CINDI research group participated in four tasks of the ImageCLEF track: ad-hoc retrieval from a photographic collection, ad-hoc retrieval from a medical collection, and automatic annotation of the medical and the object data sets [1, 2]. This paper presents the methodologies, results and analysis of the runs for each of these tasks.

2 Ad-hoc retrieval from photographic collection

Our main goal in the ad-hoc retrieval task is to investigate the effectiveness of combining text and image by involving the user in the retrieval loop through relevance feedback. We experimented with a cross-modal image retrieval approach that integrates visual information based on purely low-level image content with semantic information from the associated annotation text files. The advantages of both modalities are exploited by involving the user in the retrieval loop for cross-modal interaction and integration in similarity matching. For the text-based retrieval, keywords are extracted from the annotation files and indexed under the vector space model paradigm [3]. To perform query expansion on the textual search, additional keywords are extracted for the query based on the positive feedback from the user.
For the content-based search, a query point movement and an adjustment of the similarity matching functions are performed based on the estimated mean and covariance matrix of the feature vectors of the positive feedback images. Finally, a rank-ordered list of images is obtained by a pre-filtering approach which integrates the scores from both the text-based and image-based result lists.

2.1 Text retrieval approach

For the keyword-based search on the annotation text files, we utilized a simple but effective information retrieval (IR) tool by Raymond Mooney at the University of Texas at Austin [4]. We performed several modifications to the original library according to the experimental requirements, such as allowing recursive indexing of text files stored in directories, expanding the stop word list with several common words specific to the experimental domain, and modifying the term weighting scheme for the query expansion.

For the text-based indexing, keywords are extracted from all the associated annotation files, ignoring all the tags as stop words. The indexing technique is based on the popular vector space model (VSM) of IR [3]. In this model, texts and queries are represented as vectors in an N-dimensional space, where N is the number of keywords in the collection. Each document j can thus be represented as a vector

$D_j = \langle w_{1j}, \cdots, w_{Nj} \rangle$   (1)

The element $w_{ij}$ represents the weight of keyword $w_i$ in document j and can be computed in a variety of ways. One common scheme is term frequency-inverse document frequency (TF-IDF) weighting, in which both a global and a local weight are considered [3]. The global weight indicates the overall importance of a component across the whole collection, while the local weight indicates the relative importance of the component within its own vector. The local weight is $L_{i,j} = \log(f_{i,j}) + 1$, where $f_{i,j}$ is the frequency of occurrence of keyword $w_i$ in document j. The global weight is the inverse document frequency $G_i = \log(M/M_i) + 1$, for $i = 1, \cdots, N$, where $M_i$ is the number of documents in which $w_i$ is found and M is the total number of documents in the collection. Finally, the element $w_{ij}$ is the product of the local and global weights: $w_{ij} = L_{i,j} \cdot G_i$ [3].

The vector space model is based on the assumption that similar documents are represented by similar vectors in the N-dimensional space; in particular, similar documents are expected to have small angles between their corresponding vectors. Hence, the cosine similarity measure is adopted between the feature vectors of the query document q and a database document j [3]:

$S_{text}(q, j) = S_{text}(D_q, D_j) = \frac{\sum_{i=1}^{N} w_{iq} \, w_{ij}}{\sqrt{\sum_{i=1}^{N} (w_{iq})^2} \; \sqrt{\sum_{i=1}^{N} (w_{ij})^2}}$   (2)

where $D_q$ and $D_j$ are the query and document vectors respectively. An advantage of the VSM is that it produces a ranked list of the retrieved documents (and the associated images), which is useful when we fuse the results from the keyword-based and content-based image retrieval.
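As an illustration of the weighting scheme above, the following is a minimal Python sketch (not the authors' code) that builds the TF-IDF vectors of equation (1) and compares them with the cosine measure of equation (2); tokenization and vocabulary construction are assumed to be done elsewhere.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Build TF-IDF vectors: w_ij = (log(f_ij) + 1) * (log(M / M_i) + 1)."""
    M = len(docs)                       # total number of documents
    # M_i: number of documents in which keyword i occurs
    df = {term: sum(1 for d in docs if term in d) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # keyword frequencies f_ij
        vec = []
        for term in vocab:
            if tf[term] > 0 and df[term] > 0:
                local = math.log(tf[term]) + 1.0          # L_ij
                global_w = math.log(M / df[term]) + 1.0   # G_i
                vec.append(local * global_w)
            else:
                vec.append(0.0)
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of equation (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query is treated as one more (short) document: its expanded keyword list is weighted with the same scheme and compared against every annotation vector with `cosine`.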
2.2 Content-based image retrieval approach

The performance of a content-based image retrieval (CBIR) system depends on the underlying image representation, usually in the form of a feature vector [5]. Based on previous experiments [6], we have found that image features at different levels are complementary in nature and together can effectively distinguish images of different semantic categories. Hence, to generate the feature vectors, we extract low-level global, semi-global and region-specific local features, representing the images at different levels of abstraction.

2.2.1 Feature extraction and similarity matching

In this work, the MPEG-7 Edge Histogram Descriptor (EHD) and Color Layout Descriptor (CLD) are extracted for image representation at the global level [7]. To represent the global shape feature, the EHD captures the spatial distribution of edges. A histogram with 16 × 5 = 80 bins is obtained, corresponding to a feature vector $f^{EHD}$ of dimension 80 [7]. The CLD represents the spatial layout of the image in a very compact form [7]. It is obtained by applying the discrete cosine transform (DCT) to a 2-D array of local representative colors in the YCbCr color space. In this work, a CLD with 10 Y, 3 Cb and 3 Cr coefficients is extracted to form a 16-dimensional feature vector $f^{CLD}$. To compare the query image Q and a target image T in the database based on the global features, a weighted Euclidean distance measure is used:

$DIS_{global}(Q, T) = \omega_{CLD} D_{CLD}(Q, T) + \omega_{EHD} D_{EHD}(Q, T)$   (3)

where $D_{CLD}(Q, T) = \|f_Q^{CLD} - f_T^{CLD}\|_2$ and $D_{EHD}(Q, T) = \|f_Q^{EHD} - f_T^{EHD}\|_2$ are the Euclidean distances for the CLD and EHD feature vectors respectively, and $\omega_{CLD}$ and $\omega_{EHD}$ are weights subject to $\omega_{CLD} + \omega_{EHD} = 1$, set to $\omega_{CLD} = 0.4$ and $\omega_{EHD} = 0.6$ in the experiments.

For the semi-global feature vector, a simple grid-based approach divides each image into five overlapping sub-images [6]. Several moment-based color and texture features are extracted from each sub-image and then combined to form a semi-global feature vector. For the color features, the first (mean) and second (standard deviation) central moments of each channel in the HSV color space are extracted. Texture features are extracted from the grey-level co-occurrence matrix (GLCM) [8]. The GLCM is defined as a sample of the joint probability density of the gray levels of two pixels separated by a given displacement. Second-order measures, namely energy, maximum probability, entropy, contrast and inverse difference moment, are computed from the GLCM. The color and texture features are normalized and combined into a joint 11-dimensional vector (6 for color and 5 for texture) for each sub-region, yielding a 55-dimensional (5 × 11) semi-global feature vector $f^{SG}$. For the semi-global distance between Q and T, we again use the Euclidean distance:

$DIS_{semi\text{-}global}(Q, T) = \|f_Q^{SG} - f_T^{SG}\|_2$   (4)
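For concreteness, the sketch below shows one plausible way to compute the per-sub-image color and texture measures described above. The gray-level quantization (32 levels) and the pixel displacement used for the co-occurrence matrix are assumptions, since the paper does not specify them; inputs are NumPy arrays for one sub-image (HSV and grayscale).

```python
import numpy as np

def glcm(gray, dx=1, dy=0, levels=32):
    """Normalized gray-level co-occurrence matrix for one displacement (assumed settings)."""
    q = np.clip((gray.astype(np.float64) / 256.0 * levels).astype(int), 0, levels - 1)
    h, w = q.shape
    P = np.zeros((levels, levels))
    for y in range(h - dy):
        for x in range(w - dx):
            P[q[y, x], q[y + dy, x + dx]] += 1
    return P / P.sum()

def glcm_measures(P):
    """Energy, maximum probability, entropy, contrast, inverse difference moment."""
    i, j = np.indices(P.shape)
    eps = 1e-12
    return np.array([
        (P ** 2).sum(),                      # energy
        P.max(),                             # maximum probability
        -(P * np.log(P + eps)).sum(),        # entropy
        ((i - j) ** 2 * P).sum(),            # contrast
        (P / (1.0 + (i - j) ** 2)).sum(),    # inverse difference moment
    ])

def color_moments(hsv):
    """Mean and standard deviation of each HSV channel (6 values)."""
    flat = hsv.reshape(-1, 3).astype(np.float64)
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])

def subimage_feature(hsv, gray):
    """11-dimensional joint color-texture vector for one sub-image."""
    return np.concatenate([color_moments(hsv), glcm_measures(glcm(gray))])
```

Concatenating `subimage_feature` over the five overlapping sub-images yields the 55-dimensional vector $f^{SG}$ used in equation (4).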
We also considered a local, region-specific feature extraction approach in which an image is automatically fragmented into a set of homogeneous regions by a fast k-means clustering technique. To represent each region with local features, we consider its weight (i.e., number of pixels) and color-texture information as in [6]. The color feature $f^c_{R_i}$ of region $R_i$ is a 3-D vector given by the k-means cluster center, i.e., the average value of each of the three HSV color channels over all pixels in the region. The texture of each region is measured indirectly through the cross-correlation among the color channels, given by the off-diagonal elements of the 3 × 3 covariance matrix of region $R_i$. To compute the distance between two regions $R_i$ and $R_j$ of Q and T respectively, we apply the Bhattacharyya distance metric [9]:

$D(R_i, R_j) = \frac{1}{8} (f^c_{R_i} - f^c_{R_j})^T \left[ \frac{C_{Q_{R_i}} + C_{T_{R_j}}}{2} \right]^{-1} (f^c_{R_i} - f^c_{R_j}) + \frac{1}{2} \ln \frac{\left| \frac{C_{Q_{R_i}} + C_{T_{R_j}}}{2} \right|}{\sqrt{|C_{Q_{R_i}}| \, |C_{T_{R_j}}|}}$   (5)

where $f^c_{R_i}$ and $f^c_{R_j}$ are the region feature vectors, and $C_{Q_{R_i}}$ and $C_{T_{R_j}}$ are the covariance matrices of regions $R_i$ and $R_j$ of the query image Q and target image T respectively. Finally, the image-level distance between Q and T is measured as

$DIS_{local}(Q, T) = \frac{\sum_{i=1}^{M} w_{Q_{R_i}} R_i(T) + \sum_{j=1}^{N} w_{T_{R_j}} R_j(Q)}{2}$   (6)

where $w_{Q_{R_i}}$ and $w_{T_{R_j}}$ are the weights of region i of image Q and region j of image T respectively. For each region $i \in M$ in Q, $R_i(T)$ is defined as the minimum distance between this region and any region $j \in N$ in image T; $R_j(Q)$ is computed analogously.

The overall image-level similarity is measured as a weighted combination of the individual similarity measures. Once the distance functions above are computed, they are normalized and converted to similarity measures (in general, the converse of a distance function). After the similarity measures of each representation, $S_{global}(Q, T)$, $S_{semi\text{-}global}(Q, T)$ and $S_{local}(Q, T)$, are determined, we fuse them into a single similarity matching function:

$S_{image}(Q, T) = w_g S_{global}(Q, T) + w_{sg} S_{semi\text{-}global}(Q, T) + w_l S_{local}(Q, T)$   (7)

Here $w_g$, $w_{sg}$ and $w_l$ are non-negative weights for the different feature-level similarities, normalized so that $w_g + w_{sg} + w_l = 1$. For the retrieval experiments they are set to $w_g = 0.4$, $w_{sg} = 0.3$ and $w_l = 0.3$.
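A minimal sketch of the region-level matching of equations (5) and (6) follows, assuming each region has already been summarized by a (weight, 3-D HSV mean, 3 × 3 covariance) triple; the function and variable names are illustrative only, not taken from the paper.

```python
import numpy as np

def bhattacharyya(mu_q, cov_q, mu_t, cov_t):
    """Region-to-region Bhattacharyya distance of equation (5)."""
    d = mu_q - mu_t
    c = (cov_q + cov_t) / 2.0
    term1 = 0.125 * d @ np.linalg.inv(c) @ d
    term2 = 0.5 * np.log(np.linalg.det(c) /
                         np.sqrt(np.linalg.det(cov_q) * np.linalg.det(cov_t)))
    return term1 + term2

def local_distance(regions_q, regions_t):
    """Image-level distance of equation (6).

    Each region is a (weight, mean, covariance) triple; the weights of one image sum to 1.
    """
    def one_way(src, dst):
        total = 0.0
        for w, mu, cov in src:
            # R_i(T): minimum distance to any region of the other image
            total += w * min(bhattacharyya(mu, cov, mu2, cov2) for _, mu2, cov2 in dst)
        return total
    return 0.5 * (one_way(regions_q, regions_t) + one_way(regions_t, regions_q))

def fuse_similarity(s_global, s_semi, s_local, w=(0.4, 0.3, 0.3)):
    """Weighted fusion of equation (7) with the weights reported in the paper."""
    return w[0] * s_global + w[1] * s_semi + w[2] * s_local
```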
2.3 Cross-modal interaction with relevance feedback

In a practical multi-modal image retrieval system, the user might at first want to search with keywords, as this is more convenient and semantically more expressive. However, a short query (e.g., a query topic) with few keywords might not be enough to convey the user-perceived semantics to the retrieval system. Hence, a query expansion process is required to add keywords and to modify the weights of the keywords in the original query vector. In this paper, a simple query expansion approach is used, based on identifying useful terms or keywords from the annotation files associated with the images.

The textual query expansion based on relevance feedback (RF) works as follows. The user provides the initial query topic and the system extracts from it a set of keywords as the initial textual query vector $D_{q(0)}$. This query vector is used to retrieve the K most similar images from the associated text documents based on the cosine similarity measure described in Section 2.1. If the user is not satisfied with the result, the system allows the user to select a set of relevant or positive images close to the semantics of the initial textual query topic. The system then extracts all the keywords from the annotation files associated with the positive feedback images. After extracting the additional keywords, the query vector is adjusted to $D_{q(i)}$ at iteration i by re-weighting its keywords with the TF-IDF scheme, and is re-submitted as the query for the next iteration. This process may continue for several iterations until the user is satisfied with the result.

However, since we have a multi-modal system, it would be unwise to perform query expansion using only one modality (e.g., only text). Visual features also play an important part in distinguishing different semantic and visual categories, so we also perform RF in the content-based image search for better precision [11]. In this scenario, as with textual query expansion, the user provides the initial image query vector $f_{Q(0)}$ to retrieve the K most similar images based on the similarity function of equation (7). In the next iteration (from either the textual or the image-based feedback), the user selects a set of images relevant to the initial query image. We assume that all the positive feedback images $Pos(f_{Q(i)})$ at a particular iteration i belong to the user-perceived semantic category and follow a Gaussian distribution, forming a cluster in the feature space. Let $N_{Pos}$ be the number of positive feedback images at iteration i and let $f_{T_j} \in \mathbb{R}^d$ be the feature vector of the j-th positive image, $j \in \{1, \cdots, N_{Pos}\}$. The new query point at iteration i is estimated as the mean vector of the positive images, $f_{Q(i)} = \frac{1}{N_{Pos}} \sum_{j=1}^{N_{Pos}} f_{T_j}$, and the covariance matrix is estimated as $C_{(i)} = \frac{1}{N_{Pos} - 1} \sum_{j=1}^{N_{Pos}} (f_{T_j} - f_{Q(i)})(f_{T_j} - f_{Q(i)})^T$. A singularity issue arises in the covariance estimation when fewer than d + 1 training samples or positive images are available, as is typically the case with user feedback images. We therefore add regularization to avoid singular matrices [12]:

$\hat{C}_{(i)} = \alpha C_{(i)} + (1 - \alpha) I$   (8)

for some $0 \le \alpha \le 1$, where I is the d × d identity matrix. After generating the mean vector and covariance matrix of the positive images, we adaptively adjust the Euclidean distance measures of the various feature representations with the following Mahalanobis distance measure [9]:

$DIS_{Maha}(Q, T) = (f_{Q(i)} - f_T)^T \hat{C}_{(i)}^{-1} (f_{Q(i)} - f_T)$   (9)

Here $f_T$ denotes the feature vector of a target database image T for a given image representation (e.g., global or semi-global). The Mahalanobis distance differs from the Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e., it does not depend on the scale of the measurements; if the covariance matrix is the identity matrix, it reduces to the Euclidean distance [9]. At each iteration of relevance feedback, we therefore generate a mean vector and covariance matrix for each representation separately and use them in the corresponding distance measures. Finally, we obtain a ranked retrieval result by applying the fusion-based similarity function of equation (7). The above relevance feedback approach thus performs query point movement and similarity matching adjustment at the same time.
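The feedback update can be summarized by the short sketch below (an illustration, not the authors' implementation); the regularization parameter `alpha` is a hypothetical setting, since the paper does not report the value used.

```python
import numpy as np

def feedback_update(positive_feats, alpha=0.8):
    """Query point movement and regularized covariance (equation (8)).

    positive_feats: (N_pos, d) array of feature vectors of the positive images.
    alpha is a hypothetical value; the paper leaves it unspecified.
    """
    q = positive_feats.mean(axis=0)                     # new query point f_Q(i)
    d = positive_feats.shape[1]
    if positive_feats.shape[0] > 1:
        c = np.cov(positive_feats, rowvar=False)        # sample covariance C_(i)
    else:
        c = np.zeros((d, d))                            # single positive image
    c_hat = alpha * c + (1.0 - alpha) * np.eye(d)       # regularization, eq. (8)
    return q, np.linalg.inv(c_hat)

def mahalanobis(query, target, inv_cov):
    """Adaptive distance of equation (9)."""
    diff = query - target
    return diff @ inv_cov @ diff
```

One such (mean, inverse covariance) pair is maintained per feature representation, and the resulting distances are normalized and fused exactly as in equation (7).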
2.4 Integration of the textual and visual results

We consider a pre-filtering and merging approach based on the text and image result lists obtained by the text retrieval after query expansion and by the image retrieval with the adaptive distance measure of equation (9) after relevance feedback. In this multi-modal integration approach, combining the results of the text-based and image-based retrieval amounts to re-ranking the images by a weighted combination of the scores from both modalities. Instead of purely merging the results, we first perform a pre-filtering step with the text query, since it tends to generate results closer to the user-perceived semantics. The steps of the proposed interaction and integration approach are as follows (Figure 1 shows the process flow diagram):

Step 1: Perform an initial text-based search with a query vector $D_{q(0)}$ for a query topic q(0) at iteration i = 0 and rank the associated images according to the ranking of the text (annotation) documents, using $S_{text}$ of equation (2).

Step 2: Consider the top K = 30 most similar images in the retrieval interface and obtain user feedback about positive or relevant images (i.e., their associated annotation files) for the textual query expansion.

Step 3: Resubmit the modified query vector $D_{q(i)}$ with re-weighted keywords at iteration i. Continue the iterations, incrementing i, until the user is satisfied or the system converges.

Step 4: Perform a visual-only search on the result list of the first L = 2000 images obtained from Step 3, using the initial query image Q(0).

Step 5: Obtain user feedback on the relevant images and perform the image-only search with the new query Q(i) at iteration i, using equations (9) and (7). Continue the iterations, incrementing i, until the user is satisfied or the system converges.

Step 6: Aggregate the text-based and image-based scores by fusing the similarity measures (see the sketch after this list) as

$S(Q, T) = w_{text} S_{text}(\cdot, \cdot) + w_{image} S_{image}(\cdot, \cdot)$   (10)

where $w_{text} = 0.7$ and $w_{image} = 0.3$ are selected for the experiment.

Step 7: Finally, rank the images in descending order of the fused similarity values and return the top 1000 images.

Figure 1: Process flow diagram of the integration approach.
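To make the pipeline concrete, the following sketch outlines the pre-filtering and score fusion of Steps 4–7, assuming normalized per-modality similarity scores are already available; all helper names and data structures are illustrative and not taken from the paper.

```python
def fuse_and_rank(text_scores, image_score_fn, query_image,
                  w_text=0.7, w_image=0.3, prefilter=2000, top_k=1000):
    """Pre-filter by the text ranking, then fuse text and image scores (equation (10)).

    text_scores:    {image_id: normalized text similarity} from the expanded text query
    image_score_fn: callable(query_image, image_id) -> normalized visual similarity
    """
    # Step 4: restrict the visual search to the top-L text results
    candidates = sorted(text_scores, key=text_scores.get, reverse=True)[:prefilter]
    fused = {}
    for img in candidates:
        s_img = image_score_fn(query_image, img)                  # S_image(Q, T)
        fused[img] = w_text * text_scores[img] + w_image * s_img  # equation (10)
    # Step 7: rank by fused similarity and keep the top results
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```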
2.5 Analysis of the results

We submitted three runs for the ad-hoc retrieval on the IAPR collection, as shown in Table 1. In all these runs, English queries and example images are used as the initial source queries. In the first run, "Cindi-Text-Eng", we performed only the automatic text-based search without any feedback, as our baseline. For the second and third runs, "Cindi-TXT-EXP" and "Cindi-Exp-RF", we performed manual feedback in the text-only modality and in the combination of the text and image modalities (with only one or two iterations per modality), as discussed in the previous sections.

Table 1: Results of the ImageCLEFphoto retrieval task

  Run ID           Language   Modality     A/M        FB       QE    MAP
  Cindi-Text-Eng   Eng        Text         Automatic  Without  No    0.1995
  Cindi-TXT-EXP    Eng        Text         Manual     With     Yes   0.3749
  Cindi-Exp-RF     Eng        Text+Image   Manual     With     Yes   0.3850

From Table 1 it is clear that the MAP scores almost double in both cases with feedback, and that the integration of text and image achieves the best performance. In fact, these two runs ranked first and second in terms of MAP among the 157 submissions to the photographic retrieval task. Our group performed manual submissions using relevance judgements from the user, which, together with the integration of both modalities, is likely the reason for these good results.

3 Ad-hoc retrieval from medical image collections

For the ad-hoc image retrieval task on the medical collections (the Casimage, PEIR, MIR and PathoPic data sets), we experimented with a cross-modal approach similar to the one used for photographic retrieval. However, for the text-based indexing and search we utilized the Lucene search engine [10], an open source project of the Apache Software Foundation. We also used a different query expansion and merging algorithm to obtain the text-based result list, which is finally merged with the image-based result list to obtain the final ranking, applying a weighting scheme similar to the one discussed in Section 2.4 for photographic retrieval.

To use the textual information for image retrieval in the medical collections, each image has to be attached to at least one (possibly empty) text document. The text-based indexing starts by extracting keywords from the XML documents, which are parsed with the Xerces2 Java Parser, another open source Apache project. Every element of an XML document is indexed as a separate field in Lucene; separate fields make it easy to search the contents by specific criteria or simply to search over all indexed elements. Before indexing, stop words (including additional domain-specific stop words we added to the list) are removed from the element descriptions. Once the index has been created, keyword-based searching is performed through the Lucene API.

For the content-based indexing, we use the same approach as described in Section 2 for the photographic collection, but extract a low-resolution, scale-specific image feature in addition to the global, semi-global and local region-specific features. Since images in the different medical collections vary in size, resizing them into a thumbnail of fixed size may reduce some of the noise due to artifacts present in the images, although it may introduce distortion; such representations are widely used in face and fingerprint recognition and have proven effective. For the scale-specific feature vector $f^{Scaled}$, each image is converted to a gray-level image and down-scaled to 64 × 64 pixels regardless of the original aspect ratio. The down-scaled image is then partitioned with a 16 × 16 grid into small blocks of 4 × 4 pixels, and the average gray value of each block is concatenated to form a 256-dimensional feature vector. By averaging the gray values within each block, the feature can cope with global or local image deformations to some extent and gains robustness to translations and intensity changes. We again use the Euclidean distance to compare Q and T on $f^{Scaled}$, and the fusion-based similarity function is slightly adjusted to account for the added scale-specific feature:

$S_{image}(Q, T) = w_g S_{global}(Q, T) + w_{sc} S_{scaled}(Q, T) + w_{sg} S_{semi\text{-}global}(Q, T) + w_l S_{local}(Q, T)$   (11)

For the medical retrieval experiments, the weights are set to $w_g = 0.4$, $w_{sc} = 0.2$, $w_{sg} = 0.25$ and $w_l = 0.15$.
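A minimal sketch of the scale-specific feature described above follows, assuming the images can be loaded and resized with PIL; the block averaging is done with a simple NumPy reshape.

```python
import numpy as np
from PIL import Image

def scaled_feature(path):
    """256-dimensional scale-specific feature f^Scaled.

    The image is converted to gray level, down-scaled to 64x64 (ignoring the
    original aspect ratio) and divided into a 16x16 grid of 4x4-pixel blocks;
    the mean gray value of each block forms one component of the vector.
    """
    gray = np.asarray(Image.open(path).convert("L").resize((64, 64)),
                      dtype=np.float64)
    # Reshape into a 16x16 grid of 4x4 blocks and average each block
    blocks = gray.reshape(16, 4, 16, 4).swapaxes(1, 2)   # shape (16, 16, 4, 4)
    return blocks.mean(axis=(2, 3)).ravel()              # 256 values
```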
3.1 Query expansion and integration of the results

A search can start either by entering some text (a query topic) in the text field or by providing a query image (i.e., "query-by-example") to the system. If the user starts a keyword-based search, the system searches the index, finds the XML documents in which those keywords occur based on the similarity matching of the query and document vectors, and finally retrieves the images corresponding to those XML documents. The resulting images are displayed sorted in descending order of the similarity scores of the associated XML documents. If the user starts the search with the visual approach (i.e., "query-by-example"), the various low-level image features are computed on-line and the resulting images are displayed sorted by the similarity score obtained from equation (11).

After the initial search, the user can make use of the relevance feedback system, which can work on both the text and image modalities simultaneously to refine the results in subsequent passes, as discussed in Section 2.3. After obtaining the initial results, the user selects relevant images as positive feedback, indicating the kind of images sought. The system then runs two separate queries, on the text-based and image-based subsystems, for the selected feedback images.

For the query expansion in the text-based subsystem, the system finds the XML documents corresponding to the positive feedback images and extracts the top n most frequent keywords from each of them. In the next RF iteration, a separate new query is submitted for the keywords found in each document, which results in m different result lists, where m equals the number of documents sent as positive feedback. These m lists are then merged into a single list and displayed as the text-based result (Figure 2). The merging is based on the assumption that an image occurring in many of the lists should have a higher rank or priority than an image occurring in few of them, so we upgrade the rank of such an image by increasing its average similarity score according to how many lists contain it. For example, with 10 result lists, an image present in 8 of them receives more weight than an image occurring in 3. More precisely, if an image is present in all the lists, a boost of 0.3 is added to its score; if it occurs in 50% or more of the lists, the boost is 0.2; below 50%, the boost is 0.1. Images that occur in only one list receive no boost but are added to the final result with their original score. The list is then sorted by the new scores and displayed as the final text-based result list, as shown in Figure 2. The different boosting values were selected by experimentation on a small sample database, where they gave better results. (A sketch of this merging step is given at the end of this subsection.)

Figure 2: Query expansion and merging approach.

When the content-based subsystem receives the list of positive images as relevance feedback, we perform the same query point movement and similarity measure adjustment as described in Section 2.3 to return a new image-based result list over the text-based pre-filtered images. Once we have the separate result lists (one from the text-based subsystem and one from the image-based subsystem), we merge them using the same weighting scheme as described in Section 2.4 for the photographic image retrieval.
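The sketch below illustrates the frequency-based boosting described above, assuming each of the m per-document searches returns a dictionary of image scores; the boundary between "fewer than 50% of the lists" and "only one list" is resolved as in the text.

```python
def merge_text_lists(result_lists):
    """Merge the m per-document result lists with frequency-based boosting.

    result_lists: list of {image_id: score} dictionaries, one per positive document.
    An image found in every list gets +0.3, in at least half of the lists +0.2,
    in more than one but fewer than half +0.1, and in a single list no boost.
    """
    m = len(result_lists)
    counts, score_sums = {}, {}
    for lst in result_lists:
        for img, score in lst.items():
            counts[img] = counts.get(img, 0) + 1
            score_sums[img] = score_sums.get(img, 0.0) + score

    merged = {}
    for img, n in counts.items():
        avg = score_sums[img] / n          # average similarity score over the lists
        if n == m:
            boost = 0.3
        elif n >= 0.5 * m:
            boost = 0.2
        elif n > 1:
            boost = 0.1
        else:
            boost = 0.0                    # occurs in only one list: original score
        merged[img] = avg + boost
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```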
3.2 Analysis of the results

We submitted three runs for the ad-hoc medical retrieval, as shown in Table 2. In all these runs, English queries and example images are used as the initial source queries. In the first run, "CINDI-Fusion-Visual", we performed only the automatic visual search without any feedback. Our group ranked first in this run category (automatic + visual) based on the MAP score (0.0753), out of five different groups and 11 submissions. For the second run, "CINDI-Visual-RF", we performed manual feedback in the image-only modality; only our group participated in this category (visual-only run with RF) this year, and the run achieved a better MAP score (0.0957) than the run without RF, as shown in Table 2. For the third run, "CINDI-Text-Visual-RF", we performed manual feedback in both modalities and merged the result lists as discussed in the previous section. In this category (mixed with RF) we achieved a moderate MAP score of 0.1513. From these scores it is clear that combining both modalities is far better than using a single modality (e.g., only images).

Table 2: Results of the medical retrieval task

  Run ID                 Topic      System   MAP     R-prec   B-pref
  CINDI-Fusion-Visual    Automatic  Visual   0.0753  0.1311   0.166
  CINDI-Visual-RF        Feedback   Visual   0.0957  0.1347   0.1796
  CINDI-Text-Visual-RF   Feedback   Mixed    0.1513  0.1969   0.2397

4 Automatic annotation tasks

The aim of the automatic annotation task is to compare state-of-the-art approaches to image classification and annotation and to quantify their improvements for image retrieval. We investigate a supervised learning-based approach that associates low-level image features with high-level semantic categories for the categorization or annotation of the medical (IRMA) and object (LTU) data sets. Specifically, we explore a classifier combination approach with several probabilistic multi-class support vector machine (SVM) classifiers. Instead of using a single integrated feature vector, we use the features at different levels of image representation as inputs to separate SVM classifiers and apply several classifier combination rules to predict the final image category, as well as the probability or membership score of each category, as the image annotation.

4.1 Probabilistic multi-class SVM with pairwise coupling

The SVM is a machine learning technique that has already been used successfully for image retrieval and classification [13]. It performs classification between two classes by finding a decision surface based on the most informative points of the training set; briefly, the SVM constructs a decision surface between the samples of the two classes that maximizes the margin between them. The SVM was originally designed for binary classification, and a number of methods have been proposed to extend it to the multi-class problem of separating L mutually exclusive classes, essentially by solving many two-class problems and combining their predictions in various ways [14]. In our experiments, we use a multi-class classification method that combines all pairwise comparisons of binary SVM classifiers, known as one-against-one or pairwise coupling (PWC) [14]. PWC constructs a binary SVM for every possible pair of classes, i.e., L(L − 1)/2 binary classifiers, each of which provides a partial decision for classifying a data point. During the testing of a feature vector f, each of the L(L − 1)/2 classifiers votes for one class, and the winning class is the one with the largest number of accumulated votes. Although the voting procedure requires only pairwise decisions, it predicts just a class label [16]. To annotate or represent each image with a category-specific confidence score, however, probability estimation is required. In our experiments, the probability estimation approach of [14] for multi-class classification by PWC is utilized. In this context, given the observation or feature vector f, the goal is to estimate the posterior probabilities

$p_k = P(y = k \mid f), \quad k = 1, \cdots, L$   (12)
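The paper trains its probabilistic multi-class SVMs with LIBSVM [16]; the sketch below shows an equivalent setup through scikit-learn's SVC wrapper around LIBSVM (a tooling assumption on our part), which builds the L(L − 1)/2 one-against-one classifiers and returns the pairwise-coupling posterior of equation (12). The `gamma` and `C` values are placeholders; the paper selects them by 5-fold cross-validation.

```python
import numpy as np
from sklearn.svm import SVC

def train_probabilistic_svm(X_train, y_train, gamma=0.01, C=10.0):
    """One-against-one RBF SVM with pairwise-coupling probability estimates.

    X_train: (n_samples, d) feature vectors for one representation (e.g., EHD).
    y_train: integer class labels in {1, ..., L}.
    gamma, C: placeholder values; the paper tunes them by 5-fold cross-validation.
    """
    clf = SVC(kernel="rbf", gamma=gamma, C=C, probability=True)
    clf.fit(X_train, y_train)
    return clf

def posterior(clf, f):
    """p_k = P(y = k | f) for a single feature vector f (equation (12))."""
    return clf.predict_proba(f.reshape(1, -1))[0]   # ordered by clf.classes_
```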
4.2 Multiple SVM classifier combination

The development of systems based on a combination of multiple experts or classifiers has received increasing attention and has been a popular research topic. In general, classifier combination uses instances of classifiers with different structures trained on distinct feature spaces [15]. Feature descriptors at different levels of image representation take diverse forms and are often complementary in nature, so it is unwise to concatenate them into a single feature vector as the input of a single classifier. Instead, multiple classifiers are needed to handle the different features, which raises the general problem of how to combine classifiers with different features so as to improve performance. In our experiments, we consider expert combination strategies for SVM classifiers with different low-level features as inputs, based on three popular classifier combination rules: the sum, product and max rules [15]. Since the classifier outputs are to be used in combination, the confidence or membership scores of the probabilistic SVMs, in the range [0, 1] for each category, serve this purpose. In these combination rules, the a priori probabilities are assumed to be equal and the decision is made with the following formula in terms of the a posteriori probabilities produced by the respective classifiers:

$P^{combine}(y = k \mid f) = \frac{P_k^{combine}}{\sum_{k=1}^{L} P_k^{combine}}, \quad k = 1, \cdots, L$   (13)

Here $P_k^{combine}$ is the combined output of the classifiers for the likelihood that the sample vector f belongs to category $k \in L$; it is obtained with one of the following combination rules [15]. The product rule assumes that the representations used are conditionally statistically independent and combines the R experts or classifiers as

$P_k^{combine} = \prod_{m=1}^{R} P(y = k \mid f^m)$   (14)

where $P(y = k \mid f^m)$ denotes the posterior probability of class k given the input $f^m$ of classifier m. Similarly, the sum and max rules are defined as

$P_k^{combine} = \sum_{m=1}^{R} P(y = k \mid f^m)$   (15)

$P_k^{combine} = \max_{m=1}^{R} P(y = k \mid f^m)$   (16)

The sum rule is developed under stricter assumptions than the product rule: in addition to the conditional independence assumed by the product rule, the sum rule assumes that the posterior probabilities do not deviate significantly from the a priori probabilities [15]. The multi-class SVM classifiers, acting as experts on the different feature descriptors described in Section 2.2 for retrieval, are combined with the above rules; the image is finally assigned to the category with the highest combined probability and annotated with the probability or membership scores, as shown in the process diagram of Figure 3.

Figure 3: Block diagram of the classifier combination process.
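Once the per-classifier posteriors are stacked into an array, the three combination rules reduce to a few lines of NumPy; the sketch below is a generic illustration of equations (13)-(16), not the authors' code.

```python
import numpy as np

def combine_posteriors(posteriors, rule="product"):
    """Combine per-classifier posteriors with the product, sum or max rule.

    posteriors: (R, L) array; row m holds P(y = k | f^m) for classifier m.
    Returns the normalized combined distribution of equation (13).
    """
    if rule == "product":
        combined = posteriors.prod(axis=0)        # equation (14)
    elif rule == "sum":
        combined = posteriors.sum(axis=0)         # equation (15)
    elif rule == "max":
        combined = posteriors.max(axis=0)         # equation (16)
    else:
        raise ValueError("unknown rule: %s" % rule)
    return combined / combined.sum()              # normalization of equation (13)

# The predicted category is the argmax of the combined distribution, and the
# distribution itself serves as the per-category membership scores (annotation).
```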
4.3 Analysis and results of the runs

To perform the SVM-based classification, we utilize the LIBSVM software package [16]. For the training on both data sets, RBF kernel functions are used, with the kernel parameter γ and the cost parameter C determined experimentally by 5-fold cross-validation (CV).

We submitted four runs for the object annotation task, as shown in Table 3. The first three of these runs use the proposed multi-class SVM and classifier combination approach with different feature inputs, and the last run, "Cindi-Fusion-Knn", uses a K-NN (K = 9) classifier with the fusion-based similarity matching function of equation (7). Our best run in this task, "Cindi-SVM-Product", ranked third among all submissions, although the accuracy remains low due to the general complexity of the data set.

Table 3: Performance on the object annotation task (LTU data set)

  Run ID              Feature                           Method          Error rate (%)
  Cindi-SVM-Product   CLD+EHD+Semi-global               SVM (Product)   83.2
  Cindi-SVM-SUM       CLD+EHD+Semi-global               SVM (Sum)       85.2
  Cindi-SVM-EHD       EHD                               SVM             85.0
  Cindi-Fusion-Knn    CLD+EHD+Semi-global+Local         K-NN            87.1

For the medical annotation task, we submitted five runs, as shown in Table 4. The first four of these runs use the proposed multi-class SVM and classifier combination approach, and the last run, "cindi-fusion-KNN9", uses a K-NN (K = 9) classifier with the fusion-based similarity matching function of equation (11). Our best run in this task, "cindi-svm-sum", ranked 13th among all submissions and 6th among all groups.

Table 4: Performance on the medical annotation task (IRMA data set)

  Run ID             Feature                             Method          Error rate (%)
  cindi-svm-product  CLD+EHD+Scaled+Semi-global          SVM (Product)   24.8
  cindi-svm-sum      CLD+EHD+Scaled+Semi-global          SVM (Sum)       24.1
  cindi-svm-max      CLD+EHD+Scaled+Semi-global          SVM (Max)       26.1
  cindi-svm-ehd      EHD                                 SVM             25.5
  cindi-fusion-KNN9  CLD+EHD+Scaled+Semi-global+Local    K-NN            25.6

5 Conclusion

This report has described the image retrieval and annotation approaches of the CINDI research group for ImageCLEF 2006. We participated in all four sub-tasks and submitted several runs with different combinations of methods, features and parameters. We experimented with a cross-modal interaction and integration approach for the retrieval from the photographic and medical image collections, and with a supervised classifier combination approach for the automatic annotation of the object and medical data sets. The analysis and the results of the runs have been discussed in this paper.

References

[1] P. Clough, M. Grubinger, T. Deselaers, A. Hanbury, H. Müller, "Overview of the ImageCLEF 2006 photo retrieval and object annotation tasks," CLEF working notes, Alicante, Spain, Sep. 2006.

[2] H. Müller, T. Deselaers, T. Lehmann, P. Clough, W. Hersh, "Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks," CLEF working notes, Alicante, Spain, Sep. 2006.

[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[4] R. J. Mooney, "Intelligent Information Retrieval and Web Search," online courseware, University of Texas at Austin, USA, available at http://www.cs.utexas.edu/users/mooney/ir-course/

[5] A. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. on Pattern Anal. and Machine Intell., vol. 22, pp. 1349–1380, 2000.

[6] M. M. Rahman, B. C. Desai, P. Bhattacharya, "A Feature Level Fusion in Similarity Matching to Content-Based Image Retrieval," Proc. 9th International Conference on Information Fusion, 2006.

[7] B. S. Manjunath, P. Salembier, T. Sikora (eds.), Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons Ltd., pp. 187–212, 2002.
[8] S. Aksoy, R. M. Haralick, "Using Texture in Image Similarity and Retrieval," in Texture Analysis in Machine Vision, Series on Machine Perception and Artificial Intelligence, World Scientific, 2000.

[9] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1990.

[10] Lucene search engine, available at http://lucene.apache.org/java/docs/

[11] Y. Rui, T. S. Huang, "Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, 1998.

[12] J. Friedman, "Regularized Discriminant Analysis," Journal of the American Statistical Association, vol. 84, pp. 165–175, 1989.

[13] V. Vapnik, Statistical Learning Theory, New York, NY, Wiley, 1998.

[14] T. F. Wu, C. J. Lin, R. C. Weng, "Probability Estimates for Multi-class Classification by Pairwise Coupling," Journal of Machine Learning Research, vol. 5, pp. 975–1005, 2004.

[15] J. Kittler, M. Hatef, R. P. W. Duin, J. Matas, "On Combining Classifiers," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 3, pp. 226–239, 1998.

[16] C. C. Chang, C. J. Lin, "LIBSVM: a library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.