DEMIR at ImageCLEFMed 2011: Evaluation of Fusion Techniques for Multimodal Content-based Medical Image Retrieval Adil Alpkocak, Okan Ozturkmenoglu, Tolga Berber, Ali Hosseinzadeh Vahid and Roghaiyeh Gachpaz Hamed Dokuz Eylul University Department of Computer Engineering DEMIR Dokuz Eylul Multimedia Information Retrieval Research Group Tinaztepe, 35160 Izmir, Turkey alpkocak@cs.deu.edu.tr, okan.ozturkmenoglu@deu.edu.tr, tberber@cs.deu.edu.tr, {ali_h_vahid, ramisa_84}@yahoo.com Abstract. This paper present the details of participation of DEMIR (Dokuz Eylul University Multimedia Information Retrieval) research team to the context of our participation to the ImageCLEF 2011 Medical Retrieval task. This year, we evaluated fusion and re-ranking method which is based on the best low level feature of images with best text retrieval result. We improved results by examination of different weighting models for retrieved text data and low level features. We tested multi–modality image retrieval in ImageCLEF 2011 medical retrieval task and obtained the best seven ranks in mixed retrieval, which includes textual and visual modalities. The results clearly show that proper fusion of different modalities improve the overall retrieval performance. Keywords: Information Retrieval, Weighting-schemes, Re-ranking, Medical Imaging, Content-based Image Retrieval, Medical Image Retrieval. 1 Introduction In this paper we present the experiments performed by Dokuz Eylul University Multimedia Information Retrieval (DEMIR) Group, Turkey, in the context of our participation to the ImageCLEF 2011 Medical Image retrieval task [1]. The main focus of this work is to improve results by evaluation of different weighting models in text retrieval and then choose the best low-level feature of images for fusion with text only results. During the combination of text and low-level features, we check the variation of methods to gain the best result. On the other hand, we performed the experiments for narrowing down the data collection by defining and filtering out of irrelevant documents. Also we checked the weighted querying system performance in retrieval systems by weighting the special words in queries. 2 Adil Alpkocak, Okan Ozturkmenoglu, Tolga Berber, Ali Hosseinzadeh Vahid and Roghaiyeh Gachpaz Hamed Fig. 1. Basic block diagram of retrieval system. After analyze the visual and textual features set we used (Section 2), we describe the multimodal fusion techniques for multimodal information (Section 3). After we present experiments on ImageCLEF 2010 Medical and Wikipedia Retrieval tracks data (Section 4), then Section 5 concludes the paper by pointing out the open issues and possible avenues of further research in the area of multimodal re-ranking and fusion techniques for content-based image retrieval. 2 The Feature Set The data collection of ImageCLEF 2011 Medical retrieval has textual and visual information. Participants will be given a set of 30 textual queries with 2-3 sample images for each query. The queries will be classified into textual, visual and mixed, based on the methods that are expected to yield the best results.[1] We performed our experiments using ImageCLEF 2010 Medical and Wikipedia Retrieval track’s text and image data. We check the variation of retrieval methods on textual and visual information to gain the best result. 2.1 Textual Features Since the choice of the weighting model may crucially affect the performance of any information retrieval system, first of all we decided to work on evaluating the relative merits and drawbacks of different weighting models using Terrier IR Platform [2], open source search engine written in Java and is developed at the School of Computing Science, University of Glasgow. We performed our experiments on textual features using ImageCLEF 2010 Medical track collection. We started from a traditional bag-of-words representation of pre-processed texts that pre-processing includes stemming (Porter stemmer [3] for English) and stop words removal. DFR- BM25 model’s MAP score is not the best one, but the all weighting model’s number of relevant retrieved score results are close to each other and considering achievements of this model [11], we submitted our DEMIR at ImageCLEFMed 2011: Evaluation of Fusion Techniques for Multimodal Content-based Medical Image Retrieval 3 textual base point run using this model on ImageCLEF 2011 Medical retrieval task data collection as RUN_1 . Fig. 2. MAP scores of weighting models for textual features Fig. 3. Number of relevant retrieved document in different weighting models for textual features 2.2 Visual Features We Used Selection of low-level features is one of the major aspects of a typical content-based information retrieval (CBIR) system. We call these low-level features because most 4 Adil Alpkocak, Okan Ozturkmenoglu, Tolga Berber, Ali Hosseinzadeh Vahid and Roghaiyeh Gachpaz Hamed of them are extracted directly from digital representations of objects in the database and have little or nothing to do with human perception. Thanks to Img(Rummager) application [4], is developed in the Automatic Control Systems & Robotics Laboratory at the Democritus University of Thrace-Greece, and we extracted features explained below for all images in ImageCLEF 2011 test collection and query examples:  EHD: This Edge Histogram Descriptor proposed for MPEG-7 expresses only the local edge distribution in the image and is designed to contain only 80 bins for this purpose. The EHD basically represents the distribution of 5 types of edges in each local area called a sub-image that is defined by dividing the image space into 4x4 non-overlapping blocks. Thus, the image partition always yields 16 equal-sized sub-images regardless of the size of the original image. Edges in the sub-images are categorized into 5 types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal and non-directional edges. Thus, the histogram for each sub-image represents the relative frequency of occurrence of the 5 types of edges in the corresponding sub-image and contains 5 bins [7].  CEDD: This feature combines EHD with color histogram information and named “Color and Edge Directivity Descriptor”. CEDD size is limited to 54 bytes per image, rendering this descriptor suitable for use in large image databases. Important attribute of the CEDD is the low computational power needed for its extraction, in comparison to the needs of the most MPEG-7 descriptors [4].  FCTH: This feature fuzzy version of CEDD feature which contains fuzzy set of color and texture histogram and named “Fuzzy Color and Texture Histogram”. This feature contains results from the combination of 3 fuzzy systems including histogram, color and texture information. FCTH size is limited to 72 bytes per image, and also suitable for use in large image databases [5].  BTDH: This feature is very similar to FCTH feature. The main difference from FCTH feature is using brightness instead of color histogram. This feature is originally developed for radiology images which do not contain color data [6]. After extracting features, we gain an n-dimensional feature space per feature. For query processing, we had to map all of the objects in the database and the query onto this space and then evaluate the similarity difference between the vector corresponding to the query and the vectors representing the data. We selected the Euclidean distance, one of commonly used similarity and distance functions for measuring distances between points in the 3D space, as distance/similarity function and based on obtained similarity scores; we found that CEDD and FCTH are the best descriptors for image retrieval based on low level features only. Therefore we submitted our visual only base point run for CEDD feature. Moreover we use these features for multimodal fusion in next experiments. DEMIR at ImageCLEFMed 2011: Evaluation of Fusion Techniques for Multimodal Content-based Medical Image Retrieval 5 Fig. 4. Comparison of low level feature performance on ImageCLEF 2010 Wikipedia Retrieval task. 3 Fusion Techniques in Multimodal Information Retrieval Multimedia fusion is referred to as integration of multiple media, their associated features, or the intermediate decisions in order to perform an analysis task, has gained much attention of many researchers in recent times. The fusion of multiple modalities can provide complementary information and increase the accuracy of the overall decision making process [8]. The fusion of different modalities is generally performed at two levels: feature level or early fusion and decision level or late fusion. Some researchers have also followed a hybrid approach by performing fusion at the feature as well as the decision level. In the feature level or early fusion approach, the features, some distinguishable properties of a media stream, extracted from input data are first combined and then sent as input to a single analysis unit that performs the analysis task. In the decision level or late fusion approach, the analysis units first provide the local decisions D1 to Dn that are obtained based on individual features F1 to Fn. Then a decision fusion unit combines local decisions to make a fused decision vector that is analyzed further to obtain a final decision D about the task or the hypothesis. To achievement the advantages of both the feature level and the decision level fusion strategies, several researchers have opted to use a hybrid fusion strategy, which is a combination of both feature and decision level strategies. The decision level fusion strategy has many advantages over feature fusion. For instance, the decisions (at the semantic level) usually have the same representation. Therefore, the fusion of decisions becomes easier. Moreover, the decision level fusion strategy offers scalability (i.e. graceful upgrading or degradation) in terms of the 6 Adil Alpkocak, Okan Ozturkmenoglu, Tolga Berber, Ali Hosseinzadeh Vahid and Roghaiyeh Gachpaz Hamed modalities used in the fusion process, which is difficult to achieve in the feature level fusion. Another advantage of late fusion strategy is that it allows us to use the most suitable methods for analyzing each single modality and this provides much more flexibility than the early fusion. Because of these profits, we exerted Linear Weighted Fusion, one of the simplest and most widely used methods on our extracted CEDD and FCTH similarity scores and similarity scores that gained from text retrieval as explained in previous chapters. We applied Fagin’s Combination Algorithms [9] for Ranked Input Sets putting on two score aggregation function defined as “Average” and “Weighted Average”. The average function is applied by taking mean of individual similarity scores of any object. On the other hand, the weighted average function is applied in the same manner but differing on multiplying each individual similarity with a weight value. The weight assignment to individual scores provides an importance level for each feature defined in a whole query [10]. After comparison of several studies we decided to multiply textual feature by 3 and CEDD feature by 2 to gain the best fusion result based on weighted average combination method. Before fusion operation takes place, normalization should be applied to get accurate and correct results since different modalities results a different ranges of similarity values [12]. Here, we applied Min-Max normalization on similarity values to ensure that the range of these features is between 0 and 1. The following equations will ensure the range of this feature from 0 to 1. Suppose the range for a feature is from to . Then the normalized feature is defined as follows: (1) Min-Max normalization is a process of taking data measured in its units and transforming it to a value between 0.0 and 1.0. The lowest (min) value is set to 0.0 and the highest (max) value is set to 1.0. This provides an easy way to compare values that are measured using different scales (i.e., textual, shape, visual, density etc.) or different units of measure (i.e., Euclidean or non-metric space values). After normalization of the similarity values, we combined the different modalities in ranked results. 4 Experimentations We submitted 10 runs to ImageCLEF Medical Retrieval task, in three different categories. The first category includes the runs for baseline retrieval in single modality, numbered as 1 and 3 are baseline retrievals for textual-only and visual-only retrieval, respectively. The second groups of runs to evaluate re-ranking affects to base line, numbered as 8 is re-indexed the baseline retrieval result and re-ranked in textual modality. The last group includes mixed retrieval experiments with fusion of different modalities, numbered as 2, 4, 5, 6, 7, 9 and 10. As illustrated in Table 1, it is obvious that results of mixed runs are better than textual or visual only. Moreover DEMIR at ImageCLEFMed 2011: Evaluation of Fusion Techniques for Multimodal Content-based Medical Image Retrieval 7 results of weighted average combination method are better than normal average method in all approach. Below, we provide a short description of each runs, shortly.  RUN_1: This run is our baseline retrieval result for textual modality. In this run, we removed the stop words, applied Porter stemmer algorithm and used the DFR_BM25 weighting model on text retrieval engine system, Terrier. Let the subscript indicates the arbitrary run ID, the similarity of first run, S1, is defined as follows: (2)  RUN_3: Our baseline retrieval result is this run for visual modality. We used the CEDD feature in visual modality because its performance is better than other features, also you can see in Figure 4. (3)  RUN_8: This run of our group on textual feature is based on our proposed a two- level re-ranking approach in for move relevant documents upward. Re-ranking is a method to reorder the initially retrieved documents with the aim to increase precision. Basically, relevant documents with low similarity scores are re-weighted and reordered. In this run, we propose a new re-ranking approach which includes the narrowing-down phase of search space. Result sets of each query and corresponding base similarity scores are inputs for re-ranking operation. Firstly, we selected relevant documents using initial similarity scores. In other word, we filtered out non-relevant documents based on initial similarity scores. For this we selected first 1000 relevant documents if it existed. Then we constructed a new VSM using this small document sets. This operation drastically reduced both the number of documents and the number of terms. In short, this level shrinks down the initial VSM data into more manageable size. Then we calculated similarity score of new VSM and submitted the results as RUN_8. As illustrated in Table 1, unlike the achievements of this approach in ImageCLEF 2010 Wikipedia retrieval task, all factors of retrieval system decline in contrast to our textual base line run. (4)  RUN_2: Another narrowing down approach that we examine this year is based on medical image modality classification. Result sets of each query and corresponding base similarity scores and their class based on any classification algorithm are inputs for this approach. We also expanded query structure by assignment a type for example images of each query. A query can have a more than one type. In the narrowing down phase we filtered out non relevant images that its class was not the same as corresponding query type. We applied this method filtering the modality classification using GIFT system and 1NN approach and submitted RUN_2 as results. As obtained from Table 1, although MAP in this method is decreased but there are a considerable improvement in P@10 and P@ 20 values in contrast to textual base line. 8 Adil Alpkocak, Okan Ozturkmenoglu, Tolga Berber, Ali Hosseinzadeh Vahid and Roghaiyeh Gachpaz Hamed (5)  RUN_5: In this run, we combined the multiplied textual feature by 3 with the multiplied visual retrieval result using CEDD feature by 2, divided total score with the rated value 5. (6)  RUN_4: We combined the baseline textual retrieval result with visual retrieval result using CEDD feature and get average score. (7)  RUN_7: Another approach that we experimented in text retrieval this year is evaluation of effects of weighting to special words in queries. For this purpose we selected the medical modality names in queries (i.e., CT, PET, X-RAY, MRI etc.) and weighted them by 2.5 using query language of Terrier. Although result of this approach decline in compare to baseline too, but they are better than result of re- ranking methods. Due to limitation of submitted runs of participant, we did not submitted weighted text retrieval results as a new run but we fused them with low level feature of images to obtain better performance. In this run, we combined the multiplied weighted textual feature by 3 with the multiplied visual retrieval result using CEDD feature by 2, divided total score with the rated value 5. (8)  RUN_10: After we combined the multiplied RUN_8 result by 3 with the multiplied visual retrieval result using CEDD feature by 2 and divided total score with the rated value 5. (9)  RUN_6: After we combined the weighted textual retrieval result with visual retrieval result using CEDD feature and get average score. (10) RUN_9: After we combined the RUN_8 result with visual retrieval result using CEDD feature and get average score. (11) DEMIR at ImageCLEFMed 2011: Evaluation of Fusion Techniques for Multimodal Content-based Medical Image Retrieval 9 Table 1. Runs of DEMIR group in ImageCLEFMed 2011. RunID Rank Type MAP P10 P20 Rprec bpref rel_ret 5 1 Mixed 0.2372 0.3933 0.3550 0.2881 0.2738 1597 4 2 Mixed 0.2307 0.3967 0.3400 0.2706 0.2606 1595 7 3 Mixed 0.2014 0.3400 0.3233 0.2587 0.2481 1455 10 4 Mixed 0.1983 0.4067 0.3350 0.2397 0.2428 1349 6 5 Mixed 0.1972 0.3367 0.3083 0.2489 0.2383 1443 9 6 Mixed 0.1853 0.3667 0.3283 0.2309 0.2230 1338 2 7 Mixed 0.1645 0.3967 0.3350 0.2340 0.2198 890 1 15 Text 0.1942 0.3400 0.2933 0.2242 0.2215 1444 8 49 Text 0.1452 0.3033 0.2633 0.1683 0.1859 1288 3 12 Visual 0.0174 0.1067 0.0833 0.0434 0.0602 569 5 Conclusion In this year, we examined effects of different weighting models on text retrieval and found that the role of proper weighting model selection is to improve the performance of text retrieval systems. Also, we compare MAP of different extracted low-level features normalized similarity scores and due to this comparison we select CEDD and FCTH descriptors as suitable features to utilize for fusion to textual results. Also due to analogy of combination methods in our previous studies, we acquire choosing a suitable combination method for fusion improved the results. The results clearly show that combining text-based and content-based image retrieval results with a proper fusion technique improves the performance. References 1. Medical Image Retrieval Task 2011, http://www.imageclef.org/2011/medical 2. The Terrier IR Platform, http://terrier.org/docs/v2.2.1/ 3. Porter, M.F.: An algorithm for suffix stripping, Program: electronic library and information systems, vol. 14, iss. 3, pp. 130--137 (1980) 4. Chatzichristofis, S.A., Boutalis, Y.S., Lux, M.: Img(Rummager): An Interactive Content Based Image Retrieval System. In: 2nd International Workshop on 10 Adil Alpkocak, Okan Ozturkmenoglu, Tolga Berber, Ali Hosseinzadeh Vahid and Roghaiyeh Gachpaz Hamed Similarity Search and Applications, pp. 151--153. IEEE Computer Society, Washington (2009) 5. Chatzichristofis, S.A., Boutalis, Y.S.: FCTH: Fuzzy Color and Texture Histogram - A Low Level Feature for Accurate Image Retrieval. In: 9th International Workshop on Image Analysis for Multimedia Interactive Services, vol., no., pp.191--196. Klagenfurt, Austria (2008) 6. Chatzichristofis, S.A., Boutalis Y.S.: Content based radiology image retrieval using a fuzzy rule based scalable composite descriptor. Multimedia Tools and Applications, vol. 46, iss. 2, pp. 493--519 (2009) 7. Won C. S., Park D. K., Park S.J.: Efficient Use of MPEG-7 Edge Histogram Descriptor, ETRI Journal, vol. 24, no. 1 (2002) 8. Pradeep K. Atrey, Anwar Hossain M.: Multimodal fusion for multimedia analysis, Multimedia Systems, vol 16, pp. 345--379 (2010) 9. Fagin R, Lotem A, Naor M.: Optimal aggregation algorithms for middleware, In: Journal of Computer and System Sciences, vol. 66, pp. 614--656 (2003) 10. Croft, W.B.: Combining Approaches to Information Retrieval, In: Advances in Information Retrieval, vol. 7, pp. 1--36 (2002) 11. He, Ben., Ounis, Iadh: Term Frequency Normalisation Tuning for BM25 and DFR Models., Advances in Information Retrieval, vol. 3408, pp. 200--214 (2005) 12. Ulker T.: Analysis and comparison of combination algorithms for joining ranked inputs, MSc Thesis, Dokuz Eylül University Department of Computer Engineering, Izmir, Turkey (2003)