The medGIFT Group in ImageCLEFmed 2011

Dimitrios Markonis, Ivan Eggel, Alba G. Seco de Herrera, Henning Müller
University of Applied Sciences Western Switzerland (HES–SO), Sierre, Switzerland
dimitrios.markonis@hevs.ch

Abstract. This article presents the participation of the medGIFT group in ImageCLEFmed 2011. Since 2004, the group has participated in the medical image retrieval tasks of ImageCLEF each year. The main goal is to provide a baseline by using the same technology each year, and to search for further improvements in retrieval quality. ImageCLEFmed 2011 consists of three tasks: modality classification, image–based retrieval and case–based retrieval. The medGIFT group participated in all three tasks. For the image–based and case–based retrieval tasks, two existing retrieval engines were used: the GNU Image Finding Tool (GIFT) for visual retrieval and Apache Lucene for text retrieval. For the modality classification, a purely visual approach was used, with GIFT for the visual retrieval and a kNN (k–Nearest Neighbors) classifier for the classification. Results show that the best text runs outperform the best visual runs by a factor of roughly six in terms of mean average precision. The baselines provided by Apache Lucene and GIFT rank above average among the text runs and the visual runs, respectively, in image–based retrieval. In the case–based retrieval task the Lucene baseline is the second best automatic run for text retrieval, and our mixed and visual runs are the best in their respective categories. For modality classification, GIFT with the kNN–based approach performs below the average of the visual approaches.

1 Introduction

ImageCLEF is the cross–language image retrieval track (http://www.imageclef.org/) of the Cross Language Evaluation Forum (CLEF). ImageCLEFmed is the part of ImageCLEF focusing on medical images [4, 5]. The medGIFT research group (http://www.hevs.ch/medgift/) has participated in ImageCLEFmed since 2004, using the same technology as baselines. Additional modifications of the basic techniques were attempted to improve the results. Visual and textual baseline runs have been made available to the other participants of ImageCLEFmed. The visual baseline is based on GIFT (GNU Image Finding Tool, http://www.gnu.org/software/gift/, [6]), whereas Lucene (http://lucene.apache.org/) was used for text retrieval.

This year, the bag–of–visual–words approach [1], based on local descriptors quantized into so–called visual words, was used as well. This widely used method is applied as follows: a training set of images is chosen and a number of local descriptors (in the case of SIFT, the Scale Invariant Feature Transform, 128–dimensional vectors) is extracted from each image of this set. The descriptors are then clustered using a clustering method (such as k–means) and the centroids of the clusters are used as visual words. The set of all visual words forms the visual vocabulary. Local features are then also extracted from each image in the database. The images are finally indexed as histograms of visual word occurrences (bags of visual words) by assigning the nearest visual word to each feature vector. When an image is used as a query, a similarity measure compares the query image histogram with the database image histograms, providing a similarity score. Several approaches have been proposed to include spatial information in this representation [2, 3], improving performance.
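To make the generic pipeline described above more concrete, the following minimal sketch builds a visual vocabulary with k–means and encodes images as visual–word histograms that are compared by histogram intersection. OpenCV is used for SIFT and scikit–learn for clustering purely as illustrative stand–ins; the runs described later rely on the Fiji SIFT implementation and DENCLUE clustering instead, and all function names and parameter values here are assumptions, not the submitted system.

```python
# Minimal bag-of-visual-words sketch: SIFT descriptors, k-means vocabulary,
# histogram encoding and histogram-intersection ranking.
# Assumes OpenCV (cv2) and scikit-learn; all names and parameters are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans


def extract_sift(path):
    """Return the SIFT descriptors (N x 128) of one image, or an empty array."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)


def build_vocabulary(training_paths, n_words=100):
    """Cluster all training descriptors; the cluster centroids are the visual words."""
    descriptors = np.vstack([extract_sift(p) for p in training_paths])
    return KMeans(n_clusters=n_words, n_init=5, random_state=0).fit(descriptors)


def bovw_histogram(path, vocabulary):
    """Encode an image as a normalized histogram of visual-word occurrences."""
    desc = extract_sift(path)
    if len(desc) == 0:
        return np.zeros(vocabulary.n_clusters)
    words = vocabulary.predict(desc)          # nearest visual word per descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


def rank_by_intersection(query_hist, database_hists):
    """Rank database images by histogram intersection with the query histogram."""
    scores = {img: float(np.minimum(query_hist, h).sum())
              for img, h in database_hists.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```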
2 Methods

This section describes the basic techniques that we used for retrieval in ImageCLEFmed 2011.

2.1 Retrieval Tools Reused

This section details the existing retrieval tools that were reused for text and visual retrieval.

Text Retrieval. The text retrieval approach in 2011 is based on Lucene using standard settings. Four text runs were submitted, two for case–based retrieval and two for image–based retrieval. For both case–based and image–based retrieval, captions and full text were used. The full text approach used all texts as obtained in the data set; links, metadata, scripts and style information were removed and only the remaining text was indexed. For the image captions, an XML file containing the captions of all images was indexed. No specific terminologies such as MeSH (Medical Subject Headings) were used.

Visual Retrieval. GIFT is a visual retrieval engine based on color and texture information [6]. Colors are compared using a simple intersection of color histograms. Texture information is described by applying Gabor filters and quantizing the responses into 5 strengths. This differs from the 10 strengths used in previous years, because the size of this year's data set can cause problems for GIFT. Each image is rescaled to 256x256 pixels and partitioned into fixed regions to extract both global and local features. GIFT uses a standard tf/idf (term frequency/inverse document frequency) strategy for feature weighting. It also allows image–based queries with multiple input images. GIFT has been used for the ImageCLEFmed tasks since 2004, each year with the default settings to provide a baseline. For classification, GIFT was used to produce the distance (similarity) values, followed by a nearest neighbor (1NN) classification.

For the description of the images when using visual words, we used the SIFT implementation of the Fiji image processing package (http://fiji.sc/wiki/index.php/Fiji). To create the visual vocabulary, our implementation of the density–based clustering algorithm DENCLUE [7] was used. The reason for this choice is the nature of the data set that needs to be clustered: it is large scale (1'000 training images produce approximately 2'500'000 descriptors) and high dimensional (SIFT descriptors are 128–dimensional). The DENCLUE algorithm is highly efficient for clustering large–scale data sets, can detect arbitrarily shaped clusters and handles outliers and noise well. Moreover, as opposed to other density–based clustering algorithms, it performs well for high–dimensional data. However, when using a density–based clustering algorithm, care has to be taken with data sets containing clusters of different densities. To deal with this, the parameter ξ that controls the significance of a candidate cluster with respect to its density was set to zero.

In order to create a pipeline allowing easy component–based evaluation of this method, the outputs of every intermediate step were stored in CSV files and MySQL tables. These use a large amount of storage but speed up the tuning and evaluation of individual components of the method once the ground truth is available. Due to the characteristics of this architecture, only vocabularies with a small number of visual words (~100) could be used, and an n × n partition was used with a maximum of n = 2.
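As an illustration of the n × n partitioning just mentioned, the sketch below concatenates per–cell visual–word histograms for a 2 × 2 grid. The keypoint coordinates and visual–word assignments are assumed to come from SIFT extraction and vocabulary assignment as sketched earlier; the function and parameter names are illustrative and not the actual implementation.

```python
# Sketch of an n x n spatially partitioned visual-word histogram (here n = 2):
# each keypoint is assigned to a grid cell and the per-cell histograms are
# concatenated. Inputs (keypoint coordinates, word indices) are assumed to come
# from earlier SIFT/vocabulary steps; all names are illustrative.
import numpy as np


def spatial_bovw_histogram(points_xy, word_ids, width, height, n_words, n=2):
    """points_xy: (N, 2) keypoint coordinates; word_ids: (N,) visual-word indices."""
    hist = np.zeros(n * n * n_words, dtype=float)
    for (x, y), w in zip(points_xy, word_ids):
        col = min(int(n * x / width), n - 1)   # grid column of the keypoint
        row = min(int(n * y / height), n - 1)  # grid row of the keypoint
        cell = row * n + col
        hist[cell * n_words + w] += 1.0        # count the word in its cell
    return hist / max(hist.sum(), 1.0)
```

Concatenating the cell histograms keeps coarse layout information while the result remains directly comparable with histogram intersection.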
The third submitted approach combines the modality classification and the image retrieval tasks. Using the GIFT assignment of modalities, the histograms of visual words were indexed in MySQL tables according to their classes. In this indirect manner, the GIFT and bag–of–visual–words approaches were combined as well. The visual word histogram of the query image was first compared to the indexed histograms of the training set using histogram intersection, and the classes of the 5 nearest neighbors were obtained. Then, the same histogram was compared again, but only with the histograms indexed in the tables of these classes. The 1'000 nearest images were taken as the results for the topic image. For topics containing more than one query image, the combSUM technique was used, as explained in the next section.

Fusion Techniques. In 2009, the ImageCLEF@ICPR fusion task was organized to compare fusion techniques using the best ImageCLEFmed visual and textual results [8]. Studies such as [9] show that combSUM (1) and combMNZ (2), proposed by [10] in 1994, are robust fusion strategies. On the data of the ImageCLEF@ICPR fusion task, combMNZ performed slightly better than combSUM, but the difference was small and not statistically significant. In general, rank–based fusion worked better than score–based fusion.

$S_{combSUM}(i) = \sum_{k=1}^{N_k} S_k(i)$   (1)

$S_{combMNZ}(i) = F(i) \cdot S_{combSUM}(i)$   (2)

where F(i) is the frequency of image i being returned by an input system with a non–zero score, and S_k(i) is the score assigned to image i by system k.

In ImageCLEFmed 2011, fusion with score–based combSUM was used in three cases:

– fusing textual and visual runs to produce mixed runs;
– fusing the results of the various images belonging to the same topic for the bag–of–visual–words approaches (GIFT handles queries with several images automatically);
– fusing the GIFT and bag–of–visual–words approaches.

2.2 Image Collection

230'089 medical images were available for ImageCLEFmed 2011. Among them, 1'000 images with modality labels were used as training data and another 1'000 images were selected as test data for the modality classification. Details about the setup and the collections of the ImageCLEFmed tasks can be found in the overview paper [11].

3 Results

This section describes our results for the three medical tasks.

3.1 Modality Classification

One run was submitted to the modality classification task using GIFT. Table 1 shows the best accuracy and the average accuracy for the runs of the various types (textual, visual, mixed).

Table 1. Results of the runs for the modality classification task.

run            best accuracy   average accuracy
mixed runs     0.8691          0.7188
textual runs   0.7041          0.5903
visual runs    0.8359          0.6878
GIFT 1NN       0.6220          –

It can be observed that GIFT with 1NN classification performed worse than the average accuracy of the visual runs. This was expected, as neither the k of the kNN nor the GIFT feature configuration was optimized, the latter due to the size of the data set. The results also show that the visual runs achieve performance close to the mixed runs, highlighting the importance of visual characteristics for modality classification. The analysis of the text results is not fully reliable, though, as only two exclusively textual runs were submitted.

3.2 Image–based Retrieval

In total, 8 runs were submitted to the image–based retrieval task by the medGIFT group. Using the GIFT baseline and the 2 textual baselines, 2 mixed runs were produced with the combSUM approach. One run was the fusion of GIFT and the 2–step approach described in Section 2.1.
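As an illustration of how such score–based mixed runs can be produced, the sketch below implements combSUM and combMNZ as defined in Equations (1) and (2). The min–max normalization of scores before summing and the example run contents are assumptions, since the working notes do not specify how the scores of different systems were made comparable.

```python
# Score-based combSUM / combMNZ fusion of several runs (Equations 1 and 2).
# Each run maps an image identifier to a retrieval score. Min-max normalization
# before summing is an assumption; the example run contents are illustrative.
from collections import defaultdict


def normalize(run):
    """Min-max normalize the scores of one run to [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    span = (hi - lo) or 1.0
    return {doc: (score - lo) / span for doc, score in run.items()}


def comb_sum(runs):
    """combSUM: sum of the (normalized) scores over all input runs."""
    fused = defaultdict(float)
    for run in map(normalize, runs):
        for doc, score in run.items():
            fused[doc] += score
    return dict(fused)


def comb_mnz(runs):
    """combMNZ: combSUM score multiplied by the number of runs returning the item."""
    freq = defaultdict(int)
    for run in runs:
        for doc in run:
            freq[doc] += 1
    return {doc: freq[doc] * score for doc, score in comb_sum(runs).items()}


# Illustrative use: fuse a textual and a visual run into a mixed run.
text_run = {"img_1": 7.3, "img_2": 5.1, "img_3": 2.0}
visual_run = {"img_2": 0.91, "img_4": 0.40}
mixed = sorted(comb_mnz([text_run, visual_run]).items(),
               key=lambda kv: kv[1], reverse=True)
```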
Results are shown in Table 2. Mean average precision (MAP), binary preference (Bpref) and early precision (P10, P30) are reported as measures. For the full text retrieval, the score of a text was extended to all images of that text; for the caption–based retrieval, the score of a caption was extended to all images of that caption.

Table 2. Results of the medGIFT runs and the best runs for the image–based topics.

run                    run type    MAP     P10     P30     Rprec   Bpref   num rel ret
best mixed run         Manual      0.2372  0.3933  0.3550  0.2881  0.2738  1597
mixed captions ib      Automatic   0.1176  0.2800  0.2100  0.1575  0.1614  705
mixed full ib          Automatic   0.0857  0.2900  0.2700  0.1300  0.1308  830
best visual run        Automatic   0.0338  0.1500  0.1317  0.0625  0.0717  717
gift visual ib         Automatic   0.0274  0.1467  0.1367  0.0581  0.0807  731
visual ib              Automatic   0.0252  0.1267  0.1200  0.0554  0.0752  709
bovw visual ib         Automatic   0.0126  0.0867  0.0800  0.0315  0.0437  324
bovw s2 visual ib      Automatic   0.0076  0.0900  0.0650  0.0182  0.0279  213
best textual run       Automatic   0.2172  0.3467  0.3017  0.2369  0.2402  1471
image-based captions   Automatic   0.1742  0.3000  0.2683  0.2096  0.2179  1261
image-based fulltext   Automatic   0.0921  0.2167  0.2150  0.1264  0.1506  1211

In terms of mean average precision (MAP), the best textual run (0.2172) outperforms the best visual run (0.0338) by a factor of more than six, which shows a large performance gap between the two approaches. The gap is, however, considerably smaller than in ImageCLEF 2010. The average score of all textual runs is 0.1644, whereas the average score of all visual retrieval runs is 0.0146. The performance of the baseline produced by Apache Lucene based on image caption information (HES–SO–VS CAPTIONS) is slightly above this average. On the other hand, GIFT performed surprisingly well, considering the non–optimal configuration and the age of the tool. The bag–of–visual–words approaches did not demonstrate good results, most likely due to the lack of parameter tuning and the lack of use of training data. For the 2–step approach, the initial results were below those of the GIFT baseline, but better results could already be obtained with parameter tuning. The component–based architecture that was developed will, however, make further research easier to perform. Merging textual runs with visual runs reduces the performance of the textual runs, which is again due to the non–optimal fusion technique using scores rather than ranks. The two mixed runs submitted by the medGIFT group are based on a simple merging approach and are penalized by the large performance gap between the textual and visual runs.

3.3 Case–based Retrieval

The medGIFT group submitted four visual runs, one textual run and one mixed run for the case–based retrieval task. The visual runs were obtained by a case–based fusion of the results of querying with all images of a case, using the combSUM strategy. The text runs were performed using the full text; for the caption–based retrieval, the results of all captions of a text were combined using combSUM. Based on the visual and textual runs, mixed runs were produced using the combSUM strategy.
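One plausible reading of this case–based fusion is sketched below: the result lists obtained for the individual query images of a topic case are fused with combSUM, and the fused image scores are then aggregated per article. The image–to–article mapping and the aggregation by summation are assumptions and may differ from the actual processing.

```python
# Sketch of case-based fusion under assumptions: the per-query-image result
# lists of one topic case are fused with combSUM (plain score summation), and
# the fused image scores are aggregated to article level. The image-to-article
# mapping and the sum aggregation are assumptions, not the documented pipeline.
from collections import defaultdict


def fuse_case_results(per_image_runs, image_to_article):
    """per_image_runs: one {image_id: score} dict per query image of the case."""
    image_scores = defaultdict(float)
    for run in per_image_runs:                    # combSUM over the query images
        for image_id, score in run.items():
            image_scores[image_id] += score
    article_scores = defaultdict(float)
    for image_id, score in image_scores.items():  # aggregate images to articles
        article_scores[image_to_article[image_id]] += score
    return sorted(article_scores.items(), key=lambda kv: kv[1], reverse=True)
```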
Table 3 shows the medGIFT runs and, where our run was not the best, also the performance of the best run.

Table 3. Results of the medGIFT runs and the best runs for the case–based topics.

run                     run type    MAP     P10     P20     Rprec   Bpref   num rel ret
mixed GIFTLucene full   Automatic   0.0754  0.1667  0.1556  0.1227  0.0958  121
best textual run        Automatic   0.1297  0.1889  0.1500  0.1588  0.1212  144
case based fulltext     Automatic   0.1293  0.2000  0.1444  0.1509  0.1122  141
case based captions     Automatic   0.0437  0.1111  0.0833  0.0816  0.0540  90
gift visual             Automatic   0.0204  0.0444  0.0333  0.0336  0.0292  45

The best performance in terms of MAP (0.1297) was obtained by purely textual retrieval. The Lucene baseline (fulltext) is the second best run (0.1293) among all automatic runs, and the difference to the best run is not statistically significant. medGIFT was the only group that submitted purely visual runs, and even though the best visual result (0.0204, obtained by GIFT) is lower than that of the best textual run, the gap is not as large as in the image–based task. The mixed run of GIFT and the Lucene fulltext achieved the best result (0.0754) among the mixed runs. This run also has the best P5 (very early precision) of all runs; its P10, however, is already lower than the P10 of the best textual run, which is also a run of the medGIFT group. The combination decreased the MAP of the corresponding textual run, so there is still potential to improve the current system by reordering the results rather than using a direct fusion of scores.

4 Conclusions

Based on the results of the medGIFT participation, several lessons can be learned, often similar to or at least in line with those of previous years. The baseline run of Lucene using captions performed better in the image–based task, while the fulltext–based approach showed good results in the case–based retrieval task. In general, the GIFT baseline performs well in image–based and case–based retrieval, although at a lower level than the text retrieval approaches. The same cannot be said for the modality classification, but this was probably due to the classification rule, which was extremely simple and did not use any training data. For visual classification, several very good and optimized systems exist that reach much better performance. As the data sets grow larger, aspects of system scalability such as the trade–off between memory usage, speed and quality have to be taken into account for future content–based image retrieval systems.

Concerning the bag–of–visual–words runs, further testing and work are required to fully exploit the advantages of the methods used. Larger vocabularies and finer partitions may improve the results, and a better classifier can enhance the 2–step approach, which already delivered better results than the plain bag–of–visual–words runs.

Finally, we can see that the majority of the mixed runs decreased the performance compared to the corresponding textual runs. This indicates that special care needs to be taken when fusing runs with very different performance levels, as is the case for the textual and visual runs. In the past, rank–based fusion has been shown to work better than score–based approaches, and this was mistakenly not taken into account.

5 Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme under grant agreements 257528 (KHRESMOI), 249008 (Chorus+) and 258191 (Promise).

References

1. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2 (ICCV '03), Washington, DC, USA, IEEE Computer Society (2003) 1470–1477
2. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR '06), Washington, DC, USA, IEEE Computer Society (2006) 2169–2178
3. Philbin, J., Chum, O., Isard, M.: Object retrieval with large vocabularies and fast spatial matching. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA (June 2007) 1–8
4. Clough, P., Müller, H., Sanderson, M.: The CLEF cross–language image retrieval track (ImageCLEF) 2004. In Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B., eds.: Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign. Volume 3491 of Lecture Notes in Computer Science (LNCS), Bath, UK, Springer (2005) 597–613
5. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Said, R., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the CLEF 2010 medical image retrieval track. In: Working Notes of CLEF 2010 (Cross Language Evaluation Forum) (September 2010)
6. Squire, D.M., Müller, W., Müller, H., Pun, T.: Content–based query of image databases: inspirations from text retrieval. Pattern Recognition Letters (Selected Papers from the 11th Scandinavian Conference on Image Analysis, SCIA '99, B.K. Ersboll, P. Johansen, eds.) 21(13–14) (2000) 1193–1198
7. Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases with noise. In: Conference on Knowledge Discovery and Data Mining (KDD), AAAI Press (1998) 58–65
8. Müller, H., Kalpathy-Cramer, J.: The ImageCLEF medical retrieval task at ICPR 2010: information fusion to combine visual and textual information. In: Proceedings of the International Conference on Pattern Recognition (ICPR 2010), Lecture Notes in Computer Science (LNCS), Istanbul, Turkey, Springer (August 2010), in press
9. Zhou, X., Depeursinge, A., Müller, H.: Information fusion for combining visual and textual image retrieval. In: International Conference on Pattern Recognition (ICPR '10), Los Alamitos, CA, USA, IEEE Computer Society (2010)
10. Fox, E.A., Shaw, J.A.: Combination of multiple searches. In: Text REtrieval Conference (1993) 243–252
11. Kalpathy-Cramer, J., Müller, H., Bedrick, S., Eggel, I., Seco de Herrera, A., Tsikrika, T.: The CLEF 2011 medical image retrieval and classification tasks. In: Working Notes of CLEF 2011 (Cross Language Evaluation Forum) (September 2011)