
FCSE at Medical Tasks of ImageCLEF 2013

Ivan Kitanovski

Ivica Dimitrovski

Suzana Loskovska

suzana.loshkovska@finki.ukim.mk

Faculty of Computer Science and Engineering, University of Ss Cyril and Methodius, Rugjer Boshkovikj 16, 1000 Skopje, Macedonia

This paper presents the details of the participation of the FCSE (Faculty of Computer Science and Engineering) research team in the ImageCLEF 2013 medical tasks (modality classification, compound figure separation, ad-hoc image retrieval and case-based retrieval). For the modality classification task we used SIFT descriptors and tf-idf weights of the surrounding text (image caption and paper title) as features. SVMs with a χ² kernel and a one-vs-all strategy were used as classifiers. For the ad-hoc image retrieval and case-based retrieval tasks we adopted a strategy that combines word-space and concept-space approaches. The word-space approach uses the Terrier IR search engine to index and retrieve the text associated with the images/cases. The concept-space approach uses Metamap to map the text data into a set of UMLS (Unified Medical Language System) concepts, which are then indexed and retrieved by the Terrier IR search engine. The results from the word-space and concept-space retrieval are fused using a linear combination. For the compound figure separation task, we used an unsupervised algorithm based on a breadth-first search strategy that uses only visual information from the medical images. The selected algorithms were tuned and tested on the data from the ImageCLEF 2012 medical task, and based on the selected parameters we submitted the new experiments for the ImageCLEF 2013 medical task. Our best run for modality classification ranked 2nd in the overall score, and our best run for ad-hoc image retrieval ranked 3rd.

1 Introduction

In this paper we present the experiments performed by the Faculty of Computer Science and Engineering (FCSE) team for the medical tasks at ImageCLEF 2013. Our group participated in all medical subtasks. To find suitable parameters, we evaluated our approaches on the ImageCLEF 2012 dataset and then, based on those parameters, submitted the runs for ImageCLEF 2013.

The paper is organized as follows: Section 2 describes our approach for the modality classification task, Section 3 presents the algorithm for the compound figure separation task, Section 4 presents the ad-hoc image retrieval task, and Section 5 contains the details of the case-based retrieval task.

2 Modality classification task

Imaging modality is an important piece of information about an image for medical retrieval. In user studies, clinicians have indicated that modality is one of the most important filters by which they would like to limit their search. Using the modality information, the retrieval results can often be improved significantly. The ImageCLEF 2013 medical modality classification task is a standardized benchmark for systems that automatically classify the modality of medical images from PubMed journal articles [1]. The 2013 dataset has 31 classes (the same number of classes and the same classification hierarchy as in 2012), but a larger number of compound figures is present, which makes the task significantly harder but also much closer to the reality of biomedical journals [1].

Our approach combines visual features with textual features extracted from the text surrounding the images. SVMs with a χ² kernel were used as classifiers. The algorithms are explained in detail in the remainder of this section.

2.2 Visual features

Collections of medical images can contain various images obtained using different imaging techniques. Different feature extraction techniques capture different aspects of an image (e.g., texture, shape, color distribution) [2]. Texture is especially important, because it is difficult to classify medical images using shape or gray-level information. An effective representation of texture is needed to distinguish between images with equal modality and layout. Local image characteristics are fundamental for image interpretation: while global features retain information on the whole image, the local features capture the details. They are thus more discriminative with respect to inter-class and intra-class variability [3].

The bag-of-visual-words approach is commonly used in many state-of-the-art algorithms for image classification [4]. The basic idea of this approach is to sample a set of local image patches using some method (densely, randomly or using a key-point detector) and to calculate a visual descriptor on each patch (SIFT descriptor, normalized pixel values). The resulting distribution of descriptors is then quantized against a pre-specified visual codebook, which converts it to a histogram. The main issues that need to be considered when applying this approach are the sampling of the patches, the selection of the visual patch descriptor and the construction of the visual codebook.

We use dense sampling of the patches, which samples an image grid in a uniform fashion using a fixed pixel interval between patches. We use an interval distance of 6 pixels and sample at multiple scales (σ = 1.2 and σ = 2.0). Due to the low contrast of some of the medical images (for example, radiographs), it would be difficult to use any detector for points of interest. Also, Zhang et al. [4] have pointed out that dense sampling is always superior to any strategy based on detectors for points of interest. We calculate an OpponentSIFT descriptor for each image patch [5], [6]. OpponentSIFT describes all the channels in the opponent color space using SIFT descriptors. The information in the O3 channel is equal to the intensity information, while the other channels describe the color information in the image. These other channels do contain some intensity information, but due to the normalization of the SIFT descriptor they are invariant to changes in light intensity [6].
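The sampling grid and the color transform described above can be illustrated with a minimal sketch. It assumes the opponent color space definition commonly used with OpponentSIFT; the function names `opponent_channels` and `dense_grid` are hypothetical, and the SIFT computation itself and the multi-scale sampling are omitted.

```python
import math

def opponent_channels(r, g, b):
    """Map one RGB pixel to opponent color channels (O1, O2, O3).

    Assumed definition: O1 and O2 carry color, O3 carries intensity.
    """
    o1 = (r - g) / math.sqrt(2)
    o2 = (r + g - 2 * b) / math.sqrt(6)
    o3 = (r + g + b) / math.sqrt(3)
    return o1, o2, o3

def dense_grid(width, height, step=6):
    """Return patch centres sampled every `step` pixels (6 in the text)."""
    return [(x, y) for y in range(0, height, step) for x in range(0, width, step)]
```

For a gray pixel (equal R, G and B) the two color channels are zero and only O3 is non-zero, which matches the statement that O3 holds the intensity information.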

The crucial aspect of the bag-of-visual-words approach is the codebook construction. An extensive comparison of codebook construction variables is given by van Gemert et al. [7]. We employ k-means clustering on 250K randomly chosen descriptors from the set of images available for training. k-means partitions the visual feature space into a predefined number of clusters k by minimizing the within-cluster variance. Here, we set k to 500 and thus define a codebook with 500 codewords [3].
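As an illustration of codebook construction and quantization, the sketch below runs a plain Lloyd-style k-means on a toy set of descriptors and then counts how many descriptors fall on each codeword. It is a simplified stand-in under stated assumptions: the function names, the fixed iteration count and the seeding are hypothetical, and the actual experiments use 250K descriptors and k = 500.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the codeword closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def bow_histogram(descriptors, centroids):
    """Count the descriptors assigned to each codeword."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest(d, centroids)] += 1
    return hist
```

Calling `kmeans(descriptors, 500)` on the training descriptors and then `bow_histogram` on the descriptors of one image would give a 500-bin histogram for that image.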

Dense sampling gives an equal weight to all key-points, irrespective of their spatial location in the image. To overcome this limitation, we follow the spatial pyramid approach [8]. We used a spatial pyramid of 1x1, 2x2, and 1x3 regions. Since every region is an image in itself, the spatial pyramid can easily be used in combination with dense sampling. The resulting vector with 4000 bins ((1x1 + 2x2 + 1x3) x 500) was obtained by concatenating the eight histograms (each histogram is L1-normalized). Fig. 1 shows an example of the histograms extracted from an image for the spatial pyramids of 1x1, 2x2 and 1x3.
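The bin count can be checked with a small sketch that builds the concatenated pyramid histogram from key-points that have already been assigned a codeword. It assumes that the 1x3 level consists of three horizontal bars; the function names and the `(x, y, codeword)` input format are hypothetical.

```python
def region_indices(x, y, width, height):
    """Indices (0..7) of the pyramid regions that contain pixel (x, y).

    Region 0 is the whole image (1x1), regions 1-4 are the 2x2 cells,
    regions 5-7 are three horizontal bars (assumed layout of 1x3).
    """
    col = min(2 * x // width, 1)
    row = min(2 * y // height, 1)
    bar = min(3 * y // height, 2)
    return [0, 1 + 2 * row + col, 5 + bar]

def pyramid_histogram(keypoints, width, height, k):
    """Concatenate eight L1-normalized k-bin histograms.

    keypoints: iterable of (x, y, codeword) tuples.
    """
    hists = [[0.0] * k for _ in range(8)]
    for x, y, word in keypoints:
        for r in region_indices(x, y, width, height):
            hists[r][word] += 1
    out = []
    for h in hists:
        total = sum(h)
        out.extend(v / total if total else 0.0 for v in h)
    return out
```

With k = 500 the returned vector has 8 x 500 = 4000 entries, as stated in the text.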

2.3 Textual features

Images in the collection belong to a medical article, so they can be indexed using the surrounding text content. The text representation adopted in this work includes the title of the paper and the image caption, both of which can be found in the XML file corresponding to each image in the data set. From these, a text corpus for the image collection was built, and standard text processing operations were applied, including tokenization, stemming, and stop-word removal using Terrier IR [9]. We calculate the weight of each term in each document using the TF-IDF weighting model. The calculated weights were adopted as textual features.
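To make the textual features concrete, the following sketch computes term weights for a tokenized corpus using the textbook formulation tf x log(N/df). This is an assumption for illustration: the TF-IDF model implemented in Terrier may use a different normalization, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: weight} dicts.

    Textbook variant: weight = tf * log(N / df).
    """
    n = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term that appears in every document receives weight zero, while a term that is specific to one caption or title receives a higher weight.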

2.4 Feature fusion schemes

Different features (in our case visual and textual) bring different information about the content of the images, and their combination clearly outperforms single-feature approaches [10], [3]. Following these findings, we combine the two feature types described above using a high-level feature fusion scheme. The fusion scheme is depicted in Fig. 2.

The high-level fusion scheme averages the predictions from the individual classifiers trained on the separate descriptors.
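A minimal sketch of this averaging step is given below. It assumes that each classifier outputs one probability per class; the function names and the class labels in the usage note are hypothetical.

```python
def fuse_average(prob_sets):
    """Average per-class probabilities from several classifiers.

    prob_sets: list of {class_label: probability} dicts, one per classifier.
    """
    classes = prob_sets[0].keys()
    return {c: sum(p[c] for p in prob_sets) / len(prob_sets) for c in classes}

def predict(prob_sets):
    """Return the class with the highest averaged probability."""
    fused = fuse_average(prob_sets)
    return max(fused, key=fused.get)
```

For example, a visual prediction of `{"CT": 0.6, "MRI": 0.4}` and a textual prediction of `{"CT": 0.2, "MRI": 0.8}` are averaged to 0.4 and 0.6, so the fused decision is "MRI".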

2.5 Classifier setup

We used the libSVM implementation of SVMs (Support Vector Machines) [11] with probabilistic output [12] as classifiers. To solve the multi-class classification problem, we employ the one-vs-all approach, and each of the SVMs was trained with a χ² kernel. Namely, we build a binary classifier for each modality/class: the examples associated with that class are labeled positive and the remaining examples are labeled negative. This results in an imbalanced ratio of positive to negative training examples. We resolve this issue by adjusting the weights of the positive and negative class [6]. In particular, we set the weight of the positive class to (#pos + #neg)/#pos and the weight of the negative class to (#pos + #neg)/#neg, where #pos is the number of positive instances in the training set and #neg is the number of negative instances. We also optimize the cost parameter C of the SVMs using an automated parameter search procedure [6]. For the parameter optimization, we used the dataset from 2012. After finding the optimal C value, the SVM is trained on the 2013 set of training images.
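The class weighting and the kernel can be sketched as follows. The weights follow the ratios given in the text; the kernel is the exponential χ² form that is commonly used with histogram features, which is an assumption here because the exact variant and the value of `gamma` are not stated. The function names are hypothetical.

```python
import math

def class_weights(n_pos, n_neg):
    """Weights that compensate for the one-vs-all class imbalance."""
    total = n_pos + n_neg
    return total / n_pos, total / n_neg

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two non-negative histograms."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)
```

With 10 positive and 90 negative examples the positive class gets weight 10 and the negative class weight of about 1.11, so errors on the rare class cost roughly nine times more.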

2.6 Results and discussion

In this section, we present and discuss the results obtained from the experimental evaluation of the proposed method. First, we evaluate the performance of the proposed method on the ImageCLEF 2012 dataset. Next, we present the results obtained on this year's ImageCLEF 2013 dataset.

The first three rows in Table 1 show the results of our method applied to the ImageCLEF 2012 dataset. These results include visual, textual and mixed runs. The visual run has better predictive performance than the textual run, and the high-level feature fusion scheme increases the predictive performance further. Compared with the results of the groups that participated in the ImageCLEF 2012 medical task [13], our visual run is second best, and the textual and mixed runs are ranked first. The mixed run, with an accuracy of 77.0, would have ranked first in the overall ranking had we submitted it to last year's modality classification task.

The next three rows in Table 1 show the results of our method applied to this year's modality classification task. These results also include visual, textual and mixed runs. The accuracy of 78.04 obtained with the mixed run is the second best in the overall ranking. The high-level feature fusion scheme increases the predictive performance on this year's dataset as well.

3 Compound figure separation

Compound figures contain subfigures of several types; they cannot be classified into a single class and need to be separated before a detailed classification into figure types can be performed. In this work, an unsupervised technique for compound figure separation is proposed and implemented, based on a breadth-first search strategy that uses only visual information from the medical figures. All pixels in the figure are traversed, searching for enclosed regions separated by white borders. The sensitivity of the border detection is controlled by a threshold parameter. Regions smaller than a predefined size are discarded.
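The idea can be illustrated with a minimal sketch that runs a breadth-first search over a grayscale image stored as a list of rows and returns the bounding box of every connected non-white region. It is an illustration under stated assumptions, not the submitted implementation: the function name, the 4-connectivity, and the default values of `white_threshold` and `min_area` are hypothetical.

```python
from collections import deque

def separate_subfigures(image, white_threshold=250, min_area=4):
    """Return bounding boxes (top, left, bottom, right) of non-white regions.

    image: list of rows of grayscale values (0 = black, 255 = white).
    Pixels >= white_threshold count as border; small regions are discarded.
    """
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx] or image[sy][sx] >= white_threshold:
                continue
            # breadth-first search from an unvisited non-white pixel
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            top, left, bottom, right, area = sy, sx, sy, sx, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and image[ny][nx] < white_threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes
```

Lowering `white_threshold` makes more pixels count as border, which splits regions more aggressively; raising `min_area` removes small fragments such as labels or noise.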

In some of the figures the borders separating the subfigures are black; therefore, we invert such figures before applying our algorithm. For the given test dataset our algorithm correctly classified 68.59% of the figures.

4 Ad-hoc image retrieval

In this section, we give an overview of the application of our methods to ad-hoc medical image retrieval and present the results of our submitted runs. We participated only in the textual retrieval.

4.1 Proposed approach

The approach uses the image caption and the title of the medical article in which the image is referenced, i.e., the surrounding text. It combines word-space and concept-space approaches with the goal of achieving better overall retrieval performance.

The word-space component indexes and retrieves the surrounding text of the medical images in a traditional way. The surrounding text of the medical images is first preprocessed by removing stop words and stemming, and a standard inverted index is then created. In the retrieval phase, the system applies the same stop-word removal and stemming to the query. Weighting models are applied to calculate the relevance score of every document with respect to the given query. Once the scores are calculated, the documents are sorted and returned.
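The indexing and scoring steps can be sketched with a toy inverted index and a BM25 scorer, BM25 being one of the weighting models evaluated later in this section. The sketch assumes already-preprocessed token lists, uses a common BM25 formulation with default parameters `k1 = 1.2` and `b = 0.75`, and is not Terrier's implementation; all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: {doc_id: token list} -> inverted index {term: {doc_id: tf}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index

def bm25_search(query, docs, index, k1=1.2, b=0.75):
    """Score every document that shares a term with the query, best first."""
    n = len(docs)
    avg_len = sum(len(t) for t in docs.values()) / n
    scores = defaultdict(float)
    for term in query:
        postings = index.get(term, {})
        idf = math.log(1 + (n - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # length-normalized term frequency
            norm = tf + k1 * (1 - b + b * len(docs[doc_id]) / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Only documents that share at least one term with the query receive a score, which is the dependence on shared terms that motivates the concept-space discussion.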

The concept-space component analyzes the text in terms of the medical concepts it contains. The first step is to map the surrounding text of the medical images to medical concepts. The mapping can be done using a variety of toolkits, services or libraries, such as Metamap [14] or MeSH Up [15]. The problem in this approach arises in the way documents are indexed and then evaluated against queries in the retrieval phase. Classical information retrieval models depend, directly or indirectly, on the number of terms that the document and the query share to compute the relevance score [9]. However, the number of terms that a query and a document share in the word-space can be very different from the number they share in the concept-space. For example, if a query and a document share one term, "x-ray", in word-space, they can share up to six terms in concept-space [16]. On the other hand, if they share a phrase of two terms, "lung x-ray", in word-space, then they will share only one term in concept-space.

The results from both components are then normalized and passed to a fusion component (see Figure 3), which can use any of the known strategies for late fusion [17]. In this study, we used a simple linear combination of the normalized results.

Figure 3. Diagram of the retrieval framework. The text data (image captions / medical articles) and the text query are preprocessed (stemming, stop-word removal), then indexed and retrieved to produce the word-space results. In parallel, the text is mapped to concepts, and the concept data and concept query are indexed and retrieved to produce the concept-space results. Both result lists are normalized and passed to the fusion component, which produces the final results.

For the word-space approach, Terrier IR [9] is used as the search engine. In the preprocessing stage, the Porter stemmer [18] and stop-word removal are applied. In the retrieval phase, several weighting models were evaluated: PL2 [19], BM25 [19], BB2 [19], DFR-BM25 [19], TF-IDF [20] and DirichletLM [21]. An additional experiment was performed with query expansion on the best-performing model to test its maximum output.

The concept-space approach requires a mapping mechanism to match the text data to medical concepts. In this approach, Metamap is used as the mapping tool and the extracted medical concepts are UMLS concepts [14]. The mapping is performed only on the surrounding text of the medical images. After the concepts are extracted, a new artificial text is generated that contains only the UMLS concepts. The same process is repeated for the queries. Once the artificial text is constructed, it is passed to the search engine for indexing. Terrier IR indexes the artificial text with no additional preprocessing (no stemming and no stop-word removal). The retrieval is performed by passing the artificial queries to the search engine. In this phase, the same weighting models are applied as in the word-space approach. In essence, the concept-space approach can be viewed as a word-space approach with more complex preprocessing.

Before the fusion phase, the results from the word-space and concept-space are normalized using min-max normalization [22]. The normalized results are then passed to the fusion component, which applies a linear combination. This kind of fusion provides modularity and control over the extent to which each component influences the final result.
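The normalization and fusion step can be written down in a few lines. The sketch assumes that each component returns a non-empty dictionary of document scores and that a single weight `alpha` balances the two components; the weight used in the submitted runs is not stated, so `alpha = 0.5` is only a placeholder, and the function names are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_fusion(word_scores, concept_scores, alpha=0.5):
    """Combine normalized word-space and concept-space scores, best first.

    A document missing from one result list contributes 0 for that list.
    """
    w, c = min_max(word_scores), min_max(concept_scores)
    docs = set(w) | set(c)
    fused = {d: alpha * w.get(d, 0.0) + (1 - alpha) * c.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Setting `alpha` to 1 reduces the result to the word-space ranking and setting it to 0 gives the concept-space ranking, which is the kind of control over component influence described above.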

4.3 Evaluating on ImageCLEF 2012

The proposed framework was first evaluated on the ImageCLEF 2012 dataset. This phase is used to find the best weighting models and appropriate parameters. The results of the word-space assessment are shown in Table 2. They show that the BM25 model provides the best performance for the word-space retrieval. An additional experiment was performed with the best model by assigning weights to key words in the queries using the Terrier query language (for example, words such as "MRI" and "CT" are given a weight of 1.5). The results for the experiment with the word weights (BM25-ww) show an increase in performance.
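The query rewriting for the word-weight experiment might look like the sketch below. It assumes the `term^weight` boost syntax of the Terrier query language; the function name and the set of boosted words are hypothetical, apart from "MRI" and "CT", which are the examples given in the text.

```python
def weighted_query(terms, boosted, weight=1.5):
    """Append ^weight to every query term found in `boosted` (lowercase set)."""
    return " ".join(f"{t}^{weight}" if t.lower() in boosted else t for t in terms)
```

For instance, `weighted_query(["chest", "CT", "scan"], {"ct", "mri"})` produces the query string `chest CT^1.5 scan`.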

The results of the concept-space assessment are shown in Table 3. In this case the best results are obtained with the DirichletLM model.

The results of the mixed assessment are shown in Table 4. The mixed assessment consists of two experiments. The first combines the best word-space and concept-space approaches. The second combines the word-space approach with word weights and the concept-space approach.

Based on the results of the experiments on the ImageCLEF 2012 dataset, we prepared the runs for the ImageCLEF 2013 ad-hoc retrieval task: we repeated the experiments on the ImageCLEF 2013 data and submitted only the results of the best-performing techniques. For word-space text-based retrieval we submitted the run using the BM25 weighting model with word weights, and for concept-space text-based retrieval we submitted the run using the DirichletLM weighting model. Finally, for the mixed retrieval we submitted the linear combination of the two previous runs. The results of our runs on ImageCLEF 2013 are presented in Table 5.

5 Case-based retrieval

In this section, we give an overview of the application of our methods to case-based retrieval and present the results of our submitted runs. We participated only in the textual retrieval of the cases. The proposed approach for this task is similar to that for the ad-hoc retrieval task, with the difference that in this case the retrieval unit is a medical article, not an image.

The approach combines the word-space and concept-space, just as with the ad-hoc retrieval. For the word-space component, we index the entire text of the medical articles, which includes the title, abstract, article text and captions of the images in the article (we refer to this as "fulltext"). The indexing and retrieval are done using Terrier IR, and several weighting models are applied to analyze their performance on this type of task. For the concept-space component, only the title and abstract of the medical article are used for the extraction of medical concepts. The tool for medical concept extraction is Metamap and the extracted results are UMLS concepts. The rest of the process for the concept-space approach is identical to the concept-space ad-hoc retrieval. The final result is obtained by late fusion of both components using a linear combination.

5.2 Evaluating on ImageCLEF 2012

The proposed framework was again evaluated on the ImageCLEF 2012 dataset. The results of the word-space assessment are shown in Table 6. They show that the BM25 model provides the best performance for the word-space case-based retrieval. An additional experiment was performed with the best model by adding query expansion. The results for the experiment with query expansion (BM25-qe) show that query expansion increases retrieval performance by roughly 4%.

The results of the concept-space assessment are shown in Table 7. In this case the best results are obtained with the DirichletLM model. An additional experiment was performed using query expansion on the best-performing model, which provides an improvement of roughly 2%.

The results of the mixed assessment are shown in Table 8. The mixed assessment consists of two experiments. The first combines the best word-space and concept-space approaches. The second combines the word-space and concept-space approaches, both with added query expansion. Using the models and parameters selected in the experiments on the ImageCLEF 2012 dataset, the experiments on the ImageCLEF 2013 dataset were performed. The best results were obtained with the mixed experiment using query expansion.

1. de Herrera, A.G.S., Kalpathy-Cramer, J., Fushman, D.D., Antani, S., Müller, H.: Overview of the ImageCLEF 2013 medical tasks. In: Working Notes of CLEF 2013. (2013)

2. Dimitrovski, I., Loskovska, S.: Content-based retrieval system for X-ray images. In: International Congress on Image and Signal Processing. (2009) 2236–2240

3. Tommasi, T., Orabona, F., Caputo, B.: Discriminative cue integration for medical image annotation. Pattern Recognition Letters 29(15) (2008) 1996–2002

4. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73(2) (2007) 213–238

5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004) 91–110

6. van de Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9) (2010) 1582–1596

7. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.M.: Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence 99(1) (2010)

8. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition. (2006) 2169–2178

9. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Johnson, D.: Terrier information retrieval platform. In: Advances in Information Retrieval, Springer (2005) 517–519

10. Tommasi, T., Caputo, B., Welter, P., Güld, M., Deserno, T.: Overview of the CLEF 2009 medical image annotation track. In: Multilingual Information Access Evaluation II. Multimedia Experiments, LNCS 6242, Springer Berlin/Heidelberg (2010) 85–93

11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

12. Lin, H.T., Lin, C.J., Weng, R.C.: A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68 (2007) 267–276

13. Müller, H., de Herrera, A.G.S., Kalpathy-Cramer, J., Demner-Fushman, D., Antani, S., Eggel, I.: Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In: CLEF (Online Working Notes/Labs/Workshop). (2012)

14. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, American Medical Informatics Association (2001) 17

15. Trieschnigg, D., Pezik, P., Lee, V., De Jong, F., Kraaij, W., Rebholz-Schuhmann, D.: MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics 25(11) (2009) 1412–1418

16. Abdulahhad, K., Chevallet, J.P., Berrut, C., et al.: MRIM at ImageCLEF2012. From words to concepts: A new counting approach. In: Notebook Papers of Labs and Workshop (CLEF). (2012)

17. Müller, H., de Herrera, A.G.S., Kalpathy-Cramer, J., Fushman, D.D., Antani, S., Eggel, I.: Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. Working Notes of CLEF (2012)

18. Macdonald, C., Plachouras, V., He, B., Lioma, C., Ounis, I.: University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming. In: Accessing Multilingual Information Repositories. Springer (2006) 898–907

19. Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS) 20(4) (2002) 357–389

20. Hiemstra, D.: A probabilistic justification for using tf idf term weighting in information retrieval. International Journal on Digital Libraries 3(2) (2000) 131–139

21. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22(2) (2004) 179–214

22. Jain, A., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognition 38(12) (2005) 2270–2285