
Multi-disciplinary modality classification for medical images

Viktor Gal (0, 2), Illes Solt (1), Tom Gedeon (2), Mike Nachtegael (0)

solt@tmit.bme.hu, tom.gedeon@anu.edu.au, mike.nachtegael@ugent.be

(0) Department of Applied Mathematics and Computer Science, Ghent University, Belgium
(1) Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary
(2) School of Computer Science, The Australian National University, Australia

Modality is a key facet in medical image retrieval, as a user is likely interested in only one of e.g. radiology images, flowcharts, and pathology photos. While assessing image modality is trivial for humans, reliable automatic methods are required to deal with large un-annotated image bases, such as figures taken from the millions of scientific publications. We present a multi-disciplinary approach to tackle the classification problem by combining image features, meta-data, textual and referential information. Our system achieved an accuracy of 96.86 % in cross-validation on the ImageCLEF 2011 training dataset having 18 imbalanced modality classes, and an accuracy of 90.2 % on the ImageCLEF 2010 dataset having 8 well-balanced modality classes. We evaluate the importance of the individual feature sets in detail, and provide an error analysis pointing at weaknesses of our method and obstacles in the classification task. For the benefit of the image classification community, we make the results of our feature extraction methods publicly available at http://categorizer.tmit.bme.hu/~illes/imageclef2011modality.

Keywords: image classification, image feature extraction, image modality, text mining

Imaging modality is an important aspect of the image for medical retrieval [ 6 ]. In user studies, clinicians have indicated that modality is one of the most important filters by which they would like to be able to limit their search. However, the modality is typically extracted from the caption, where it is often incorrect or absent. Studies have shown that the modality can be extracted from the image itself using visual features [ 13, 10, 7 ]. Therefore, in this paper we propose to use both visual and textual features for medical image representation, and to combine the different features using a normalised kernel function in an SVM.

The proposed algorithm is evaluated in the context of the ImageCLEF 2011 Modality Classification task [ 9 ], which uses a dataset of 988+1024 images taken from PubMed articles.

The rest of this paper is organised as follows. In Section 2, we describe in detail our experimental setting. In Section 3, we present and compare the different runs we submitted. We discuss the submitted runs and the results in Section 4 and we conclude in Section 5.

2 Methods

2.1 Evaluation setting

In this section, we describe in detail our experimental setting.

The ImageCLEF 2011 Modality Classification task used split-validation, measuring the accuracy of the systems. On the training dataset, we performed stratified 10-fold cross-validation to evaluate feature sets and classifiers.
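As an illustration of the stratified cross-validation protocol mentioned above, the following minimal Python sketch assigns samples to folds so that every class is spread as evenly as possible, and then measures accuracy. The function names and the `train_and_predict` callback are hypothetical; the experiments reported here were run in Weka, not with this code.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0  # running counter so that small classes do not all land in fold 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def cross_validated_accuracy(samples, labels, train_and_predict, k=10):
    """train_and_predict(train_x, train_y, test_x) is a hypothetical callback
    that returns one predicted label per test sample."""
    correct = 0
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(labels)
```

With imbalanced classes, stratification keeps rare classes from being concentrated in a few folds, although a class with fewer samples than folds is still missing from some of them.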

2.2 Feature extraction

Caption text. Figures in scientific publications often have descriptive captions that provide information on the modality of the image. "Contrast-enhanced axial computed tomographic scan" and "HRCT showing extensive areas of consolidation with air bronchogram" are examples of captions of images assigned to the `CT' modality class. However, the caption may be missing or may not hint at the modality, e.g. "E. coli that satisfy the similarity threshold values." As the examples suggest, the linguistic constructs expressing modality can vary considerably. Considering these remarks, we extract binary features from caption texts as follows. We define a set of regular expressions to be matched against the caption text; a match results in a feature value of 1. Regular expressions were initially created for each word having a high information gain for any of the modality classes and were later manually refined to capture linguistic variations (e.g. f?MRI?) and multi-word phrases (e.g. error bars?).
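The binary caption features described above can be sketched in a few lines of Python. Only the patterns `f?MRI?` and `error bars?` are taken from the text; the `ct` pattern, the feature names, the word-boundary anchors and the case-insensitive matching are assumptions made for illustration.

```python
import re

# `f?MRI?` and `error bars?` are quoted in the text; the `ct` pattern and
# all feature names are hypothetical.
CAPTION_PATTERNS = {
    "mri": r"\bf?MRI?\b",
    "error_bars": r"\berror bars?\b",
    "ct": r"\b(?:HR)?CT\b|\bcomputed tomograph(?:y|ic)\b",
}
COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in CAPTION_PATTERNS.items()
}


def caption_features(caption):
    """Return one binary feature per pattern; a missing caption yields all zeros."""
    text = caption or ""
    return {name: int(bool(rx.search(text))) for name, rx in COMPILED.items()}
```

For example, `caption_features("HRCT showing extensive areas of consolidation with air bronchogram")` sets only the `ct` feature, whereas the E. coli caption quoted above sets none.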

MeSH terms. Scientific articles indexed by Medline/PubMed are tagged with MeSH terms (medical subject headings) by field experts. MeSH can be seen as a thesaurus for the life sciences containing entries like `Human', `Liver Neoplasms' and `Magnetic Resonance Imaging'; entries can be further qualified by e.g. `methods' or `pathology'. We hypothesise that an article's MeSH terms and its figures' modality are correlated, and hence define features corresponding to individual MeSH terms and qualifiers. A unique identifier for the article (e.g. PMID or DOI) is required to retrieve its MeSH annotations; however, such identifiers can be absent. As the number of MeSH terms, qualifiers and their combinations far exceeds the number of modality labels, we perform feature selection by keeping only those that are present for at least a predefined number of articles in the training set.
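The frequency-based selection of MeSH features can be sketched as follows. The threshold value, the function names and the `term/qualifier` string format are assumptions for illustration; the text does not state the threshold actually used.

```python
from collections import Counter


def select_mesh_features(article_terms, min_articles=2):
    """article_terms: dict mapping an article id to its set of MeSH terms
    and qualifiers. Keeps only entries seen in at least `min_articles`
    training articles (hypothetical threshold)."""
    counts = Counter()
    for terms in article_terms.values():
        counts.update(set(terms))
    return sorted(term for term, n in counts.items() if n >= min_articles)


def mesh_vector(terms, vocabulary):
    """Binary feature vector; an article without identifier (and hence
    without MeSH annotations) passes an empty set and gets all zeros."""
    present = set(terms)
    return [int(term in present) for term in vocabulary]
```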

Colour histogram. Colour histograms have been successfully applied in content-based image retrieval systems in the past; for a detailed review see [ 16 ]. Based on these studies we chose a histogram in the HSV colour space, quantising the hue and the saturation to three levels each and the value to four levels.

Based on this we defined the feature vector $f_{hist}$, where each element of the vector represents the normalised number of pixels in a given histogram bin.

Mean of pixels. Through manually supervised error analysis on the training set, we found that the images in the Graphic first-level group mostly have a white background. Hence, we defined a simple feature $f_{mean}(I_j) = \bar{I}_j$, which represents the mean value of the pixels in image $I_j$. By simply thresholding these values one can identify the images that belong to the Graphic group with very high accuracy.
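Both features can be sketched with the Python standard library. The sketch assumes a joint histogram over the quantised channels (3 x 3 x 4 bins) and a mean taken over all colour channels; the text does not specify either detail, so these are illustrative choices rather than the exact definitions used.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 3, 3, 4  # quantisation levels stated in the text


def _bin(value, n_bins):
    # value lies in [0, 1]; the upper end 1.0 is clamped into the last bin
    return min(int(value * n_bins), n_bins - 1)


def hsv_histogram(pixels):
    """pixels: list of (r, g, b) tuples with 8-bit channels.
    Returns a normalised joint histogram (assumed layout)."""
    hist = [0.0] * (H_BINS * S_BINS * V_BINS)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        index = (_bin(h, H_BINS) * S_BINS + _bin(s, S_BINS)) * V_BINS + _bin(v, V_BINS)
        hist[index] += 1
        count += 1
    return [c / count for c in hist] if count else hist


def mean_intensity(pixels):
    """Mean over all channels and pixels, scaled to [0, 1] (assumed definition)."""
    values = [channel / 255.0 for pixel in pixels for channel in pixel]
    return sum(values) / len(values) if values else 0.0
```

An image dominated by white background yields a mean close to 1, which is what makes a simple threshold on this value informative for the Graphic group.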

Axis recognition. The mean of pixels feature described above gave strong support for recognising images in the Graphic top-level group, but as this group consists of two sub-groups, Graphs and Drawing, a new feature was required to differentiate between them. Manual inspection of the images in these two categories, aided by a simple edge detector, reveals the main difference: images in the Graphs category mainly consist of horizontal and vertical lines (i.e. the x-y axes of a graph), whereas images in the Drawing category are mostly diagrams, in which the orientation of the lines is random.

Based on this idea we defined the following feature. Let $L_{I_j}$ be the set of all detected lines and $GL_{I_j}$ the set of good lines in an arbitrary image $I_j$, where a line is a good line if its orientation is horizontal or vertical and it lies at least a given margin away from the picture's border. The latter condition prevents the borders of an image from being counted as good lines.

Using these two sets we defined the feature

$$f_{lines}(I_j) = \frac{|GL_{I_j}|}{|L_{I_j}|} \qquad (1)$$

In order to detect the lines and their orientation in an image we used a simple Hough transform [ 4 ].
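Given line segments produced by a line detector such as the Hough transform, Eq. (1) can be sketched as below. The segment format `(x1, y1, x2, y2)`, the angular tolerance and the margin value are assumptions; the line detection step itself is not shown.

```python
import math


def is_good_line(line, width, height, margin=5, angle_tol_deg=2.0):
    """A line is 'good' if it is (nearly) horizontal or vertical and both
    end points keep at least `margin` pixels from the image border.
    `margin` and `angle_tol_deg` are hypothetical parameters."""
    x1, y1, x2, y2 = line
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    horizontal = min(angle, 180.0 - angle) <= angle_tol_deg
    vertical = abs(angle - 90.0) <= angle_tol_deg
    if not (horizontal or vertical):
        return False
    # Reject lines hugging the border so that image frames are not counted.
    inside_x = all(margin <= x <= width - 1 - margin for x in (x1, x2))
    inside_y = all(margin <= y <= height - 1 - margin for y in (y1, y2))
    return inside_x and inside_y


def f_lines(lines, width, height, margin=5, angle_tol_deg=2.0):
    """Ratio of good lines to all detected lines; 0.0 if none were detected."""
    if not lines:
        return 0.0
    good = sum(is_good_line(l, width, height, margin, angle_tol_deg) for l in lines)
    return good / len(lines)
```

A graph with two axes and little else would score close to 1, while a diagram with arbitrarily oriented connectors would score much lower.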

Skin detection. The images in the Dermatology category were among the most difficult to recognise. Not only was it the least represented category in the whole training set, with only seven examples (see Table 1), but its images are simple photographs (of various skin abnormalities) and thus have very similar characteristics to the images labelled as general photos. Hence, most of the previously defined features failed to distinguish the images in the Dermatology set from the others.

Using a simple skin detector algorithm [ 2 ] we defined a new feature $f_{skin}(I_j)$ for an image $I_j$:

$$f_{skin}(I_j) = \overline{SD(I_j)} \qquad (2)$$

where the function $SD(\cdot)$ calculates the skin-segmented binary image of an input image, and $\bar{I}_k$, as previously defined, is the mean value of image $I_k$.

Meta-data. We determine whether image post-processing software was used by analysing the meta-data stored in the EXIF section of JPEG files. For this, we analyse the `Comment' field to find mentions of commonly used image manipulation software (e.g. Adobe Photoshop, MS Paint). We also extract from the EXIF data whether the image is stored as grey-scale only.
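Returning to the skin feature of Eq. (2), a minimal sketch is given below. It assumes that the skin detector thresholds the chrominance channels of the YCbCr colour space; the ranges used (Cb in [77, 127], Cr in [133, 173]) are the values commonly quoted for the skin-colour map of [ 2 ] and are stated here as an assumption, as is the full-range RGB to YCbCr conversion.

```python
def is_skin(r, g, b):
    """Chrominance-based skin test on one 8-bit RGB pixel.
    Conversion constants follow the usual full-range (JPEG) YCbCr formulas;
    the threshold ranges are an assumption, see the lead-in."""
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return 77 <= cb <= 127 and 133 <= cr <= 173


def skin_feature(pixels):
    """Mean of the skin-segmented binary image, i.e. the fraction of
    pixels classified as skin; pixels is a list of (r, g, b) tuples."""
    mask = [1 if is_skin(r, g, b) else 0 for r, g, b in pixels]
    return sum(mask) / len(mask) if mask else 0.0
```

Because the feature is the mean of a binary mask, it lies in [0, 1] and can be read directly as the proportion of skin-coloured pixels in the image.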

Radiopaedia. Radiopaedia (http://radiopaedia.org) is a community wiki for radiology images and patient cases. Images are tagged by users with the body system depicted (e.g. Heart, Musculoskeletal) but, unfortunately for us, not with the type of radiology method used to create the image. Leveraging the mutual information between body systems and radiology methods, we derived features for modality classification by taking the output probabilities of a classifier trained to predict the body system shown in the image.

Bag of visual-words. The state of the art in content-based image retrieval has been significantly improved by the introduction of SIFT [ 11 ] features and the bag-of-words image representation [ 12, 8, 3, 14 ].

The bag-of-visual-words image representation is based on the bag of words (BoW) model in natural language processing (NLP). BoW is a popular method in NLP for representing documents. In this model a document is simply represented by the number of occurrences of the different words it contains. The idea behind this is that documents on the same topic contain similar words with similar numbers of occurrences (see LDA [ 1 ]).

In the case of an image, the basic idea of the bag-of-words model is that a set of local image patches is sampled using some method (e.g. densely or using a key-point detector) and a vector of visual descriptors is evaluated on each patch independently. In this paper we used the well-known SIFT descriptor on each patch. The SIFT descriptor computes a gradient orientation histogram within the support region. For each of eight orientation planes, the gradient image is sampled over a four by four grid of locations, resulting in a 128-dimensional feature vector for each region. In order to make the descriptor less sensitive to small changes in the position of the support region, and to put more emphasis on the gradients near the centre of the region, a Gaussian window function is used to assign a weight to the magnitude of each sample point.

After acquiring these SIFT features for all the images in the dataset, the final step is to convert the vector-represented patches to "codewords" (analogous to words in text documents), which also produces a "codebook" (analogous to a word dictionary). A codeword can be considered as a representative of several similar patches. In our case we performed k-means clustering over all the vectors. Codewords are then defined as the centres of the learned clusters. Thus, each patch in an image is mapped to a certain codeword through the clustering process, and the image can be represented by the histogram of the codewords.
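The codebook construction and the histogram representation can be illustrated with a small pure-Python sketch. It uses plain Lloyd iterations with a naive deterministic initialisation and toy low-dimensional vectors in place of 128-dimensional SIFT descriptors; the function names, the initialisation and the iteration count are assumptions, and a practical system would use an optimised clustering library.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centres):
    """Index of the centre closest to `vector`."""
    return min(range(len(centres)), key=lambda i: squared_distance(vector, centres[i]))


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; the first k vectors serve as initial centres
    (a simplification chosen to keep the sketch deterministic)."""
    centres = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centres)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster becomes empty
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def codeword_histogram(descriptors, codebook):
    """Count how many descriptors of one image map to each codeword."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1
    return hist
```

The codebook is learnt once over the descriptors of all images, after which `codeword_histogram` turns the variable-sized descriptor set of each image into a fixed-length vector.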

In our bag-of-visual-words model we used the tf-idf weighting scheme [ 15 ], which has proven to be a very successful approach for image retrieval. The tf part of the weighting scheme represents the number of features described by a given visual word. The frequency of a visual word in the image provides useful information about repeated structures and textures. The idf part captures the informativeness of visual words: visual words that appear in many different images are less informative than those that appear rarely.
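A minimal sketch of tf-idf weighting over codeword histograms follows. It assumes the common formulation in which the weight of word i in image d is (n_id / n_d) * log(N / n_i), with n_id the count of word i in image d, n_d the number of features in image d, N the number of images and n_i the number of images containing word i; the exact variant used in our runs is not spelled out here.

```python
import math


def tfidf_weight(histograms):
    """histograms: one codeword-count list per image, all of equal length.
    Returns the tf-idf weighted vectors (assumed formulation, see lead-in)."""
    n_images = len(histograms)
    n_words = len(histograms[0]) if histograms else 0
    # document frequency: number of images in which each visual word occurs
    doc_freq = [sum(1 for h in histograms if h[i] > 0) for i in range(n_words)]
    weighted = []
    for h in histograms:
        total = sum(h)
        row = []
        for i, count in enumerate(h):
            if total == 0 or doc_freq[i] == 0:
                row.append(0.0)
            else:
                row.append((count / total) * math.log(n_images / doc_freq[i]))
        weighted.append(row)
    return weighted
```

A visual word that occurs in every image receives an idf of log(1) = 0 and therefore no longer contributes to the representation.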

Other systems. The challenge organisers generously supplied participants with the predictions of their in-house system. This classification was automatic for the test set but, confusingly enough, the ground truth labels were used for the training set. In order to exploit this valuable resource, we used it as an input to our classifier, introducing artificial smoothing to avoid overfitting on this particular, otherwise noise-free, indicator variable. Also note that while split evaluation is sound in this setting, the cross-validation evaluation of those two runs is flawed (being over-optimistic) due to information leakage.

2.3 Classification

Based on the numerical and binary features of the images obtained through feature extraction, we perform vector space classification to predict the modality classes of unseen images. Among the classification algorithms available in Weka [ 5 ], we found the support vector machine SMO to have the best standalone performance over the full feature space in cross-validation on the ImageCLEF 2011 training dataset. We used SMO with default settings for the rest of the experiments unless stated otherwise.

3 Results

In this section, we provide the final results of the five submitted runs for the modality classification task. Table 2 shows the percentage of correctly classified images for the different feature set compositions. Comparing the result of our best submitted run with the best submitted run of the modality classification task, one can see that there is only a very small (0.88%) difference between the two runs.

The performance of the runs broken down by individual class is shown in Table 3 and in Figure 1. As can be seen in Figure 1, the systems perform well on classes with higher support, while performance drops to zero for some of the rarer classes. This behaviour is tolerated by accuracy, the main evaluation metric of the challenge, in contrast to a more pessimistic measure such as F-measure. Table 2 also shows which features were used in the different runs. It is important to see that omitting the Caption text features results in an accuracy loss of almost ten percent; see the difference between runs #3 and #4.
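The difference between accuracy and F-measure on imbalanced classes noted above can be made concrete with a toy calculation. The sketch below uses macro-averaged F1 as one possible F-measure variant; the choice of variant, the function names and the toy label distribution are assumptions and do not correspond to any figure reported in this paper.

```python
def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over the classes present in `gold`."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Toy data: a classifier that never predicts the rare class.
gold = ["CT"] * 95 + ["Dermatology"] * 5
predicted = ["CT"] * 100
```

On this toy data the accuracy is 0.95 even though the rare class is never recognised, whereas the macro-averaged F1 falls below 0.5 because the rare class contributes a score of zero.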

Using MeSH and Radiopaedia features gained us about one percent in accuracy.

The in-house modality classifier of the challenge organisers proved to be superior in predicting the `Dermatology' class (Table 3); however, its inferior performance on classes with higher support prevented it from being beneficial in combination (Table 2).

4.1 Other experiments

Motivated by the grouping of modality labels by the challenge organisers, we experimented with hierarchical classification. In particular, we applied a hierarchical, greedily ascending classifier scheme wrapping the baseline classifier. In this scheme, classification is first performed on the hierarchy's uppermost level (here: groups), then the most probable hierarchy node is selected, where classification continues recursively. For hierarchical classification, cross-validation results were inferior to those obtained from the baseline (flat) classifier.

5 Conclusion

In this paper, we proposed to extract different visual and textual features for medical image representation, and to fuse the extracted visual and textual features for modality classification. To extract visual features from the images, we used state-of-the-art methods such as bag-of-visual-words and standard ones such as the colour histogram, and introduced some heuristic representations of the images specialised for the ImageCLEF 2011 medical modality classification task.

With the feature extraction algorithms suggested in this paper and the SVM classifier, we achieved 2nd place in the ImageCLEF 2011 medical image modality classification task.

Acknowledgements

Viktor Gal was supported by Marie Curie Initial Training Networks (ITN) Ref. 238819 (FP7-PEOPLE-ITN-2008).

References

1. David Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

2. Chai and K N Ngan. Face segmentation using skin-color map in videophone applications. Circuits and Systems for Video Technology, IEEE Transactions on, 9(4):551–564, 1999.

3. Chum, Philbin, Sivic, and Isard. Total Recall: Automatic query expansion with a generative feature model for object retrieval. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, October 2007.

4. Duda. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 1972.

5. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

6. William R Hersh, Henning Muller, Jeffery R Jensen, Jianji Yang, Paul N Gorman, and Patrick Ruch. Advancing Biomedical Image Retrieval: Development and Analysis of a Test Collection. Journal of the American Medical Informatics Association, 13(5):488–496, 2006.

7. Jain. Image retrieval using color and shape. Pattern Recognition, 29(8):1233–1244, August 1996.

8. Jegou, Harzallah, and Schmid. A contextual dissimilarity measure for accurate and efficient image search. In Computer Vision and Pattern Recognition, 2007, IEEE Conference on, (CVPR '07), pages 1–8, 2007.

9. Jayashree Kalpathy-Cramer, Henning Muller, Steven Bedrick, Ivan Eggel, Alba Garcia Seco de Herrera, and Theodora Tsikrika. The CLEF 2011 medical image retrieval and classification tasks. In CLEF 2011 working notes, Amsterdam, The Netherlands, 2011.

10. Abolfazl Lakdashti and Moin. A New Content-Based Image Retrieval Approach Based on Pattern Orientation Histogram. In Andre Gagalowicz and Wilfried Philips, editors, Computer Vision/Computer Graphics Collaboration Techniques, pages 587–595. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2007.

11. David Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision - Volume 2, ICCV '99, pages 1150–, Washington, DC, USA, 1999. IEEE Computer Society.

12. Nister and Stewenius. Scalable Recognition with a Vocabulary Tree. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, pages 2161–2168, 2006.

13. Pentland, R W Picard, and Sclaroff. Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 18(3):233–254, 1996.

14. J. Philbin, O. Chum, M. Isard, J. Sivic, and Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.

15. Josef Sivic and Andrew Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In 9th IEEE International Conference on Computer Vision (ICCV 2003), pages 1470–1477. IEEE Computer Society, 2003.

16. Veltkamp. A survey of content-based image retrieval systems. Content-based image and video retrieval, 2002.