<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparative Study of Similarity Measures for Content-Based Medical Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>John Collins</string-name>
          <email>johncoll@mail.sfsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kazunori Okada</string-name>
          <email>kazokada@sfsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>San Francisco State University</institution>
          ,
          <addr-line>1600 Holloway Avenue, San Francisco, CA 94132</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This note summarizes the methodologies employed in our submissions for the medical retrieval subtask of the 2012 ImageCLEF competition. Our work aims to provide a systematic comparison of various similarity measures in the medical CBIR application context. Our system consists of standard bag-of-words features with SIFT. Computed features are then compared using various plug-in similarity measures, including diffusion distance and information-theoretic metric learning. This note provides the results of our experimental validation using the 2011 ImageCLEF dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>CBIR</kwd>
        <kwd>M-CBIR</kwd>
        <kwd>Content-Based</kwd>
        <kwd>Image Retrieval</kwd>
        <kwd>Medical</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
ImageCLEF [1–3] is a public standardized competition which focuses attention
on, among other things, medical CBIR (hereafter M-CBIR): CBIR [4–9] in which
all images are taken from figures in medical publications. This note focuses
on a subtask of M-CBIR 2012, the medical image retrieval task with image
data alone, without other text-based data. Previous work on M-CBIR has led
to the development of an array of specific/general and local/global features. For
examples, see SIFT [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], SURF [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] and Gabor Wavelets [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Despite the
relative maturity of feature design studies, similarity measures in CBIR have not
been investigated thoroughly. Previous studies in this regard [15–17] are still few,
and the lack is especially evident in the M-CBIR subfield.
      </p>
      <p>
        Addressing this shortcoming, this paper presents a comparative study of
M-CBIR with a comprehensive list of similarity measures of many types. Our study
shows that well-known measures tend to outperform more complex measures,
with the notable exception of the Diffusion Distance [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Further, we show
that learning a metric from a set of training data is worthwhile; our best result
comes from a combination of a metric learning transformation with
the Diffusion Distance.
This paper is organized as follows. Sections 2 and 3 will outline, respectively, our
methods of feature extraction and representation, and our comparative study
of similarity measures. Sections 4 and 5 will summarize our results and their
interpretation.
      </p>
      <p>Feature Extraction and Image Representation.
In this section we describe the three steps involved in
transforming an image into a feature vector.
First, we identify and extract SIFT features from all of the dataset images.
Second, we create a codebook of K representative features using K-means
clustering. Third, we generate a single vector per image as a normalized histogram
of these representative features. Beyond this basic three-step procedure, we
experiment with a number of standard transformations on the feature codebooks
to improve retrieval performance.</p>
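As a rough illustration, the second and third steps above can be sketched with a minimal numpy-only K-means (in practice a library implementation would be used); the function names and toy dimensions are ours, not from the paper's software.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Step two: Lloyd's K-means over SIFT descriptors pooled from the whole
    dataset (assumed already Z-score normalized); the k centers are the codebook."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)              # nearest-center label per feature
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):                      # keep old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(image_descriptors, centers):
    """Step three: represent one image as a normalized label histogram of length K."""
    dist = np.linalg.norm(image_descriptors[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()                      # a probability distribution
```

In the paper K = 1000; the toy usage below uses a much smaller k only for illustration.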
      <sec id="sec-1-1">
        <title>Image Representation</title>
        <p>From each image, we extract a variable number of features, which we classify
into K types using the codebook resulting from the bag-of-words model
described below. An image is then represented by the frequency distribution of
feature types in the image and is, by construction, a vector of length K.
Before calculating similarities, each vector is normalized so that it is a probability
distribution.</p>
      </sec>
      <sec id="sec-1-2">
        <title>SIFT: Scale Invariant Feature Transform</title>
        <p>
          SIFT [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ] is a proprietary algorithm that describes regions of interest within
an image as features which are both scale and rotation invariant. The positions of
these features, called keypoints, are determined by finding extrema of
difference-of-Gaussian images which are robust across multiple scales. Such regions are then
turned into 128-element SIFT feature vectors using local directional gradients
around the keypoint. We include 4 extra parameters, consisting of the 2
spatial coordinates of the keypoint's position within the image, the scale parameter
and the dominant-orientation parameter, for a total of 132 dimensions.
        </p>
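The 132-dimensional representation described above can be sketched as follows, assuming the 128-d descriptor and the keypoint geometry come from any SIFT implementation (the helper name is ours):

```python
import numpy as np

def augment_descriptor(desc128, x, y, scale, orientation):
    """Append the keypoint's spatial position, scale and dominant orientation
    to a 128-d SIFT descriptor, giving the 132-d vector used in this study."""
    return np.concatenate([desc128, [x, y, scale, orientation]])
```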
      </sec>
      <sec id="sec-1-3">
        <title>Bag-of-Words</title>
        <p>
          In order to generate a fixed-length vector for each image, we cluster all features
together in space using K-means clustering with a predefined vector length K.
Before clustering, each SIFT feature vector is centered and scaled using Z-score
normalization. In our case we chose K to be 1000, where this number was taken
from an earlier report in the same competition [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Each SIFT feature can
then be matched with one of the 1000 labels, 0–999, corresponding to the cluster
centers. We refer to this set of centers and the corresponding labels as a codebook.
This bag-of-words method yields the frequency distribution of these labels, 0–999,
which describes an image. The notion of a bag-of-words comes from textual data
mining and was originally proposed as a way of representing a text document
by its word frequency distribution, ignoring order. In the analogy here, SIFT
vectors are word instances and the K centers returned from K-means clustering
are the true words. Instead of instances being exact copies of a word, as in the
text mining case, in the image context a word instance is taken to represent
the center to which it is closest in distance.
        </p>
      </sec>
      <sec id="sec-1-4">
        <title>Data Transformations</title>
        <p>The following standard transformations were examined with the goal of improving
retrieval performance.</p>
        <p>
          PCA: Principal Components Analysis. PCA [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] is a technique used mainly
for dimension reduction. For a data matrix X, it seeks linear
combinations Y = Σ_{i=1}^n α_i x^{(i)} of the column vectors x^{(i)} of X such that the dimensions
of Y are uncorrelated. Moreover, dimensions in Y
are ordered from most to least important, where importance is defined in
terms of variance. In practice, the transformed data Y is often used for
dimension reduction, since one gets a variance-maximal m-dimensional
representation of X by taking the first m dimensions of Y. How small to make
m is data dependent; m is typically chosen to cover at least 95% or 99% of
the data's variance.
        </p>
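The choice of m can be sketched as follows; this is a numpy-only illustration under our own naming (the route through the SVD of the centered data is our assumption, not a detail from the paper):

```python
import numpy as np

def pca_components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative variance
    fraction reaches `target`, for a data matrix X (rows = observations)."""
    Xc = X - X.mean(axis=0)                     # center the data
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    var = s**2 / (s**2).sum()                   # per-component variance fraction
    return int(np.searchsorted(np.cumsum(var), target) + 1)
```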
        <p>We experimented by varying the number of dimensions in PCA with both the
2011 and 2012 ImageCLEF competition datasets; the results are shown
in Figure 1. We found the variance spread of these two datasets to be quite
large. Overall, using our image representation, the 2012 codebook captured
more variance in fewer components than did the 2011 codebook. However,
in both cases we found that it took most of the components to cover an
adequate amount of variance.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Tf-Idf: Term Frequency - Inverse Document Frequency</title>
        <p>This idea, like bag-of-words, comes from textual data mining. The goal is to penalize a
vector for words (features) whenever they are common across the entire
data set. The term frequency (Tf) of term i for an observation x is just the value at
term i's position, i.e. x_i. The inverse document frequency (Idf) is calculated
by Idf_t = log(|D| / |{d ∈ D : t ∈ d}|), where D is the dataset of observations and |{d ∈
D : t ∈ d}| is the number of observations which are non-zero in the t-th
position. For Tf-Idf, we transform each d ∈ D by the elementwise product d · Idf. In our case, we do not
explicitly measure the presence or non-presence of a feature but rather the
[Fig. 1: percentage of variance captured vs. number of principal components, with curves for the 2011 and 2012 principal components]
count of each feature. Thus, Tf-Idf provides a weighting of our images
which penalizes features that are very common in the data set and rewards
features otherwise.</p>
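A minimal sketch of this weighting on a count matrix (the function name is ours; the guard against empty columns is a numerical assumption, not discussed in the text):

```python
import numpy as np

def tfidf(X):
    """Apply Tf-Idf weighting to a matrix X (rows = images, cols = feature counts)."""
    df = np.count_nonzero(X, axis=0)            # images containing each feature
    idf = np.log(len(X) / np.maximum(df, 1))    # Idf_t = log(|D| / df_t)
    return X * idf                              # elementwise per-column weighting
```

Features present in every image get Idf = log(1) = 0 and are thus fully suppressed.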
        <p>In the course of our study we experimented not just with PCA and Tf-Idf, but
also with nestings of these operations. In short, for our dataset X, we compute
the following data transformations.
1. PCA(X)
2. Tf-Idf(X)
3. PCA(Tf-Idf(X))
4. Tf-Idf(PCA(X))</p>
        <p>Database Ranking by Similarity Comparison.
Given a query image, the goal here is to calculate the similarities or distances
between it and each of the images in the database. The first image returned
is then the most similar, the second the second most similar, and
so on. In some cases a query may consist of multiple images; in this case, we
calculate the average similarity of the query parts to each database image as
the representative score. The subjectivity inherent in the idea of similarity is
reflected in the varying types of similarity measures which can be defined. In
some cases below, e.g. cosine similarity, a measure has its natural expression
as a similarity rather than a dissimilarity measure. However, in most cases the
natural definition is as a dissimilarity measure. We shall use d when referring to
a dissimilarity measure and s when referring to a similarity measure. The idea
of calculating similarity as an additive inverse of distance comes from the idea of
a metric. A metric on a set X is a mapping d : X × X → R such that, for all x, y, z ∈ X,
the following conditions hold: d(x, y) ≥ 0; d(x, y) = 0 if and only if x = y;
d(x, y) = d(y, x); and d(x, z) ≤ d(x, y) + d(y, z).</p>
        <p>We use the broader term measure because, in some cases, what we use will fail
one or more of the conditions above. For example, the Kullback–Leibler
divergence is not symmetric, since in general d(x, y) ≠ d(y, x). Finally, when a
dissimilarity measure is being considered, it should be understood that we are
using 1 − d(x, y) to calculate the similarity, where x and y are appropriately
scaled so that d(x, y) ∈ [0, 1].</p>
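The ranking procedure for a (possibly multi-part) query can be sketched as follows, assuming the dissimilarity has already been scaled into [0, 1] as described; the helper names are ours:

```python
import numpy as np

def rank_database(query_vecs, db_vecs, dissim):
    """Rank database images by average similarity (1 - d) over all query parts.
    Returns (indices best-first, scores in that order)."""
    scores = []
    for v in db_vecs:
        scores.append(np.mean([1.0 - dissim(q, v) for q in query_vecs]))
    order = np.argsort(scores)[::-1]            # most similar first
    return order, np.asarray(scores)[order]
```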
      </sec>
      <sec id="sec-1-6">
        <title>Various Similarity Measures</title>
        <p>The following lists the similarity and dissimilarity measures we considered in our study.
Let x denote the vector (x_1, x_2, ..., x_n) representing the query image and y the
vector (y_1, y_2, ..., y_n) representing another image. Further, let x̄ represent the
mean of the values in the x vector and ȳ the mean of y. Let X and
Y represent, respectively, the cumulative distributions of x and y when they
are considered as probability distributions (Σ_{i=1}^n x_i = Σ_{i=1}^n y_i = 1); that is,
X = (X_1, X_2, ..., X_n) where X_j = Σ_{i=1}^j x_i, and similarly for Y and y. Finally,
μ = (μ_1, ..., μ_n) is the mean vector μ = (x + y)/2.</p>
        <p>– Minkowski and Standard Measures
Euclidean Distance (L2): d(x, y) = sqrt(Σ_{i=1}^n (x_i − y_i)²)
Cityblock Distance (L1): d(x, y) = Σ_{i=1}^n |x_i − y_i|
Infinity Distance (L∞): d(x, y) = max_{i=1..n} |x_i − y_i|
Cosine Similarity (CO): s(x, y) = (x · y) / (‖x‖ ‖y‖)</p>
        <p>– Statistical Measures
Pearson Correlation Coefficient (CC): d(x, y) = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / sqrt(Σ_{i=1}^n (x_i − x̄)² (y_i − ȳ)²)</p>
        <p>– Divergence Measures
Chi-Square Dissimilarity (CS): d(x, y) = Σ_{i=1}^n (x_i − μ_i)² / μ_i [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]
Kullback–Leibler Divergence (KL): d(x, y) = Σ_{i=1}^n x_i log(x_i / y_i) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]
Jeffrey Divergence (JF): d(x, y) = Σ_{i=1}^n (x_i log(x_i / μ_i) + y_i log(y_i / μ_i)) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]
Kolmogorov–Smirnov Divergence (KS): d(x, y) = max_{i=1..n} |X_i − Y_i| [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]
Cramér–von Mises Divergence (CvM): d(x, y) = Σ_{i=1}^n (X_i − Y_i)² [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]</p>
        <p>– Other Measures
Earth Mover's Distance (EMD-L1): d(x, y) = Σ_{i=1}^n |X_i − Y_i| [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]¹
Diffusion Distance (DD): d(x, y) = Σ_{l=1}^{log₂ n} Σ_{j=1}^{n/2^l} |z_j^{(l)}|, where z^{(0)} = x − y
and z^{(l)} is the l-times iteratively Gaussian-smoothed, then 2-downsampled
version of z^{(0)} [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
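A few of the measures above in numpy form, as a sketch under our own naming; the small epsilon in KL and the zero-bin mask in chi-square are numerical-stability assumptions not discussed in the text:

```python
import numpy as np

def l1(x, y):   return np.abs(x - y).sum()                 # Cityblock (L1)
def l2(x, y):   return np.sqrt(((x - y) ** 2).sum())       # Euclidean (L2)
def linf(x, y): return np.abs(x - y).max()                 # Infinity distance

def cosine_sim(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def chi_square(x, y):
    m = (x + y) / 2.0                                      # mean vector mu
    mask = m > 0                                           # skip empty bins
    return (((x - m) ** 2)[mask] / m[mask]).sum()

def kl(x, y, eps=1e-12):
    return (x * np.log((x + eps) / (y + eps))).sum()       # not symmetric

def emd_1d(x, y):
    # for 1-D histograms, EMD reduces to the L1 distance between the CDFs
    return np.abs(np.cumsum(x) - np.cumsum(y)).sum()
```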
      </sec>
      <sec id="sec-1-7">
        <title>Metric Learning</title>
        <p>
          Metric Learning [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] is the process of using information about the similarity
and/or dissimilarity of some dataset X to learn a mapping to a new space
Y = A^{1/2}X, in which similar data will be closer together and dissimilar data will
be farther apart. Let ω denote an n-dimensional vector in which ω_i determines
the weight given to the i-th variable. With such an ω we can define a weighted L2
metric on X such that, for each x and y in X, we capture the distance between
them by d_ω(x, y) = sqrt(Σ_{i=1}^n ω_i (x_i − y_i)²). The idea of metric learning is to learn
the appropriate weights from a training dataset. A less strict formulation of
metric learning allows the weights to be described by a non-diagonal symmetric
positive semi-definite matrix A such that ω = diag(A), leading to a more general
Mahalanobis-type metric formulation:
d_A(x, y) = ‖x − y‖_A = sqrt((x − y)^T A (x − y))    (1)
Many algorithms [26–28] have been used to learn such a metric, with Yang [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]
giving a nice summary. We employ a widely used algorithm called Information-Theoretic
Metric Learning (hereafter ITML). ITML uses an
information-theoretic cost model which iteratively enforces similarity/dissimilarity constraints;
the input is a list of such pairwise constraints and the output is a
learned matrix A. An equivalent and more computationally efficient formulation
to the one above is to use the L2 metric on the data after applying the data
transformation X ↦ A^{1/2}X. In this study, we employ the diagonal form of A
for simplicity, with information about similarity/dissimilarity obtained from the
2011 ImageCLEF dataset as our training data.
        </p>
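The diagonal case used in this study, and its equivalence to the plain L2 metric after the transformation X ↦ A^{1/2}X, can be sketched as follows (function names ours):

```python
import numpy as np

def mahalanobis_diag(x, y, w):
    """Weighted L2 (diagonal-A Mahalanobis) distance, with w = diag(A), w_i >= 0."""
    d = x - y
    return np.sqrt((w * d * d).sum())

def transform(X, w):
    """Equivalent view: map data by A^(1/2), then use the plain L2 metric."""
    return X * np.sqrt(w)
```

The equivalence means the weighted distance between x and y equals the ordinary Euclidean distance between their transformed versions.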
      </sec>
      <sec id="sec-1-8">
        <title>Query Filtering</title>
        <p>
          We used the Modality Classification results made available by ImageCLEF to
filter out certain image types which are likely to be irrelevant to all queries.
Table 1 indicates the filtering performed. In short, we included all and only
diagnostic images.
¹ EMD for 1-D features is equivalent to the Mallows Distance [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]
Using the relevance judgments from 2011 ImageCLEF, we validate our proposed
system. Table 2 shows the Mean Average Precision (hereafter MAP) scores for
various permutations of our system components, computed using the relevance
judgment file from the 2011 results.
        </p>
        <p>We used this table to select our best potential measure/transformation
combinations for the 2012 ImageCLEF competition. In the end we submitted the following
seven runs to the 2012 ImageCLEF medical retrieval competition.
1. L1 on the untransformed data (reg cityblock)
2. DD on the untransformed data (reg diffusion)
3. L2 on the Tf-Idf(PCA) transformed data (tfidf of pca euclidean)
4. CO on the Tf-Idf(PCA) transformed data (tfidf of pca cosine)
5. CC on the Tf-Idf(PCA) transformed data (tfidf of pca correlation)</p>
        <sec id="sec-1-8-1">
          <title>6. L1 on the ITML data (itml cityblock)</title>
        </sec>
        <sec id="sec-1-8-2">
          <title>7. DD on the ITML data (itml diffusion)</title>
          <p>
            These selected runs are identified in Table 2 as highlighted items. Submissions to ImageCLEF medical retrieval [30, 31] are text files containing a ranked list of at most 1000 images for each of the competition queries, along with information such as the rank, query number and score. These submission files are constructed
in the TREC-style submission format [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ].
          </p>
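A hypothetical writer for such a run file, assuming the standard TREC columns (query id, a literal "Q0", image id, rank, score, run tag); the exact column conventions for an ImageCLEF submission are an assumption here and should be checked against the task guidelines:

```python
def write_trec_run(path, run_tag, results):
    """Write a TREC-style run file.
    results: {query_id: [(image_id, score), ...]} already sorted best-first;
    at most 1000 rows are emitted per query, as the task requires."""
    with open(path, "w") as f:
        for qid, ranked in results.items():
            for rank, (img, score) in enumerate(ranked[:1000], start=1):
                f.write(f"{qid} Q0 {img} {rank} {score:.6f} {run_tag}\n")
```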
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Discussion</title>
      <p>We have presented a systematic comparison of various plug-in (dis-)similarity
measures for M-CBIR with a standard bag-of-words feature method. Our
validation results with last year's 2011 dataset indicate that both ITML and the diffusion
distance are promising choices for the ad-hoc image-based retrieval task for
medical images. Based on this result, we entered seven runs (combinations
of the three top-performing measures with different feature transformations). The
results were disappointing: all of our runs placed last in this
category, with very low MAP scores, in this year's competition. The reasons for this
performance may include a potentially suboptimal choice of feature
extraction/representation and of the query filtering employed. Investigating this, and
rerunning our study with a better base CBIR system, is important future work.
Among our 2012 results, we observe a consistent trend of the diffusion and
cityblock distances performing best among our submitted runs. This indicates
the virtue of distance measures based on the L1 metric. The run with metric
learning (ITML) placed last in our list. This may indicate a significant change
in data characteristics between the 2011 and 2012 data, which would naturally
cause the reduced performance. Investigating the true advantage of the metric
learning approach in M-CBIR remains another item of future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          , and B. Caputo, eds.,
          <source>ImageCLEF: Experimental Evaluation in Visual Information Retrieval (The Information Retrieval Series)</source>
          . Springer, 1st ed., Aug.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          , H. Muller, and M. Sanderson, \
          <article-title>Seven Years of Image Retrieval Evaluation," in ImageCLEF (H</article-title>
          . Muller, P. Clough,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and W. B. Croft, eds.), vol.
          <volume>32</volume>
          of The Information Retrieval Series, Springer Berlin Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Muller</surname>
          </string-name>
          and J. Kalpathy-Cramer, \
          <article-title>The Medical Image Retrieval Task," in ImageCLEF (H</article-title>
          . Muller, P. Clough,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and W. B. Croft, eds.), vol.
          <volume>32</volume>
          of The Information Retrieval Series, Springer Berlin Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. W. M.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          , \
          <article-title>Content-based image retrieval at the end of the early years,"</article-title>
          <source>IEEE Trans. Pattern Anal. and Machine Intell</source>
          ., vol.
          <volume>22</volume>
          , no.
          <issue>12</issue>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Michoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bandon</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Geissbuhler</surname>
          </string-name>
          , \
          <article-title>A review of content-based image retrieval systems in medical applications|clinical bene ts and future directions,"</article-title>
          <source>Intl. J. Medical Informatics</source>
          , vol.
          <volume>73</volume>
          , no.
          <issue>1</issue>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Desai</surname>
          </string-name>
          , \
          <article-title>Medical image retrieval and registration: towards computer assisted diagnostic approach,"</article-title>
          <source>in Proc. IDEAS Workshop on Medical Information Systems: The Digital Hospital</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Deserno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Antani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Long</surname>
          </string-name>
          , \
          <article-title>Ontology of Gaps in Content-Based Image Retrieval,"</article-title>
          <source>Journal of Digital Imaging</source>
          , vol.
          <volume>22</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Wein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dahmen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bredno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vogelsang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohnen</surname>
          </string-name>
          , \
          <article-title>Content-based image retrieval in medical applications: a novel multistep approach," in</article-title>
          <string-name>
            <surname>SPIE (M. M. Yeung</surname>
            ,
            <given-names>B.-L.</given-names>
          </string-name>
          <string-name>
            <surname>Yeo</surname>
          </string-name>
          , and
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>A</article-title>
          . Bouman, eds.), vol.
          <volume>3972</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marchiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brodley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pavlopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Broderick</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Aisen</surname>
          </string-name>
          , \
          <article-title>CBIR for medical images - an evaluation trial,"</article-title>
          <source>in IEEE Workshop on Content-based Access of Image and Video Libraries</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          , \
          <article-title>Object recognition from local scale-invariant features,"</article-title>
          <source>in Proc. IEEE Int. Conf. Computer Vision</source>
          , vol.
          <volume>2</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          , \
          <article-title>Distinctive Image Features from Scale-invariant Keypoints,"</article-title>
          <source>Int. J. Computer Vision</source>
          , vol.
          <volume>60</volume>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          , and
          <string-name>
            <surname>L. Van Gool</surname>
          </string-name>
          , \SURF:
          <article-title>Speeded Up Robust Features,"</article-title>
          <source>in Proc. European Conf</source>
          . Computer
          <string-name>
            <surname>Vision</surname>
          </string-name>
          (A.
          <string-name>
            <surname>Leonardis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Bischof</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A</surname>
          </string-name>
          . Pinz, eds.), vol.
          <volume>3951</volume>
          of Lecture Notes in Computer Science, Springer Berlin / Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Gool</surname>
          </string-name>
          , \
          <article-title>Speeded-up robust features (SURF)," Computer Vision and Image Understanding</article-title>
          , vol.
          <volume>110</volume>
          , no.
          <issue>3</issue>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          , \
          <article-title>Image representation using 2D gabor wavelets,"</article-title>
          <source>IEEE Trans. Pattern Anal. and Machine Intell</source>
          ., vol.
          <volume>18</volume>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pele</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Werman</surname>
          </string-name>
          , "
          <article-title>The Quadratic-Chi Histogram Distance Family,"</article-title>
          <source>in Proc. European Conf. Computer Vision</source>
          (K. Daniilidis, P. Maragos, and N. Paragios, eds.), vol.
          <volume>6312</volume>
          of Lecture Notes in Computer Science, Springer Berlin / Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rubner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomasi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          , "
          <article-title>A metric for distributions with applications to image databases,"</article-title>
          <source>in Proc. IEEE Int. Conf. Computer Vision</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Puzicha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Buhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rubner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomasi</surname>
          </string-name>
          , "
          <article-title>Empirical evaluation of dissimilarity measures for color and texture,"</article-title>
          <source>in Proc. IEEE Int. Conf. Computer Vision</source>
          , vol.
          <volume>2</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ling</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Okada</surname>
          </string-name>
          , "
          <article-title>Diffusion Distance for Histogram Comparison,"</article-title>
          <source>in Proc. IEEE Conf. Computer Vision and Pattern Recognition</source>
          , vol.
          <volume>1</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>U.</given-names>
            <surname>Avni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldberger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Greenspan</surname>
          </string-name>
          , "
          <article-title>Medical image classification at Tel Aviv and Bar Ilan Universities,"</article-title>
          <source>in ImageCLEF</source>
          (H. Müller, P. Clough,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and W. B. Croft, eds.), vol.
          <volume>32</volume>
          of The Information Retrieval Series, Springer Berlin Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Duda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Hart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Stork</surname>
          </string-name>
          ,
          <source>Pattern Classification</source>
          . Wiley, 2 ed., Nov.
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Puzicha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Buhmann</surname>
          </string-name>
          , "
          <article-title>Non-parametric similarity measures for unsupervised texture segmentation and image retrieval,"</article-title>
          <source>in Proc. IEEE Conf. Computer Vision and Pattern Recognition</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ojala</surname>
          </string-name>
          , M. Pietikainen, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Harwood</surname>
          </string-name>
          , "
          <article-title>A comparative study of texture measures with classification based on featured distributions,"</article-title>
          <source>Pattern Recognition</source>
          , vol.
          <volume>29</volume>
          , no.
          <issue>1</issue>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Geman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Graffigne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dong</surname>
          </string-name>
          , "
          <article-title>Boundary detection by constrained optimization,"</article-title>
          <source>IEEE Trans. Pattern Anal. and Machine Intell</source>
          ., vol.
          <volume>12</volume>
          , no.
          <issue>7</issue>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ling</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Okada</surname>
          </string-name>
          , "
          <article-title>An Efficient Earth Mover's Distance Algorithm for Robust Histogram Comparison,"</article-title>
          <source>IEEE Trans. Pattern Anal. and Machine Intell</source>
          ., vol.
          <volume>29</volume>
          , no.
          <issue>5</issue>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Levina</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bickel</surname>
          </string-name>
          , "
          <article-title>The Earth Mover's distance is the Mallows distance: some insights from statistics,"</article-title>
          <source>in Proc. IEEE Int. Conf. Computer Vision</source>
          , vol.
          <volume>2</volume>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          , "
          <article-title>Distance metric learning with application to clustering with side-information,"</article-title>
          <source>in Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>15</volume>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          , "
          <article-title>Information-theoretic metric learning,"</article-title>
          <source>in Proc. Intl. Conf. Machine learning</source>
          , (New York, NY, USA), ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lanckriet</surname>
          </string-name>
          , "
          <article-title>Metric Learning to Rank,"</article-title>
          <source>in Proc. Intl. Conf. Machine learning</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
          , "
          <article-title>Distance Metric Learning: A Comprehensive Survey,"</article-title>
          tech. rep., Department of Computer Science and Engineering, Michigan State University,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalpathy-Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Antani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Eggel</surname>
          </string-name>
          , "
          <article-title>Overview of the ImageCLEF 2012 medical image retrieval and classification tasks,"</article-title>
          <source>CLEF 2012 working notes</source>
          , Sept.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalpathy-Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bedrick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Hersh</surname>
          </string-name>
          , "
          <article-title>Relevance Judgments for Image Retrieval Evaluation,"</article-title>
          <source>in ImageCLEF</source>
          (H. Müller, P. Clough,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and W. B. Croft, eds.), vol.
          <volume>32</volume>
          of The Information Retrieval Series, Springer Berlin Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>N.</given-names>
            <surname>Stokes</surname>
          </string-name>
          , "
          <article-title>TREC: Experiment and Evaluation in Information Retrieval,"</article-title>
          <source>Computational Linguistics</source>
          , vol.
          <volume>32</volume>
          , pp.
          <fpage>563</fpage>
          -
          <lpage>567</lpage>
          , Nov.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>