=Paper=
{{Paper
|id=Vol-1178/CLEF2012wn-ImageCLEF-CollinsEt2012
|storemode=property
|title=A Comparative Study of Similarity Measures for Content-Based Medical Image Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1178/CLEF2012wn-ImageCLEF-CollinsEt2012.pdf
|volume=Vol-1178
}}
==A Comparative Study of Similarity Measures for Content-Based Medical Image Retrieval==
A Comparative Study of Similarity Measures for Content-Based Medical Image Retrieval

John Collins, Kazunori Okada
San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132, USA
johncoll@mail.sfsu.edu, kazokada@sfsu.edu

Abstract. This note summarizes the methodologies employed in our submissions for the medical retrieval subtask of the 2012 ImageCLEF competition. Our work aims to provide a systematic comparison of various similarity measures in the medical CBIR application context. Our system consists of the standard bag-of-words features with SIFT. Computed features are then compared using various plug-in similarity measures, including diffusion distance and information-theoretic metric learning. This note also provides the results of our experimental validation using the 2011 ImageCLEF dataset.

Keywords: ImageCLEF, CBIR, M-CBIR, Content-Based, Image Retrieval, Medical

1 Introduction

ImageCLEF [1–3] is a public standardized competition which focuses attention on, among other things, medical CBIR (hereafter M-CBIR): CBIR [4–9] in which all images are taken from figures in medical publications. This note focuses on a subtask of M-CBIR 2012, the medical image retrieval task using image data alone, without other text-based data.

Previous work on M-CBIR has led to the development of an array of specific/general and local/global features; for examples, see SIFT [10, 11], SURF [12, 13] and Gabor wavelets [14]. Despite the relative maturity of feature design studies, similarity measures in CBIR have not been investigated as thoroughly. Previous studies in this regard [15–17] are still few, and the lack is especially evident in the M-CBIR subfield. Addressing this shortcoming, this paper presents a comparative study of M-CBIR with a comprehensive list of similarity measures of many types. Our study shows that well-known measures tend to outperform more complex measures, with the notable exception of the diffusion distance [18]. Further, we show that learning a metric from a set of training data is worthwhile: our best result comes from a metric learning transformation combined with the diffusion distance.

This paper is organized as follows. Sections 2 and 3 outline, respectively, our methods of feature extraction and representation, and our comparative study of similarity measures. Sections 4 and 5 summarize our results and their interpretation.

2 Feature Extraction and Image Representation

In this section we describe the process of transforming an image into a feature vector, which consists of the following three steps. First, we identify and extract SIFT features from all of the dataset images. Second, we create a codebook of K representative features using K-means clustering. Third, we generate a single vector per image as a normalized histogram of these representative features. Beyond this basic three-step procedure, we experiment with a number of standard transformations on the feature codebooks for better retrieval performance.

2.1 Image Representation

From each image, we extract a variable number of features which we classify into K types using the codebook resulting from the bag-of-words model described below. An image is then represented by the frequency distribution of feature types in the image and is, by construction, a vector of length K. Before calculating similarities, each vector is normalized so that it is a probability distribution.
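The representation just described is straightforward to express in code. Below is a minimal sketch (not the authors' actual implementation) of mapping an image's local descriptors to codebook labels and building the normalized histogram of Sec. 2.1; `descriptors` and `centers` are assumed to be NumPy arrays of shapes (M, 132) and (K, 132), and all names are illustrative.

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Assign each local descriptor to its nearest codebook center and
    return the normalized label-frequency histogram of Sec. 2.1."""
    K = centers.shape[0]
    # Squared Euclidean distances: one row per descriptor, one column per center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)             # codebook label of each descriptor
    hist = np.bincount(labels, minlength=K).astype(float)
    return hist / hist.sum()               # normalize to a probability distribution
```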
2.2 SIFT: Scale Invariant Feature Transform

SIFT [10, 11] is a patented algorithm that describes regions of interest within an image as features which are both scale and rotation invariant. The positions of these features, called keypoints, are determined by finding extrema of difference-of-Gaussian images which are robust across multiple scales. Each such region is then turned into a 128-element SIFT feature vector using local directional gradients around the keypoint. We include 4 extra parameters, consisting of the 2 spatial coordinates of the keypoint's position within the image, the scale parameter, and the dominant-orientation parameter, for a total of 132 dimensions.

2.3 Bag-of-Words

In order to generate a fixed-length vector for each image, we cluster all features together in feature space using K-means clustering with a predefined vector length K. Before clustering, each SIFT feature vector is centered and scaled using Z-score normalization. In our case we chose K to be 1000, a number taken from an earlier report in the same competition [19]. Each SIFT feature can then be matched with one of the 1000 labels, 0–999, corresponding to the cluster centers. We refer to this set of centers and the corresponding labels as a codebook. This bag-of-words method yields the frequency distribution of these labels, which describes an image.

The notion of a bag-of-words comes from textual data mining, where it was originally proposed as a way of representing a text document by its word frequency distribution, ignoring order. In the analogy here, SIFT vectors are word instances and the K centers returned from K-means clustering are the true words. Instead of an instance being an exact copy of its word, as in the text mining case, in the image context a word instance is ascribed to the center to which it is closest in distance.

2.4 Data Transformations

The following standard transformations were examined with the goal of improving performance.

PCA: Principal Components Analysis. PCA [20] is a technique used mainly for dimension reduction. For a space X, it seeks the linear combination $Y = \sum_{i=1}^{n} \lambda_i x^{(i)}$, for column vectors $x^{(i)}$ of X, such that the dimensions of Y are uncorrelated. Moreover, the dimensions of Y are ordered from most to least important, where importance is defined in terms of variance. In practice, the transformed data Y is often used for dimension reduction, since one gets a variance-maximal m-dimensional representation of X by taking the first m dimensions of Y. How small to make m is data dependent; m is typically chosen to cover at least 95% or 99% of the data's variance. We experimented by varying the number of dimensions in PCA with both the 2011 and 2012 ImageCLEF competition datasets; the results are shown in Figure 1. We found the variance spread of the two datasets to differ considerably. Overall, using our image representation, the 2012 codebook captured more variance in fewer components than did the 2011 codebook. However, in both cases we found that it took most of the components to cover an adequate amount of variance.

[Fig. 1: Variance captured by principal components. The plot shows the percentage of variance captured versus the number of principal components (0–1000), with one curve each for the 2011 and 2012 codebooks.]
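As an illustration of the variance-threshold rule above, the following sketch (assuming scikit-learn; the `0.95` default and the function name are our own, not the authors') picks the smallest m whose components cover a target fraction of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(X, variance=0.95):
    """Project rows of X onto the smallest number m of principal
    components whose cumulative explained variance reaches `variance`
    (cf. the 95%/99% rule of thumb above)."""
    cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    m = min(int(np.searchsorted(cum, variance)) + 1, len(cum))
    # (scikit-learn also accepts PCA(n_components=0.95) to do this directly)
    return PCA(n_components=m).fit_transform(X), m
```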
Tf-Idf: Term Frequency - Inverse Document Frequency. This idea, like bag-of-words, comes from textual data mining. The goal is to penalize a vector for words (features) that are common across the entire dataset. The term frequency (Tf) for an observation x is just the value at term t's position, i.e., $x_t$. The inverse document frequency (Idf) is calculated by

$$\mathrm{Idf}_t = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

where D is the dataset of observations and $|\{d \in D : t \in d\}|$ is the number of observations which are non-zero in the t-th position. For Tf-Idf, we transform each $d \in D$ by $d \cdot \mathrm{Idf}$. In our case, we do not explicitly measure the presence or absence of a feature but rather the count of each feature. Thus, Tf-Idf provides a weighting of our images which penalizes features that are very common in the dataset and rewards features otherwise.

In the course of our study we experimented not just with PCA and Tf-Idf, but also with nestings of these operations. In short, for our dataset X, we compute the following data transformations:

1. PCA(X)
2. Tf-Idf(X)
3. PCA(Tf-Idf(X))
4. Tf-Idf(PCA(X))

3 Database Ranking by Similarity Comparison

Given a query image, the goal is to calculate the similarities or distances between it and each of the images in the database. The first image returned is then the most similar, the second the second most similar, and so on. In some cases a query may consist of multiple images; in this case, we use the average similarity of the query parts to each database image as the representative score.

The subjectivity inherent in the idea of similarity is reflected in the varying types of similarity measures which can be defined. In some cases below, e.g. cosine similarity, a measure has its natural expression as a similarity rather than a dissimilarity measure; in most cases, however, the natural definition is as a dissimilarity measure. We use d when referring to a dissimilarity measure and s when referring to a similarity measure. The idea of calculating similarity as an additive inverse of distance comes from the idea of a metric. A metric on a set X is a mapping $d : X \times X \to \mathbb{R}$ such that for all $x, y, z \in X$ the following conditions hold: $d(x, y) \geq 0$; $d(x, y) = 0$ if and only if $x = y$; $d(x, y) = d(y, x)$; and $d(x, z) \leq d(x, y) + d(y, z)$. We use the broader term "measure" because in some cases what we use fails one or more of the conditions above. For example, the Kullback-Leibler divergence is not symmetric since, in general, $d(x, y) \neq d(y, x)$. Finally, when a dissimilarity measure is being considered, it should be understood that we use $1 - d(x, y)$ to calculate the similarity, where x and y are appropriately scaled so that $d(x, y) \in [0, 1]$.
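To make the ranking procedure concrete, here is a hedged sketch of the scheme just described: dissimilarities are scaled into [0, 1], converted to similarities via 1 − d, and averaged over the parts of a multi-image query. All names are illustrative, not taken from the authors' code.

```python
import numpy as np

def rank_database(query_vecs, db_vecs, dist):
    """Rank database images by average similarity to all query parts.
    `dist` is any plug-in dissimilarity d(x, y); per Sec. 3, similarity
    is taken as 1 - d(x, y) after d is scaled into [0, 1]."""
    d = np.array([[dist(q, v) for v in db_vecs] for q in query_vecs])
    if d.max() > 0:
        d = d / d.max()             # scale so d(x, y) lies in [0, 1]
    sim = (1.0 - d).mean(axis=0)    # average over multi-image query parts
    return np.argsort(-sim)         # database indices, most similar first
```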
3.1 Various Similarity Measures

The following lists the similarity and dissimilarity measures considered in our study. Let x denote the vector $(x_1, x_2, \ldots, x_n)$ representing the query image and y the vector $(y_1, y_2, \ldots, y_n)$ representing another image. Let $\bar{x}$ and $\bar{y}$ denote the means of the values in x and y, respectively. Further, let X and Y represent the cumulative distributions of x and y when they are considered as probability distributions ($\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i = 1$); that is, $X = (X_1, X_2, \ldots, X_n)$ where $X_j = \sum_{i=1}^{j} x_i$, and similarly for Y and y. Finally, $\mu = (\mu_1, \ldots, \mu_n)$ is the mean vector $\mu = \frac{x + y}{2}$.

– Minkowski and Standard Measures
  – Euclidean Distance ($L_2$): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  – Cityblock Distance ($L_1$): $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
  – Infinity Distance ($L_\infty$): $d(x, y) = \max_{i=1}^{n} |x_i - y_i|$
  – Cosine Similarity (CO): $s(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
– Statistical Measures
  – Pearson Correlation Coefficient (CC): $d(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
  – Chi-Square Dissimilarity (CS): $d(x, y) = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\mu_i}$ [21]
– Divergence Measures
  – Kullback-Leibler Divergence (KL): $d(x, y) = \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i}$ [22]
  – Jeffrey Divergence (JF): $d(x, y) = \sum_{i=1}^{n} \left( x_i \log \frac{x_i}{\mu_i} + y_i \log \frac{y_i}{\mu_i} \right)$ [21]
  – Kolmogorov-Smirnov Divergence (KS): $d(x, y) = \max_{i=1}^{n} |X_i - Y_i|$ [23]
  – Cramer-von Mises Divergence (CvM): $d(x, y) = \sum_{i=1}^{n} (X_i - Y_i)^2$ [21]
– Other Measures
  – Earth Mover's Distance (EMD-$L_1$): $d(x, y) = \sum_{i=1}^{n} |X_i - Y_i|$ [24]¹
  – Diffusion Distance (DD): $d(x, y) = \sum_{j=1}^{\log_2 n} \sum_{i=1}^{n/2^j} z_i^{(j)}$, where $z^{(j)}$ is the j-times iteratively Gaussian-smoothed, then 2-downsampled vector representation of $z^{(0)} = |X - Y|$ [18].

¹ EMD for 1D features is equivalent to the Mallows distance [25].
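A minimal sketch of the diffusion distance, following the pyramid construction of Ling and Okada [18]: the difference vector is repeatedly Gaussian-smoothed and downsampled by 2, accumulating the L1 norm at each level. The 3-tap kernel and the inclusion of the base level are our assumptions; [18] gives the full details.

```python
import numpy as np

def diffusion_distance(x, y):
    """Diffusion distance: sum, over pyramid levels, of the L1 norm of
    the iteratively smoothed and 2-downsampled difference vector."""
    d = x - y                                    # base-level difference vector
    kernel = np.array([0.25, 0.5, 0.25])         # assumed small Gaussian filter
    total = np.abs(d).sum()                      # level 0 (assumed included)
    while len(d) >= len(kernel):
        d = np.convolve(d, kernel, mode="same")  # Gaussian smoothing
        d = d[::2]                               # downsample by 2
        total += np.abs(d).sum()
    return total
```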
3.2 Metric Learning

Metric learning [26] is the process of using information about the similarity and/or dissimilarity of some dataset X to learn a mapping to a new space $Y = A^{1/2} X$ in which similar data are closer together and dissimilar data are farther apart. Let $\lambda$ denote an n-dimensional vector in which $\lambda_i$ determines the weight given to the i-th variable. With such a $\lambda$ we can define a weighted $L_2$ metric on X such that for each x and y in X the distance between them is

$$d_\lambda(x, y) = \sqrt{\sum_{i=1}^{n} \lambda_i (x_i - y_i)^2}.$$

The idea of metric learning is to learn the appropriate weights $\lambda$ from a training dataset. A less strict formulation allows the weights to be described by a non-diagonal symmetric positive semi-definite matrix A, with $\lambda = \mathrm{diag}(A)$, leading to a more general Mahalanobis-type metric:

$$d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)} \qquad (1)$$

Many algorithms [26–28] have been used to learn such a metric, with Yang [29] giving a nice summary. We employ a widely used algorithm called Information-Theoretic Metric Learning (hereafter ITML) [27]. ITML uses an information-theoretic cost model which iteratively enforces similarity/dissimilarity constraints; the input is a list of such pairwise constraints and the output is a learned matrix A. An equivalent and more computationally efficient formulation is to use the $L_2$ metric on the data after applying the transformation $X \mapsto A^{1/2} X$. In this study, we employ the diagonal form of A for simplicity, with similarity/dissimilarity information obtained from the 2011 ImageCLEF dataset as our training data.

3.3 Query Filtering

We used the modality classification results made available by ImageCLEF to filter out certain image types which are likely to be irrelevant to all queries. Table 1 indicates the filtering performed. In short, we included all and only diagnostic images.

Table 1: Filtering of modality types

| Included Modalities | Modalities Filtered Out |
|---|---|
| Ultrasound | Compound or multipane images |
| Magnetic Resonance | Tables and forms |
| Computerized Tomography | Program listings |
| X-Ray, 2D Radiography | Statistical figures, graphs, charts |
| Angiography | Screenshots |
| PET | Flowcharts |
| Combined modalities in one image | System overviews |
| Dermatology, skin | Gene sequences |
| Endoscopy | Chromatography |
| Other organs | Chemical structures |
| Electroencephalography | Mathematics, formulae |
| Electrocardiography | Non-clinical photos |
| Electromyography | Hand-drawn sketches |
| Light microscopy | |
| Electron microscopy | |
| Transmission microscopy | |
| Fluorescence microscopy | |
| 3D reconstructions | |

4 Experimental Results

Using the relevance judgments from 2011 ImageCLEF, we validated our proposed system. Table 2 shows the mean average precision (hereafter MAP) scores for various permutations of our system components, computed using the relevance judgment file from the 2011 results. We used this table to select our best potential measure/transformation combinations for the 2012 ImageCLEF competition. In the end we submitted the following seven runs to the 2012 ImageCLEF medical retrieval competition:

1. $L_1$ on the untransformed data (reg_cityblock)
2. DD on the untransformed data (reg_diffusion)
3. $L_2$ on the Tf-Idf(PCA) transformed data (tfidf_of_pca_euclidean)
4. CO on the Tf-Idf(PCA) transformed data (tfidf_of_pca_cosine)
5. CC on the Tf-Idf(PCA) transformed data (tfidf_of_pca_correlation)
6. $L_1$ on the ITML data (itml_cityblock)
7. DD on the ITML data (itml_diffusion)

These selected runs are shown in bold in Table 2.

Table 2: Results of the similarity measure comparison using the MAP score with 2011 ImageCLEF data. Columns denote data transformations. $PC_M$: codebook constructed using the first M principal components. PC: all principal components. Tf-Idf(PC) is the Tf-Idf transformation of the PC-transformed data; Tf-Idf is the data under the Tf-Idf transformation; and PC(Tf-Idf) is the PC transformation of the Tf-Idf-transformed data. ITML is $A^{1/2} X$, where A is a metric learned from similarity/dissimilarity information about X. The seven submitted runs are shown in bold.

| Measure | None | $PC_{75}$ | $PC_{200}$ | $PC_{500}$ | PC | Tf-Idf(PC) | Tf-Idf | PC(Tf-Idf) | ITML |
|---|---|---|---|---|---|---|---|---|---|
| $L_2$ | 0.0169 | 0.0207 | 0.0168 | 0.0194 | 0.0203 | **0.0208** | 0.0157 | 0.0172 | 0.0126 |
| $L_1$ | **0.0214** | 0.0183 | 0.0091 | 0.0196 | 0.0182 | 0.0180 | 0.0207 | 0.0180 | **0.0227** |
| $L_\infty$ | 0.0029 | 0.0032 | 0.0011 | 0.0012 | 0.0029 | 0.0016 | 0.0034 | 0.0097 | 0.0023 |
| CO | 0.0169 | 0.0207 | 0.0168 | 0.0194 | 0.0203 | **0.0208** | 0.0157 | 0.0173 | 0.0126 |
| CC | 0.0184 | 0.0207 | 0.0168 | 0.0194 | 0.0203 | **0.0209** | 0.0201 | 0.0172 | 0.0185 |
| CS | 0.0133 | 0 | 0 | 0 | 0 | 0 | 0.0163 | 0 | 0 |
| KL | 0.0004 | 0 | 0 | 0 | 0 | 0 | 0.0004 | 0 | 0 |
| JF | 0 | 0 | 0 | 0 | 0 | 0 | 0.0008 | 0 | 0 |
| KS | 0.0010 | 0.0176 | 0.0003 | 0.0020 | 0.0107 | 0.0176 | 0.0008 | 0.0008 | 0.0005 |
| CvM | 0.0011 | 0.0047 | 0.0017 | 0.0014 | 0.0091 | 0.0104 | 0.0009 | 0.0008 | 0.0006 |
| EMD-$L_1$ | 0.0011 | 0.0031 | 0.0016 | 0.0014 | 0.0089 | 0.0098 | 0.0009 | 0.0006 | 0.0006 |
| DD | **0.0214** | 0.0183 | 0.0091 | 0.0196 | 0.0140 | 0.0137 | 0.0207 | 0.0177 | **0.0227** |

Submissions to the ImageCLEF medical retrieval task [30, 31] are text files containing a ranked list of at most 1000 images for each of the competition queries, along with information such as the rank, query number and score. These submission files are constructed in the TREC-style submission format [32]; a sketch follows below.
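The submission files themselves are simple to produce. The sketch below writes one run in the common trec_eval-style format (query id, the literal "Q0", document id, rank, score, run tag); the exact column conventions expected by ImageCLEF should be checked against the task guidelines, and all names here are illustrative.

```python
def write_trec_run(results, run_tag, path):
    """Write ranked retrieval results in TREC-style format [32]:
    one line per retrieved image, at most 1000 per query.

    results: dict mapping query_id -> list of (image_id, score),
             sorted most-similar first.
    """
    with open(path, "w") as f:
        for query_id, ranked in results.items():
            for rank, (image_id, score) in enumerate(ranked[:1000], start=1):
                f.write(f"{query_id} Q0 {image_id} {rank} {score:.6f} {run_tag}\n")

# e.g. write_trec_run({1: [("img_0042", 0.93), ("img_0007", 0.88)]},
#                     "reg_diffusion", "reg_diffusion.txt")
```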
Our validation results with the 2011 dataset indicate that both ITML and the diffusion distance are promising choices for the ad-hoc image-based retrieval task for medical images. Based on this result, we entered seven runs (combinations of the three top-performing measures with different feature transformations). The results were disappointing: all of our runs placed last in this category, with very low MAP scores, in this year's competition. The reasons may include a potentially suboptimal choice of our feature extraction/representation and of the query filtering employed. Investigating this, and re-running our study with a better base CBIR system, is important future work. Among our 2012 results, we observe that the diffusion and cityblock distances consistently performed best among our submitted runs, which indicates the virtue of distance measures based on the $L_1$ metric. The run with metric learning (ITML) placed last among our runs. This may indicate a significant change in data characteristics between the 2011 and 2012 data, which would naturally cause the reduced performance. Investigating the true advantage of the metric learning approach in M-CBIR remains another item of future work.

References

[1] H. Müller, P. Clough, T. Deselaers, and B. Caputo, eds., ImageCLEF: Experimental Evaluation in Visual Information Retrieval (The Information Retrieval Series). Springer, 1st ed., Aug. 2010.
[2] P. Clough, H. Müller, and M. Sanderson, "Seven Years of Image Retrieval Evaluation," in ImageCLEF (H. Müller, P. Clough, T. Deselaers, B. Caputo, and W. B. Croft, eds.), vol. 32 of The Information Retrieval Series, Springer Berlin Heidelberg, 2010.
[3] H. Müller and J. Kalpathy-Cramer, "The Medical Image Retrieval Task," in ImageCLEF (H. Müller, P. Clough, T. Deselaers, B. Caputo, and W. B. Croft, eds.), vol. 32 of The Information Retrieval Series, Springer Berlin Heidelberg, 2010.
[4] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Trans. Pattern Anal. and Machine Intell., vol. 22, no. 12, 2000.
[5] H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler, "A review of content-based image retrieval systems in medical applications—clinical benefits and future directions," Intl. J. Medical Informatics, vol. 73, no. 1, 2004.
[6] M. Rahman, T. Wang, and B. C. Desai, "Medical image retrieval and registration: towards computer assisted diagnostic approach," in Proc. IDEAS Workshop on Medical Information Systems: The Digital Hospital, 2004.
[7] T. Deserno, S. Antani, and R. Long, "Ontology of Gaps in Content-Based Image Retrieval," Journal of Digital Imaging, vol. 22, 2009.
[8] T. M. Lehmann, B. B. Wein, J. Dahmen, J. Bredno, F. Vogelsang, and M. Kohnen, "Content-based image retrieval in medical applications: a novel multistep approach," in SPIE (M. M. Yeung, B.-L. Yeo, and C. A. Bouman, eds.), vol. 3972, 1999.
[9] A. Marchiori, C. Brodley, J. Dy, C. Pavlopoulou, A. Kak, L. Broderick, and A. M. Aisen, "CBIR for medical images - an evaluation trial," in IEEE Workshop on Content-based Access of Image and Video Libraries, 2001.
[10] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 1999.
[11] D. G. Lowe, "Distinctive Image Features from Scale-invariant Keypoints," Int. J. Computer Vision, vol. 60, 2004.
[12] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," in Proc. European Conf. Computer Vision (A. Leonardis, H. Bischof, and A. Pinz, eds.), vol. 3951 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2006.
[13] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, 2008.
[14] T. S. Lee, "Image representation using 2D Gabor wavelets," IEEE Trans. Pattern Anal. and Machine Intell., vol. 18, 1996.
[15] O. Pele and M. Werman, "The Quadratic-Chi Histogram Distance Family," in Proc. European Conf. Computer Vision (K. Daniilidis, P. Maragos, and N. Paragios, eds.), vol. 6312 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2010.
[16] Y. Rubner, C. Tomasi, and L. J. Guibas, "A metric for distributions with applications to image databases," in Proc. IEEE Int. Conf. Computer Vision, 1998.
[17] J. Puzicha, J. M. Buhmann, Y. Rubner, and C. Tomasi, "Empirical evaluation of dissimilarity measures for color and texture," in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 1999.
[18] H. Ling and K. Okada, "Diffusion Distance for Histogram Comparison," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2006.
[19] U. Avni, J. Goldberger, and H. Greenspan, "Medical image classification at Tel Aviv and Bar Ilan Universities," in ImageCLEF (H. Müller, P. Clough, T. Deselaers, B. Caputo, and W. B. Croft, eds.), vol. 32 of The Information Retrieval Series, Springer Berlin Heidelberg, 2010.
[20] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley, 2nd ed., Nov. 2000.
[21] J. Puzicha, T. Hofmann, and J. M. Buhmann, "Non-parametric similarity measures for unsupervised texture segmentation and image retrieval," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997.
[22] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, 1996.
[23] D. Geman, S. Geman, C. Graffigne, and P. Dong, "Boundary detection by constrained optimization," IEEE Trans. Pattern Anal. and Machine Intell., vol. 12, no. 7, 1990.
[24] H. Ling and K. Okada, "An Efficient Earth Mover's Distance Algorithm for Robust Histogram Comparison," IEEE Trans. Pattern Anal. and Machine Intell., vol. 29, no. 5, 2007.
[25] E. Levina and P. Bickel, "The Earth Mover's distance is the Mallows distance: some insights from statistics," in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 2001.
[26] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information," in Advances in Neural Information Processing Systems, vol. 15, 2003.
[27] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. Intl. Conf. Machine Learning, (New York, NY, USA), ACM, 2007.
[28] B. McFee and G. Lanckriet, "Metric Learning to Rank," in Proc. Intl. Conf. Machine Learning, 2010.
[29] L. Yang and R. Jin, "Distance Metric Learning: A Comprehensive Survey," tech. rep., Department of Computer Science and Engineering, Michigan State University, 2006.
[30] H. Müller, A. G. S. de Herrera, J. Kalpathy-Cramer, D. Demner-Fushman, S. Antani, and I. Eggel, "Overview of the ImageCLEF 2012 medical image retrieval and classification tasks," CLEF 2012 working notes, Sept. 2012.
[31] J. Kalpathy-Cramer, S. Bedrick, and W. Hersh, "Relevance Judgments for Image Retrieval Evaluation," in ImageCLEF (H. Müller, P. Clough, T. Deselaers, B. Caputo, and W. B. Croft, eds.), vol. 32 of The Information Retrieval Series, Springer Berlin Heidelberg, 2010.
[32] N. Stokes, "TREC: Experiment and Evaluation in Information Retrieval," Computational Linguistics, vol. 32, pp. 563–567, Nov. 2006.