<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SZTAKI @ ImageCLEF 2012 Photo Annotation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Bálint Daróczy, Dávid Siklósi, András A. Benczúr</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Mining and Web Search Group, Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences</institution>
          ,
          <addr-line>MTA SZTAKI</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our approach to the ImageCLEF 2012 Photo Annotation task. We used both visual and textual modalities for all submissions. We described each image with a fixed-length representation based on different similarity measures, which allowed us to combine a large variety of descriptors before classification to improve classification quality. This descriptor is a combination of several visual and textual similarity values between the actual image and a reference image set containing well-selected training images. We trained Gaussian Mixture Models (GMM) to define a generative model for low-level descriptors extracted from the training set using Harris-Laplace point detection. We used two descriptors, a grayscale gradient based and a color moment based one. To measure the visual similarity between two images, we extracted several dense Fisher vectors per image. Besides calculating visual features, we adopted a biclustering method to cluster the Flickr tags and the images at the same time. Additionally, we measured the similarity of images according to their Flickr tags using the Jensen-Shannon divergence.</p>
      </abstract>
      <kwd-group>
        <kwd>image classification</kwd>
        <kwd>biclustering</kwd>
        <kwd>generative models</kwd>
        <kwd>kernel methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Visual feature extraction</title>
      <p>
        The GMM based Fisher gradient vector computed from SIFT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] descriptors is a well-known
technique for representing an image with a single vector per pooling [
        <xref ref-type="bibr" rid="ref13 ref15 ref18">13, 15, 18</xref>
        ].
      </p>
      <p>
        We used low-level patch feature vectors to describe the visual content of an image, with
approximately 15k descriptors per image per modality. Our sampling strategy included a dense grid
and Harris-Laplace point detection [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To avoid extracting too large a number of local features, we
downscaled all images while preserving the aspect ratio; the maximal width and height were set to 500
pixels. We calculated HOG (Histogram of Oriented Gradients [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) and RGB color descriptors for
each patch using 16x16 and 48x48 pixel macroblock sizes. Both descriptors were L2 normalized,
and for HOG we reduced the dimensionality to 96 by Principal Component Analysis (PCA). The PCA
model was trained on a small sample of patches extracted from the training images.
⋆ This work is supported in part by the EC FET Open project “New tools and algorithms for directed
network analysis” (NADINE No 288956) and OTKA CNK 77782.
      </p>
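      <p>As a toy illustration of this step (not the actual extraction code of the system), the following Python sketch computes a crude HOG-like orientation histogram per macroblock, L2-normalizes it, and reduces its dimensionality with a PCA model trained on the extracted patches. The random image, the 8-bin histogram, and the 4-dimensional projection are made-up stand-ins for the real 16x16/48x48 macroblocks and the 96-dimensional PCA.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an input image (the real pipeline downscaled
# images to at most 500 pixels per side before dense sampling).
image = rng.random((64, 64))

def patch_descriptors(img, size=16, stride=16, bins=8):
    """Crude HOG-like descriptor: one orientation histogram per patch."""
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                     # orientations in [-pi, pi]
    descs = []
    for y in range(0, img.shape[0] - size + 1, stride):
        for x in range(0, img.shape[1] - size + 1, stride):
            m = mag[y:y+size, x:x+size].ravel()
            a = ang[y:y+size, x:x+size].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(-np.pi, np.pi), weights=m)
            hist /= np.linalg.norm(hist) + 1e-12  # L2 normalization, as in the paper
            descs.append(hist)
    return np.array(descs)

X = patch_descriptors(image)                      # (n_patches, bins)

# PCA trained on the sampled patches: center, then project onto the
# top components (the paper used 96 dimensions on real HOG descriptors).
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 4
X_reduced = (X - mean) @ Vt[:k].T
```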
      <p>
        To build an efficient soft codebook, we trained a Gaussian Mixture Model (GMM) for both
descriptors. The training procedure of the GMM models took about 20 minutes using 3 million training
points per descriptor. We used our open-source CUDA GMM implementation. Our training method
was based on a standard Expectation Maximization algorithm with a non-hierarchical structure.
We avoided the well-known vulnerability of the EM algorithm to underflow by the way we computed the
conditional probabilities [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The resulting implementation is an accurate yet fast CUDA based code
optimized for fp32 architectures. Our source code, along with previously trained GMM models for
different patch descriptors and code for Fisher vector calculation, is available free for research use
at https://dms.sztaki.hu/hu/projekt/gaussian-mixture-modeling-gmm-es-fisher-vector-toolkit.
      </p>
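      <p>A minimal sketch of the soft-codebook idea, using scikit-learn's EM-based GaussianMixture in place of the CUDA implementation described above, and a tiny synthetic descriptor set in place of the 3 million training points. Each descriptor receives a posterior (a "soft assignment") over the mixture components.</p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for the patch descriptors: two well-separated 4-D clusters.
X = np.vstack([rng.normal(0, 1, (500, 4)),
               rng.normal(5, 1, (500, 4))])

# A flat (non-hierarchical) EM-trained GMM acts as the soft codebook.
gmm = GaussianMixture(n_components=2, covariance_type='diag',
                      random_state=0).fit(X)

# Soft assignments: one posterior distribution over components per descriptor.
posteriors = gmm.predict_proba(X)   # shape (n_descriptors, K); rows sum to 1
```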
      <p>The final high-level, dataset independent representation of an image was the normalized Fisher
gradient vector. We also calculated a separate Fisher vector on the Harris-Laplace detected
corner descriptors. Since our GMM implementation was able to compute all the conditional
probabilities for each feature vector without significant loss of time, the result was a strongly dense
Fisher vector even in fp32.</p>
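      <p>The Fisher gradient vector can be sketched as follows; this is a simplified stand-in (gradients with respect to the GMM means only, with toy parameters) rather than the toolkit's actual code. It also illustrates the standard max-subtraction trick that keeps the posterior computation from underflowing, in the spirit of the EM safeguard mentioned above.</p>

```python
import numpy as np

def fisher_vector_means(X, w, mu, var):
    """L2-normalized Fisher vector (mean gradients only) for a diagonal GMM.

    X: (n, D) local descriptors of one image; w: (K,) mixture weights;
    mu: (K, D) component means; var: (K, D) diagonal variances.
    """
    # Log-likelihoods per component, then posteriors gamma(k|x).
    log_p = (-0.5 * (((X[:, None, :] - mu[None]) ** 2) / var[None]).sum(-1)
             - 0.5 * np.log(2 * np.pi * var).sum(-1)[None]
             + np.log(w)[None])
    log_p -= log_p.max(axis=1, keepdims=True)   # subtract max: avoids underflow
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Gradient with respect to the means: one dense vector per image.
    n = X.shape[0]
    G = (gamma[:, :, None] * (X[:, None, :] - mu[None]) / np.sqrt(var)[None]).sum(0)
    G /= n * np.sqrt(w)[:, None]
    fv = G.ravel()
    return fv / (np.linalg.norm(fv) + 1e-12)

# Toy GMM and descriptors (hypothetical, for illustration only).
rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
var = np.ones((2, 2))
fv = fisher_vector_means(rng.normal(0, 1, (100, 2)), w, mu, var)
```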
    </sec>
    <sec id="sec-3">
      <title>Biclustering Flickr tags and image similarity</title>
      <p>
        In our previous experiment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the Jensen-Shannon divergence of Flickr tags proved to be an excellent image
similarity measure. Our goal was to extend it by determining deeper interrelations between the
tags and the documents using content based similarity.
      </p>
      <p>
        The applied biclustering was an extension of Dhillon’s information theoretic co-clustering
algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In contrast to the original algorithm, we measured document similarity with a
combination of visual and textual similarity values. We chose the Jensen-Shannon divergence instead
of the Kullback-Leibler divergence used in the original article. Our choice was inspired by our experience with
other datasets, where the Jensen-Shannon divergence resulted in significantly better clustering quality
than Kullback-Leibler [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In order to refine the clustering with non-textual information, we
added a similarity measure based on the best performing visual features (both HOG and color
Fisher vectors pooled on different partitions such as the whole image, only the detected corner
points, and a 3x1 spatial resolution).
      </p>
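      <p>A small sketch of the tag-based similarity, assuming hypothetical tag-count vectors over a shared vocabulary. With a base-2 logarithm the Jensen-Shannon divergence is symmetric and bounded in [0, 1], and it is small for images with similar tag distributions.</p>

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    p = np.asarray(p, float); q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()          # normalize counts to distributions
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                          # 0 * log(0) contributes nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical tag-count vectors over a shared tag vocabulary,
# e.g. counts of the tags ("sky", "beach", "cat", "sunset").
img_a = [5, 3, 0, 1]
img_b = [4, 4, 0, 2]
img_c = [0, 0, 9, 0]
```

Images with overlapping tag usage (img_a, img_b) end up far closer under this measure than images with disjoint tags (img_a, img_c).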
    </sec>
    <sec id="sec-4">
      <title>Combination of representations</title>
      <p>
        Efficient combination of different feature sets based on a wide range of visual modalities is one
of the main problems of image classification. This problem becomes more complex if we have
additional non-visual features such as Flickr tags. Our starting point was a widely used technique:
learning SVM models on textual and visual Bag-of-Words models [
        <xref ref-type="bibr" rid="ref12 ref20 ref5">5, 12, 20</xref>
        ]. The selection of the
ideal kernel depends on both the original feature space and the class variable, which makes the
selection procedure computationally expensive. The dual form of the optimization problem of the
standard Support Vector Machine (SVM) classification [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with kernel K(xi, xj) is the following:
      </p>
      <p>Maximize L_Dual(α) = Σ_{i=1..m} αi - (1/2) Σ_{i=1..m} Σ_{j=1..m} αi αj yi yj K(xi, xj)   (1)</p>
      <p>subject to Σi yi αi = 0, with αi ≥ 0 for all i.</p>
      <sec id="sec-4-2">
        <p>Having multiple kernels due to the representations via different modalities, with
previously selected kernel functions, we can modify the dual form into a multiple kernel learning
problem:</p>
        <p>Maximize L_Dual(α, β) = Σ_{i=1..m} αi - (1/2) Σ_{i=1..m} Σ_{j=1..m} αi αj yi yj Σ_{n=1..N} βn Kn(xi, xj)   (2)</p>
        <p>subject to Σi yi αi = 0 with αi ≥ 0 for all i, where N is the number of basic kernels and
Kn(xi, xj) is the nth kernel function.</p>
        <p>
          The above problem is a special case of the Multiple Kernel Learning problems where the
kernels are computed on different feature sets. In comparison to SimpleMKL [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we assumed that all
representations are conducive to the training procedure. Rakotomamonjy et al. suggested solving the MKL problem
with an iterative and sparse learning method where in each iteration a standard SVM
dual problem is solved and the weights of the basic kernels are updated. One of the drawbacks of this solution
is the increased computational time.
        </p>
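        <p>As a sketch of the alternative to iterative MKL weight learning, the following Python fragment combines two precomputed kernels with fixed β weights and trains a standard SVM on the combined kernel. The modalities, weights, and toy data are illustrative assumptions, not the submitted system.</p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two hypothetical modalities (say, a visual and a textual representation).
X_vis = rng.normal(0, 1, (40, 5))
X_txt = rng.normal(0, 1, (40, 3))
y = np.array([0] * 20 + [1] * 20)
X_vis[y == 1] += 2.0                  # make the visual modality informative

def linear_kernel(A, B):
    return A @ B.T

# Fixed beta weights in place of iteratively learned MKL weights.
beta = [0.7, 0.3]
K = beta[0] * linear_kernel(X_vis, X_vis) + beta[1] * linear_kernel(X_txt, X_txt)

# A single standard SVM on the combined precomputed kernel.
clf = SVC(kernel='precomputed').fit(K, y)
train_acc = clf.score(K, y)
```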
        <p>
          To avoid the computationally expensive MKL problem, we used a feature transformation
method. Distance from the training set, as a feature transform for classification, is a well-known
technique. Schölkopf et al. showed that a class of kernels can be represented as norm-based
distances in Hilbert spaces [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and Ah-Pine et al. applied an L1-norm based feature transformation,
measuring the distance from the Fisher vectors of the training set, for image classification with
excellent results [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>We defined a dense representation combining modality adaptive, similarity based feature
transforms. Let us consider a set of documents D (we call it the reference set) and their corresponding
representations Dr. We defined the uniform representation of a document X over the set of
representations R of the reference set D as</p>
        <p>LR(X, D) = [ Σ_{r=1..R} βr simr(Xr, Dr1), ..., Σ_{r=1..R} βr simr(Xr, Drd) ]   (3)</p>
        <p>where Σr βr = 1, simr denotes the selected similarity measure on basic representation r, and d
is the size of the reference set. The dimensionality of this representation is the cardinality of the
reference set.</p>
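        <p>A minimal sketch of this uniform representation, with hypothetical modality names ('hog', 'tags'), cosine similarity standing in for the selected simr measures, and a tiny random reference set. Each output coordinate is the β-weighted sum of per-modality similarities to one reference document.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality representations of a reference set of d documents.
d = 6
ref = {'hog': rng.random((d, 8)), 'tags': rng.random((d, 4))}

def cosine_sim(x, Y):
    """Cosine similarity of vector x to every row of Y."""
    return (Y @ x) / (np.linalg.norm(Y, axis=1) * np.linalg.norm(x) + 1e-12)

def uniform_representation(doc, ref, beta):
    """L_R(X, D): beta-weighted sum of per-modality similarities, one
    coordinate per reference document (dimension = |reference set|)."""
    out = np.zeros(d)
    for r, Y in ref.items():
        out += beta[r] * cosine_sim(doc[r], Y)
    return out

doc = {'hog': rng.random(8), 'tags': rng.random(4)}
beta = {'hog': 0.6, 'tags': 0.4}       # weights sum to 1, as required
L = uniform_representation(doc, ref, beta)
```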
        <sec id="sec-4-2-1">
          <title>Reference set selection and weight determination</title>
          <p>The proper selection of the reference set can significantly decrease the demanding computational
time of solving the standard dual problem. More precisely, we seek the minimal set of
documents that does not significantly affect the quality of the learning procedure.</p>
          <p>To determine the reference set, we defined a ranking over the images according to their
annotations: the rarer a concept, the higher the score of its positive instances. We cut the list
where the selected training documents contain at least a specified quantity of positive samples for all
categories. We set the minimal amount of positive samples to p ∗ N, where N is the number of training
images. If a category did not have the minimal amount of positive instances, all of its samples were
included. The resulting subset of training images using p = 0.01 contained only 6260 images out of
the original 15k training images. Since the dimension of the combined representation equals
the number of images in the reference set, this selection reduced the dimension by more than 50%.</p>
          <p>
            To identify the weight vector β of the basic representations per class, we sampled the training
set. We used 5k images for training and 5k images for validation. We trained binary SVM classifiers
using the LibSVM package [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] separately for each representation and used grid search to find the
optimal linear combination per class.
          </p>
          <p>
            All of our submissions used both visual and textual features. The main differences were the number
of training images used for classification and the size of the reference set. All the runs included
the following basic representations: HOG based Fisher vectors (1x1, 3x1, Harris-Laplace), color
moment based Fisher vectors (1x1, 3x1, Harris-Laplace), and Jensen-Shannon divergence using
Flickr tags as probability distributions (Table 5).
          </p>
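          <p>The per-class weight search can be sketched as a grid search over β vectors summing to 1, scored on held-out validation predictions. The three simulated "classifiers", the 0.1 grid step, and the simple ranking-quality proxy are illustrative assumptions, not the actual evaluation measure used.</p>

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation-set scores of three per-representation classifiers
# (e.g. HOG Fisher, color Fisher, Flickr-tag based) for one class.
n = 300
y_val = (rng.random(n) < 0.3).astype(int)
preds = np.stack([y_val + rng.normal(0, s, n) for s in (0.5, 0.8, 1.5)])

def ranking_quality(scores, y):
    """Simple proxy metric: fraction of positives among the top-k ranked items."""
    k = y.sum()
    top = np.argsort(-scores)[:k]
    return y[top].mean()

# Grid search over weight vectors summing to 1 (step 0.1).
best, best_w = -1.0, None
steps = np.arange(0, 1.01, 0.1)
for w1, w2 in product(steps, steps):
    if w1 + w2 > 1:
        continue
    w = np.array([w1, w2, 1 - w1 - w2])
    score = ranking_quality(w @ preds, y_val)
    if score > best:
        best, best_w = score, w
```

By construction the grid contains the corner vectors [1,0,0], [0,1,0], and [0,0,1], so the best combination is never worse than any single representation on the validation set.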
        </sec>
        <sec id="sec-4-2-2">
          <title>Experiments on the validation set</title>
          <p>In order to determine the parameters of the combined representation, we experimented on the
basic features using a subset of the training set. It can be seen in Table 2 that the color moment
and HOG descriptors complement each other. Although the average number of keypoints detected
by Harris-Laplace was considerably lower than for the rest of the poolings (on average 2k vs. 15k
descriptors per image), we measured only small performance differences between them. For Flickr tags
we tested three methods (Table 3). We selected the top 25,000 Flickr tags as vocabulary. The
refined biclustering using visual similarity and Jensen-Shannon divergence outperformed both plain
Jensen-Shannon divergence and the purely tag based biclustering. We experimented with the parameter
p for proper reference set selection over the best combined representation, including all visual
similarity values and the Jensen-Shannon divergence. It can be seen in Table 1 that the performance
loss was negligible even when using less than half of the features. However, if we kept only 11.9% of the training
set as reference set, the performance dropped significantly.</p>
          <p>In jch10ksep we used the ranked reference set with 6260 images and an annotation category
based weighting scheme for the combination (19 different weight vectors). We trained binary SVM
classifiers per class using a reduced training set containing only 10k images.</p>
          <p>In addition to jch10ksep, in jchb10ksep we added a refined biclustering representation with 2k
clusters to the common representations. Notice that with biclustering, the dimension of the
representation was by far the lowest of all (Table 4).</p>
          <p>Our best performing run jchf r15k used the entire training set as reference set, and the binary
SVM models were trained on the whole training set (15k) per class. The adopted weight vector β
was the same for each class. It is worth mentioning that we experienced an increase in computational
time in comparison to jch10ksep or jchb10ksep. The reason for the nonlinear increase is that in
jchf r15k we used the whole training set as reference set and the binary SVM classifiers were
trained on the entire training set.</p>
          <p>Our second best performing run jchaggsep was a combination of jchf r15k and jch10ksep
(Table 6): we simply averaged the predictions of the two runs. In jchbicwelf we aggregated the
output of the biclustering based classifier and jchf r15k using a linear combination learned
previously on the training-validation set.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Our approach for the ImageCLEF 2012 Photo Annotation task employed various representations of
the images based on different visual and textual modalities. We extracted several Fisher vectors
using a grayscale and a color patch descriptor. We applied a biclustering method to cluster the
images and their Flickr tags. We combined the different descriptors and representations before
classification. This combination procedure included a transformation, a feature aggregation, and a
selection step.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Ah-Pine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cifarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          , G. Csurka, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Renders</surname>
          </string-name>
          . XRCEs Participation to
          <article-title>ImageCLEF 2008</article-title>
          .
          <source>In Working Notes of the 2008 CLEF Workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. E. Bodzsár,
          <string-name>
            <surname>B.</surname>
          </string-name>
          <article-title>Daróczy, I. Petrás, and András A. Benczúr. GMM based fisher vector calculation on GPGPU</article-title>
          . http://datamining.sztaki.hu/?q=en/GPU-GMM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>C.-C. Chang</surname>
            and
            <given-names>C.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Libsvm: a library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>20</volume>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Willamowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bray</surname>
          </string-name>
          .
          <article-title>Visual categorization with bags of keypoints</article-title>
          .
          <source>In Workshop on Statistical Learning in Computer Vision</source>
          , ECCV, volume
          <volume>1</volume>
          , page 22.
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Navneet</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>CVPR</source>
          <year>2005</year>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Dar</surname>
          </string-name>
          óczy, A. Benczúr, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Pethes</surname>
          </string-name>
          . SZTAKI at ImageCLEF
          <year>2011</year>
          .
          <source>In Working notes of CLEF</source>
          <year>2011</year>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>I.S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mallela</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.S.</given-names>
            <surname>Modha</surname>
          </string-name>
          .
          <article-title>Information-theoretic co-clustering</article-title>
          .
          <source>Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>C.</given-names>
            <surname>Harris</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Stephens</surname>
          </string-name>
          .
          <article-title>A combined corner and edge detector</article-title>
          .
          <source>Alvey Vision Conference</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>151</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>D.G.</given-names>
            <surname>Lowe.</surname>
          </string-name>
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mikolajczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schaffalitzky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kadir</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          .
          <article-title>A comparison of affine region detectors</article-title>
          .
          <source>IJCV.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Nowak</surname>
          </string-name>
          .
          <article-title>New Strategies for Image Annotation: Overview of the Photo Annotation Task at ImageCLEF 2010</article-title>
          .
          <article-title>In Cross Language Evaluation Forum</article-title>
          ,
          <source>ImageCLEF Workshop</source>
          ,
          <year>2010</year>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          .
          <article-title>Fisher kernels on visual vocabularies for image categorization</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2007</year>
          . CVPR'
          <volume>07</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A.</given-names>
            <surname>Rakotomamonjy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Canu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Grandvalet</surname>
          </string-name>
          . Simplemkl.
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>C. Schmid S. Lazebnik</surname>
            and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ponce</surname>
          </string-name>
          .
          <article-title>Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision</source>
          and Pattern Recognition, New York,
          <year>June 2006</year>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bernard</surname>
          </string-name>
          <article-title>Schölkopf. The kernel trick for distances</article-title>
          . pages
          <fpage>301</fpage>
          -
          <lpage>307</lpage>
          . MIT Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <article-title>Dávid Siklósi, Bálint Daróczy, and András A. Benczúr. Content-based trust and bias classification via biclustering</article-title>
          .
          <source>In Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality</source>
          ,
          <source>WebQuality '12</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. G.
          <string-name>
            <surname>Csurka</surname>
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Sánchez</surname>
            , and
            <given-names>J. Verbeek.</given-names>
          </string-name>
          <article-title>LEAR and XRCEs participation to Visual Concept Detection Task at ImageCLEF 2010</article-title>
          .
          <source>In Working Notes for the CLEF 2010 Workshop</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Bart</given-names>
            <surname>Thomee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>Overview of the imageclef 2012 flickr photo annotation and retrieval task</article-title>
          .
          <source>CLEF 2012 working notes</source>
          , Rome, Italy,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>K. E. A. van de Sande</surname>
            , T. Gevers, and
            <given-names>C. G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Snoek</surname>
          </string-name>
          .
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>32</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>