<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>and B. Sch¨olkopf. Large scale multiple kernel learning.
Journal of Machine Learning Research</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fraunhofer FIRST's Submission to ImageCLEF2009 Photo Annotation Task: Non-sparse Multiple Kernel Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Binder</string-name>
          <email>alexander.binder@first.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Motoaki Kawanabe</string-name>
          <email>motoaki.kawanabe@first.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute FIRST</institution>
          ,
          <addr-line>Kekul ́estr. 7, 12489 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Measurement</institution>
          ,
          <addr-line>Performance, Experimentation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2006</year>
      </pub-date>
      <abstract>
        <p>In order to achieve good performance in image annotation tasks, it is necessary to combine information from various image features. In our submission, we applied the nonsparse multiple kernel learning for feature combination proposed by Kloft et al.(2009) to the ImageCLEF2009 photo annotation data. Since some of the concepts of the ImageCLEF task are rather abstract, we conjectured that color histograms are informative for some categories such as sky and snow. Therefore we tried pyramid histograms of pixel colors. Since the images are not aligned, we sorted histograms at different places, when computing similarity of two images. Short description of our methods will be presented and obtained results will be discussed in this manuscript.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Recent research results show that combining information from various image features is inevitable
to achieve good performance in image annotation tasks. With the support vector machine (SVM),
this is implemented by mixing kernels (similarities between images) constructed from different
image descriptors with appropriate weights. For instance, the average kernel with uniform weights
or the optimal kernel trained by multiple kernel learning (MKL) have been used so far. Recently,
Kloft et al.(2009) proposed the non-sparse MKL with Lp-regularizer, which bridges the
abovementioned two extremes, i.e. the average kernel SVM and the standard MKL. The non-sparse
MKL is successfully applied to object classification tasks; it could outperform the two baseline
methods by optimizing the tuning parameter p ≥ 1 through cross validation. In our submission,
we applied the novel technique for feature combination to ImageCLEF2009 photo annotation data
based on bag-of-words [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] over SIFT features which is similar to a subset of the Pascal VOC 2008
winners [
        <xref ref-type="bibr" rid="ref3">10, 3</xref>
        ] submission. Furthermore, we tested sorted pyramid kernel with color histograms
which will be explained in Section 2.1.1.
We computed SIFT features [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ] on a dense grid of step size six over the color channels red,
green, blue, normalized red, normalized green, normalized blue, opponent color 1, opponent color2,
normalized opponent color 1,normalized opponent color 2, grey. We used radii 4,8,12,16 for the
SIFT feature supports.
      </p>
      <p>
        For the bag of words models we combined the color channels into the following five sets: grey,
red-green-blue, normalized red-green-blue, opponent color 1 - opponent color2. The bag of words
prototypes have been obtained using kmeans clustering with 800000 SIFT features randomly
sampled from 3000 files. The number of prototypes was fixed to 4000 for each of the five channel
combinations. Finally we computed the bag of words for image tilings 1x1, 2x2 and 3x1 in the
sense of spatial pyrmaid levels [
        <xref ref-type="bibr" rid="ref1 ref5">5, 1</xref>
        ]. This results in 12 features of which we applied in each
learning setting at most eleven due to 32 bit memory restrictions. Furthermore we employed
pyramid histograms over color intensities features based on the color channels grey, opponent
color 1,opponent color 2, green and blue. We chose the number of bins per pyramid cell to be ten.
2.1.1
      </p>
      <p>Modifications of pyramid histograms over color intensities
In order to improve the information content of the spatial pyramids for pyramid histograms over
color intensities (PHoCol), we sorted the cells within a pyramid level according to the slant of
the histogram. This reflects the notion that an image patch having high intensities in a certain
colour channel results in an intensity histogram slanted towards higher intensities. The slant was
determined via the entropy of the accumulated histogram of a histogram:
ea(h)i =</p>
      <p>X
k≤i
hk, a(h)i = P
ea(h)i
k ea(h)k
, sl(h) = X −a(h)i ln(a(h)i)
i
In addition to that we included the difference between sort costs in the kernel. Algebraically a
sort cost is just a mapping of the sort permutation of a pyramid level into the real numbers. In
our specific case we resorted to
sc(π) = C X max(v(π(k)) − v(k), 0)</p>
      <p>k
where v(i) is the height of the cell with index i with C being chosen so that the sort cost is
upperbounded by one and lies at a similar scale compared to the χ2-distance. The intuition
behind such a sort cost is to punish large changes of the vertical position of an image patch
associated to a pyramid cell caused by sorting compared to the unsorted pyramid. This sort cost
is obviously invariant to changes in horizontal position of pyramid cells.</p>
      <p>The sort cost modified kernel is a product of the χ2-Kernel and a gaussian kernel over the
difference of the sortcosts between two features</p>
      <p>k(x, y) = exp(−σ(dχ2 (hx, hy) + (scx − scy)2))
. While it would be straightforward to replace this fixed weight kernel modification by an linearized
MKL-approach, we omitted it due to 32 bit memory restrictions. We also experimented with an
alternative formulation</p>
      <p>k(x, y) = exp(−σ(dχ2 (hx, hy)(1 + (scx − scy)2))
for which the kernel property is not proven which however turned out to be nonnegative definite.</p>
      <p>In the end we computed these features using tilings 4x4, no sorting, 3x1, sorted with sort cost,
16x16, sorted without sort cost over color channels noted above. While we do not regard such
features as good standalone features, we realized their speed and potential value for contributing
to some classifiers of the 53 classes in a proper MKL-procedure. We had low structured concepts
with high variance in mind like sky, water, portrait, daytime.
2.2</p>
    </sec>
    <sec id="sec-2">
      <title>Kernels 2.3</title>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We used the χ2-Kernel with the exception of pyramid histograms over color intensities. All kernels
have been normalized.</p>
      <p>
        We used support vector machines with average kernels, sparse L1-MKL, and the recently developed
general non-sparse Lp-MKL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The regularization parameter was fixed to one.
      </p>
      <p>We submitted one single kernel result, namely phows over red-green-blue with 1x1 tiling as
an early fallback, furthermore we submitted for each binary classifier the single best support
vector learning algorithm measured by AP/AUC and FRR rate score over ten-fold crossvalidation
(Lp-joint) using which generates three additional submissions</p>
      <p>A) eleven bag of words kernels, namely red-green-blue, normalized red-green-blue and grey, all
tilings, and opponent color 1 - opponent color2, 1x1 and 3x1 tiling.</p>
      <p>B) nine bag of words kernels, namely red-green-blue, normalized red-green-blue and grey, all
tilings,
C) 10 kernels, namely bag of words red-green-blue, all tilings, grey 1x1 tiling, PHoCol over
green and blue channels 4x4,3x1,16x16 tilings (various sort strategies)
D) eleven kernels, namely bag of words red-green-blue and grey, all tilings, opponent color 1
opponent color2 and normalized opponent color 1 - opponent color2, both with 1x1 tiling,
PHoCol over grey channels 4x4,3x1,16x16 tilings (various sort strategies)
E) 10 kernels, namely bag of words red-green-blue, all tilings, grey 1x1 tiling, PHoCol over
opponent color 1 and 2 channels 4x4,3x1,16x16 tilings (various sort strategies)
We used for the standard SVM and the Lp-MKL code from the shogun toolbox [9, 8].</p>
      <p>The experiments were conducted with C++-code on an Opteron-Cluster running on a 32 bit
Linux which implied a hard limit for memory usage of 3GB. On average we used 20 CPUs.
2.4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>On average combinations of only Bag of words features perform best. There exists classes for
which adding the low level color descriptors improves performance. Since each binary classifier
can be selected separately based on crossvalidation error adding the color features improves the
final submission. By comparing the results from A and B versus C, D, and E one can clearly
observe overfitting. In our point of view it comes from noisy feature extraction in connection with
small sample sizes for many of the smaller concept classes. We doubt that the applied features
permit a Bayes Error close to zero. In that sense there is still a need for better feature extraction
despite the success story of Bag of Words representations.</p>
      <p>Below are AP and AUC scores over 5-fold crossvalidation, the result of the best method is
marked bold. For multiclass subproblems we computed for each method the average score over
the subproblem and marked the subproblem in total bold.
This work was supported in part by Federal Ministry of Economics and Technology of Germany
under the project THESEUS (01MQ07018).</p>
      <p>Gool,
PASCAL
[10] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object
and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in
press), 2010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Anna</given-names>
            <surname>Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , and
          <article-title>Xavier Mun˜oz. Representing shape with a spatial pyramid kernel</article-title>
          .
          <source>In Proceedings of the 6th ACM international conference on Image and video retrieval (CIVR '07)</source>
          , pages
          <fpage>401</fpage>
          -
          <lpage>408</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          .
          <article-title>Visual categorization with bags of keypoints</article-title>
          .
          <source>In Workshop on Statistical Learning in Computer Vision</source>
          , ECCV, pages
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          , Prague, Czech Republic, May
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Everingham</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Van A. Zisserman.</surname>
          </string-name>
          <article-title>The (VOC2008) Results</article-title>
          . voc2008/workshop/index.html,
          <year>2008</year>
          . C.
          <string-name>
            <surname>K. I. Williams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Winn</surname>
          </string-name>
          , and
          <source>Visual Object Classes Challenge</source>
          <year>2008</year>
          http://www.pascal-network.org/challenges/VOC/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Marius</given-names>
            <surname>Kloft</surname>
          </string-name>
          , Ulf Brefeld, Pavel Laskov, and S¨oren Sonnenburg.
          <article-title>Non-sparse multiple kernel learning</article-title>
          ,
          <source>2008. Abstract of the NIPS workshop on Kernel Learning: Automatic Selection of Optimal Kernels.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          , Cordelia Schmid, and
          <string-name>
            <given-names>Jean</given-names>
            <surname>Ponce</surname>
          </string-name>
          .
          <article-title>Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories</article-title>
          .
          <source>In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06)</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
          , New York, USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>David G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ):
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>