<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The joint submission of the TU Berlin and Fraunhofer FIRST (TUBFI) to the ImageCLEF2011 Photo Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Binder</string-name>
          <email>alexander.binder@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wojciech Samek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marius Kloft</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christina Müller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus-Robert Müller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Motoaki Kawanabe</string-name>
          <email>motoaki.kawanabe@first.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute FIRST</institution>
          ,
          <addr-line>Kekuléstr. 7, 12489 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Machine Learning Group, Berlin Institute of Technology (TU Berlin)</institution>
          ,
          <addr-line>Franklinstr. 28/29, 10587, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present details on the joint submission of TU Berlin and Fraunhofer FIRST to the ImageCLEF 2011 Photo Annotation Task. We sought to experiment with extensions of Bag-of-Words (BoW) models at several levels and to apply several kernel-based learning methods recently developed in our group. For classifier training we used non-sparse multiple kernel learning (MKL) and an efficient multi-task learning (MTL) heuristic based on MKL over kernels from classifier outputs. For the multi-modal fusion we used a smoothing method on tag-based features inspired by Bag-of-Words soft mappings and Markov random walks. We submitted one multi-modal run extended by the user tags and four purely visual runs based on Bag-of-Words models. Our best visual result which used the MTL method was ranked first according to mean average precision (MAP) within the purely visual submissions. Our multi-modal submission achieved the first rank by MAP among the multi-modal submissions and the best MAP among all submissions. Submissions by other groups such as BPACAD, CAEN, UvA-ISIS, LIRIS were ranked closely.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Photo Annotation</kwd>
        <kwd>Image Classification</kwd>
        <kwd>Bag-of-Words</kwd>
        <kwd>Multi-Task Learning</kwd>
        <kwd>Multiple Kernel Learning</kwd>
        <kwd>THESEUS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Our goals were to experiment with extensions of Bag-of-Words (BoW) models at
several levels and to combine them with several kernel-based learning methods recently
developed in our group while working within the THESEUS project. For this purpose
we generated a submission to the annotation task of the ImageCLEF2011 Photo
Annotation Challenge [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This task required the annotation of 10000 images in the
provided test corpus according to 99 pre-defined categories. Note that this year's
ImageCLEF photo task additionally offers another challenging competition [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
a concept-based retrieval task. In the following we focus on the
annotation task over the 10000 images. The ImageCLEF photo corpus is challenging
due to the heterogeneity of its classes. It contains classes based on concrete tangible
objects such as female, cat and vehicle as well as more abstractly defined classes such
as technical, boring or Esthetic Impression. Our visual submission and our
multi-modal submission both achieved first rank by the MAP measure among the purely
visual and multi-modal submissions, respectively. We describe our methods in a
concise manner here.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Bag-of-Words Features</title>
      <p>
        All our submissions were based on discriminatively trained classifiers over kernels
using BoW features. The BoW feature pipeline can be decomposed into the following
steps: generating sampling regions, computing local features, mapping local features
onto visual words. The coarse layout of our approach is influenced by the works of the
Xerox group on Bag-of-Words in Computer Vision [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the challenge submissions by
INRIA groups [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the works on color descriptors by the University of
Amsterdam [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Following these works, we computed for each set of parameters three BoW features
based on the regular spatial tilings 1×1, 2×2 and 3×1 (vertical × horizontal). Preliminary
experiments with an additional 3×3 spatial tiling showed only minor performance
gains. Furthermore, we used vectors of quantile estimators alongside the established SIFT
feature [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as local features. Table 1 shows the computed BoW features. Information
about the sampling method is given in Section 2.1. We used the color channel combinations
red-green-blue (RGB), grey (Gr), grey-opponentcolor1-opponentcolor2 (Opp in Table
1) and a grey-value normalized version of the last combination (N-Opp in Table 1). The
total number of kernels is large; however, their computation is a largely automated task
which requires little human intervention.
      </p>
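      <p>The spatial tilings above can be illustrated with a minimal sketch. It assumes hard assignment of local features to visual words and represents each local feature as a hypothetical (x, y, word) triple; the function name <monospace>tiled_bow</monospace> and the toy data are illustrative and not taken from our pipeline.</p>
      <preformat><![CDATA[
```python
def tiled_bow(points, width, height, rows, cols, vocab_size):
    """Concatenate one hard-assignment BoW histogram per spatial tile.

    points: iterable of (x, y, word) with 0 <= x < width, 0 <= y < height.
    rows, cols: number of tiles along the vertical and horizontal axis.
    """
    hist = [0.0] * (rows * cols * vocab_size)
    for x, y, word in points:
        r = min(int(y * rows / height), rows - 1)  # tile row of the feature
        c = min(int(x * cols / width), cols - 1)   # tile column of the feature
        hist[(r * cols + c) * vocab_size + word] += 1.0
    return hist


# Toy image of 60x40 pixels, three local features, vocabulary of two words.
pts = [(5, 5, 0), (50, 5, 1), (50, 35, 1)]
tilings = [(1, 1), (2, 2), (3, 1)]  # (vertical, horizontal)
features = {t: tiled_bow(pts, 60, 40, t[0], t[1], 2) for t in tilings}
```
]]></preformat>
      <p>Each tiling yields its own BoW feature, so one parameter set produces three feature vectors of length 2, 8 and 6 in this toy setting.</p>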
      <p>In this year's submission, we incorporated the new extensions described
in Sections 2.1 and 2.2 into our BoW modeling.</p>
      <sec id="sec-2-1">
        <title>Extensions on Sampling level</title>
        <p>
          In addition to BoW features created from known grid sampling we tried biased random
sampling [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. In contrast to [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] we resorted to probability maps computed from edge
detectors. Such sampling approaches offer two potential advantages over Harris-Laplace
detectors. Firstly, we get keypoints located on edges rather than corners. A motivating
example can be seen in Figure 1 – the bridge contains corner points but the essential
structures are lines. Similar examples are smooth borders of buildings, borders between
mountains and sky, or simply a circular structure.
        </p>
        <p>
          Secondly, we adjusted the number of local features to be extracted per image as
a function of the image size instead of using the typical corner detection thresholds.
The reference is the number of local features extracted by grid sampling, in our case with a grid spacing of 6
pixels. This comes from the idea that some images can be smoother in general.
Furthermore [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] showed that too sparse sampling of local features leads to reduced
classification performance. The opposite extreme end of this is documented in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] where
quite large improvements from sampling at every pixel are reported. As a consequence
we can tune the trade-off between computational cost and performance compared to
the dense sampling baseline. In practice we chose to extract approximately half as
many local features using biased random sampling. We tried four detectors:
– bias3 was a simplified version of an attention-based detector [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. However, this
detector requires scale parameters to be set. The highly varying scales of motifs in the
images make it difficult to find a globally optimal set of scales without expensive
optimizations. This inspired us to try detectors which depend less on scale
parameters:
– bias1 computes an average of gradient responses over pixel-wise images of the
following color channels: grey, red minus green, green minus blue and blue minus
red.
– bias2 is like bias1 except for dropping the grey channel. Thus it will fail on grey
images but detects strong local color variations. On the other hand, such differences
between RGB color channels are more prominent in bright regions. This allows
features over normalized color channels to be used more safely on color images.
– bias4 takes the same set of color channels as the underlying SIFT descriptor and
computes the entropy of the gradient orientation histogram on the same scale as
the SIFT descriptor. Regions with low entropy are preferred in the probability map
used for biased random sampling. This detector is adapted closely to the SIFT
feature. The question behind this detector is whether the focus on peaky low-entropy
histograms constitutes an advantage.
        </p>
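        <p>Biased random sampling with a bias1-like probability map can be sketched as follows. This is a simplified illustration under several assumptions that are not stated above: gradients are forward differences, positions are drawn with replacement, and the image is a nested list of RGB tuples. All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import random


def gradient_magnitude(channel):
    """Forward-difference gradient magnitude of a 2-D list (zero at far borders)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = channel[y][x + 1] - channel[y][x] if x + 1 < w else 0.0
            dy = channel[y + 1][x] - channel[y][x] if y + 1 < h else 0.0
            out[y][x] = (dx * dx + dy * dy) ** 0.5
    return out


def bias1_map(rgb):
    """Average gradient response over grey, R-G, G-B and B-R channels."""
    h, w = len(rgb), len(rgb[0])
    channels = [
        [[sum(p) / 3.0 for p in row] for row in rgb],  # grey
        [[p[0] - p[1] for p in row] for row in rgb],   # red minus green
        [[p[1] - p[2] for p in row] for row in rgb],   # green minus blue
        [[p[2] - p[0] for p in row] for row in rgb],   # blue minus red
    ]
    grads = [gradient_magnitude(c) for c in channels]
    return [[sum(g[y][x] for g in grads) / len(grads) for x in range(w)]
            for y in range(h)]


def sample_keypoints(prob_map, n, seed=0):
    """Draw n (x, y) positions with probability proportional to prob_map."""
    rng = random.Random(seed)
    coords = [(x, y) for y, row in enumerate(prob_map) for x in range(len(row))]
    weights = [v for row in prob_map for v in row]
    return rng.choices(coords, weights=weights, k=n)


# Toy image: black left half, white right half, i.e. one vertical edge.
image = [[(0, 0, 0)] * 3 + [(255, 255, 255)] * 3 for _ in range(4)]
keypoints = sample_keypoints(bias1_map(image), n=12)
```
]]></preformat>
        <p>In this toy image all sampled positions fall on the column next to the edge, since the map is zero elsewhere. Dropping the grey channel from <monospace>bias1_map</monospace> gives a bias2-like map, which would be zero everywhere for this grey image.</p>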
      </sec>
      <sec id="sec-2-2">
        <title>Extensions on Bag-of-Words Mapping Level</title>
        <p>
          As we used k-means for generating a set of visual words, the usual approach to generate
soft BoW mappings [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] which is adapted to radius-based clustering and relies on one
global width parameter may become inappropriate when the density of clusters varies
strongly in the space of local features. K-means results in clusters of varying size
depending on the local density of the local features. To resolve this issue we resorted to
rank-based BoW mapping where the vote of a local feature is 2.4 raised to the power of
the negative rank. Let RK<sub>d</sub>(l) be the rank of the distance between the local feature l and
the visual word corresponding to BoW dimension d, with distances sorted in increasing order. Then
the BoW mapping m<sub>d</sub> for dimension d is defined as:
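          <disp-formula>
            <label>(1)</label>
            <tex-math><![CDATA[m_d(l) = \begin{cases} 2.4^{-RK_d(l)} & \text{if } RK_d(l) \le 8 \\ 0 & \text{else.} \end{cases}]]></tex-math>
          </disp-formula>
        </p>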
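        <p>The rank-based mapping of Eq. (1) can be written in a few lines. The sketch below assumes that ranks start at 1 for the nearest visual word, which is not stated explicitly above; the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
def rank_based_mapping(distances, base=2.4, max_rank=8):
    """Soft BoW votes of one local feature from its distances to all visual words."""
    order = sorted(range(len(distances)), key=lambda d: distances[d])
    votes = [0.0] * len(distances)
    for rank, d in enumerate(order, start=1):  # rank 1 = nearest visual word
        if rank > max_rank:
            break  # words beyond the rank threshold receive no vote
        votes[d] = base ** (-rank)
    return votes


# Distances of one local feature to a toy vocabulary of three visual words.
votes = rank_based_mapping([0.5, 0.1, 0.9])
```
]]></preformat>
        <p>The votes depend only on the ordering of the distances, not on their magnitude, so no global width parameter is needed.</p>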
        <p>Initially we performed experiments with several alternative soft mappings. In
short, these experiments revealed that the soft mapping weights must decay sufficiently fast
as a function of the distance of a local feature to distant
visual words in order to perform better than simple hard mapping.</p>
        <p>
          Our second attempt after using the mapping from [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] was to introduce a cutoff
constant K. Only distances below rank K + 1 are considered. Let V be a visual vocabulary,
w<sub>d</sub> the visual word from it corresponding to BoW feature dimension d, and l a local
feature. Then the cut-off mapping is given by:
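          <disp-formula>
            <label>(2)</label>
            <tex-math><![CDATA[m_d(l) = \begin{cases} \dfrac{\exp(-\sigma_{w_d}\,\mathrm{dist}(l, w_d))}{\sum_{v \in V \,\mid\, \mathrm{Rank}(\mathrm{dist}(l, v)) \le K} \exp(-\sigma_v\,\mathrm{dist}(l, v))} & \text{if } \mathrm{Rank}(\mathrm{dist}(l, w_d)) \le K \\ 0 & \text{otherwise} \end{cases}]]></tex-math>
          </disp-formula>
where the width parameter σ was estimated for each visual word locally as the inverse
of quantile estimators of distances to all local features from an image which had w<sub>d</sub> as
the nearest visual word.</p>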
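        <p>A minimal sketch of the cut-off mapping of Eq. (2) follows. The quantile estimator in <monospace>local_width</monospace> is a simple nearest-rank quantile chosen for illustration; the estimator and the quantile level actually used are not specified here, and both function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def local_width(assigned_distances, q=0.5):
    """Inverse of the q-quantile of distances of features assigned to one word."""
    s = sorted(assigned_distances)
    return 1.0 / s[min(int(q * len(s)), len(s) - 1)]  # assumes a non-zero quantile


def cutoff_mapping(distances, widths, K):
    """Soft votes of one local feature, restricted to its K nearest visual words.

    distances[d]: distance of the local feature to visual word d.
    widths[d]: locally estimated width parameter of visual word d.
    """
    order = sorted(range(len(distances)), key=lambda d: distances[d])
    nearest = order[:K]
    weights = {d: math.exp(-widths[d] * distances[d]) for d in nearest}
    total = sum(weights.values())
    return [weights.get(d, 0.0) / total for d in range(len(distances))]


# Three visual words, equal widths, only the two nearest words receive votes.
votes = cutoff_mapping([0.5, 0.1, 0.9], [2.0, 2.0, 2.0], K=2)
```
]]></preformat>
        <p>Small quantiles give large widths and hence a fast decay of the votes with distance.</p>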
        <p>This experiment led to the conclusion that quantiles leading to large values of the width parameter σ
and thus fast decay of weights yielded better performance.</p>
        <p>Note that the rank-based voting ensures exponential drop-off per se.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Kernels</title>
        <p>We used χ<sup>2</sup> kernels. The width was set to be the mean of the inner distances.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Used Resources</title>
        <p>For feature and kernel computations we resorted to a cluster of 40 core units, mostly AMD
Opteron 275 with up to 2.4 GHz, which according to cpubenchmark.net had
a speed rank of 134 in August 2011. The OS was 32-bit, which limited the usable memory
during feature computation, in particular during visual word generation, to 3
GByte.</p>
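        <p>The χ<sup>2</sup> kernel with a width equal to the mean distance, as used for all BoW features, can be sketched as follows. The exact normalisation of the χ<sup>2</sup> distance (for example a factor of one half) and the set of pairs entering the mean are assumptions of this sketch; the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_kernel_matrix(features):
    """Kernel exp(-d / width) with width set to the mean pairwise distance."""
    n = len(features)
    dist = [[chi2_distance(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    # Assumption: the mean is taken over all off-diagonal pairs.
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    width = sum(off_diag) / len(off_diag)
    return [[math.exp(-dist[i][j] / width) for j in range(n)] for i in range(n)]


# Three toy BoW histograms.
kernel = chi2_kernel_matrix([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```
]]></preformat>
        <p>Setting the width from the data in this way removes one parameter per kernel from the model selection.</p>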
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Heuristically down-scaled non-sparse Multiple Kernel Learning</title>
      <p>
        Due to limited resources on a 64 bit cluster which we employed for classifier
training we decided to try out a down-scaled version of MKL based on 25 kernels which
are the averages of the 75 kernels over the spatial tilings. Instead of evaluating many
pairs of sparsity parameters and regularization constants the idea was to run non-sparse
MKL [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] once for each class with merely one sparsity parameter tuned towards low
kernel weight regularization (p = 1.2) and one choice of the regularization constant
tuned towards high SVM regularization (C = 0.1). The obtained kernel weights can
afterwards be used in SVMs with fixed-weighted kernels, combined with several weaker SVM
regularizations and with powers applied to the kernel weights to simulate higher sparsity. This
consumes substantially less memory and in practice allows more cores to be used in
parallel. For each class one can choose via cross-validation the optimal regularization and
power on the initially obtained MKL weights.
      </p>
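      <p>The second stage of this heuristic, reusing the MKL weights with a sparsifying power in a fixed-weight kernel combination, can be sketched as follows. The renormalisation of the powered weights to sum to one is an assumption of this sketch, and the SVM training and cross-validation over powers and regularization constants are omitted; the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def sharpen_weights(weights, power):
    """Raise MKL kernel weights to a power and renormalise them to sum to one."""
    raised = [w ** power for w in weights]
    total = sum(raised)
    return [r / total for r in raised]


def combine_kernels(kernels, weights):
    """Weighted sum of equally sized kernel matrices given as nested lists."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]


# Toy MKL weights for three kernels; a power above one concentrates the mass.
sharp = sharpen_weights([0.5, 0.3, 0.2], 2.0)

# Fixed-weight combination of two toy 2x2 kernels.
combined = combine_kernels(
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]], [0.75, 0.25]
)
```
]]></preformat>
      <p>The combined kernel is then passed to a standard single-kernel SVM, so each candidate pair of power and regularization constant costs one SVM training instead of one MKL run.</p>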
    </sec>
    <sec id="sec-4">
      <title>Output Kernel based MKL/MTL</title>
      <p>
        Considering the set of semantic concepts in the ImageCLEF photo corpus, one can expect
weak relations between many of them. Some of them can be established
deterministically: season labels such as Spring necessarily require the photo to be an outdoor
shot. Others might be present in a statistical sense: photos showing Park Garden tend
to be rather calm instead of active, although the latter is possible. The extent of
activity might depend on the dataset. The total number of concepts is, however, prohibitive
for manual modeling of all relations. One principled approach for exploiting such
relations is multi-task learning [
        <xref ref-type="bibr" rid="ref19 ref4">4, 19</xref>
        ] which attempts to transfer information between
concepts. Classical Multi-Task Learning (MTL) has two shortcomings: firstly, it often
scales poorly with the number of concepts and samples. Secondly, kernel-based MTL
leads to symmetric solutions, which implies that poorly recognized concepts can spoil
classification rates of better performing classes. The work in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] tackles both problems.
Firstly, it formulates a decomposable approximation which can be solved as a set of separate
MKL problems. Thus it shares the scalability limits of MKL approaches. Secondly,
the formulation as an approximation permits asymmetric information transfer between
classes. The approach uses kernels computed from SVM predictions for the
information transfer. We picked 12 concepts under the constraints that they be general concepts and
have rather high MAP values under cross-validation for the kernels (animals, food,
no persons, outdoor, indoor, building sights, landscape nature, single person, sky,
water, sea, trees) and combined them with the average kernel which was used in the
TUBFI 1 submission as inputs for the non-sparse MKL algorithm, resulting in 13
kernels. Owing to a lack of computation time, we applied here the same
down-scaled MKL strategy as for the BoW kernels alone described in Section 3 with MKL
regularization parameter p = 1.125 and SVM regularization constant C = 0.75,
however without applying sparsifying powers on the kernel weights.
      </p>
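      <p>A kernel computed from the predictions of one concept classifier can be sketched as below. The Gaussian form and the choice of width are assumptions made for illustration; the precise construction is given in the cited work and may differ. The function name and the toy scores are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def output_kernel(scores, width=None):
    """Gaussian kernel on the real-valued outputs of one concept classifier."""
    n = len(scores)
    sq = [[(scores[i] - scores[j]) ** 2 for j in range(n)] for i in range(n)]
    if width is None:
        # Assumption: mean squared output difference as width (needs varying scores).
        off = [sq[i][j] for i in range(n) for j in range(n) if i != j]
        width = sum(off) / len(off)
    return [[math.exp(-sq[i][j] / width) for j in range(n)] for i in range(n)]


# Toy SVM outputs of two auxiliary concept classifiers on three images.
concept_scores = {"outdoor": [1.2, 0.9, -1.0], "sky": [0.3, 1.1, -0.8]}
output_kernels = {c: output_kernel(s) for c, s in concept_scores.items()}
```
]]></preformat>
      <p>Images that receive similar outputs for an auxiliary concept become similar under that kernel. Because MKL learns the weights of these kernels separately for each target class, the information transfer between classes can be asymmetric.</p>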
    </sec>
    <sec id="sec-5">
      <title>Smoothing textual Bag-of-Words</title>
      <p>
        In the field of image classification it is known that soft mappings improve the
performance of BoW features substantially [
        <xref ref-type="bibr" rid="ref20 ref5">5, 20</xref>
        ]. The soft mapping relies on a notion of
distance between the local features. Similar approaches have been also used for Fisher
vectors [
        <xref ref-type="bibr" rid="ref12 ref15">12, 15</xref>
        ] where the non-sparsity of features does not require a distance. For tag
based BoW features one can derive analogously a notion of similarity without
resorting to external sources via co-occurrence. We applied for our multi-modal submission
TUBFI 3 the method from [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which uses the derived similarity to achieve a soft
mapping for textual bags of words. The set of words for the textual BoW was selected by choosing
the 0.2% most frequent tags as in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Experiments on the cross-validated training set
confirmed performance improvements using smoothed tags over unsmoothed tags.
      </p>
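      <p>The idea of smoothing tag features through co-occurrence can be illustrated with a simple random-walk step. This is a loose sketch and not the exact procedure of the cited method: the row-normalised co-occurrence matrix, the number of steps and the function names are assumptions.</p>
      <preformat><![CDATA[
```python
def cooccurrence_transitions(tag_sets, vocab):
    """Row-normalised tag co-occurrence counts, used as one random-walk step."""
    index = {t: i for i, t in enumerate(vocab)}
    n = len(vocab)
    counts = [[0.0] * n for _ in range(n)]
    for tags in tag_sets:
        present = [index[t] for t in tags if t in index]
        for i in present:
            for j in present:
                counts[i][j] += 1.0
    return [[c / sum(row) if sum(row) > 0 else 0.0 for c in row] for row in counts]


def smooth_tags(tags, vocab, transitions, steps=1):
    """Propagate a binary tag vector along the co-occurrence random walk."""
    vec = [1.0 if t in tags else 0.0 for t in vocab]
    n = len(vocab)
    for _ in range(steps):
        vec = [sum(vec[i] * transitions[i][j] for i in range(n)) for j in range(n)]
    return vec


# Toy training tags of three images and a three-word tag vocabulary.
vocab = ["beach", "sea", "sky"]
train_tags = [{"beach", "sea"}, {"sea", "sky"}, {"beach", "sea", "sky"}]
P = cooccurrence_transitions(train_tags, vocab)
smoothed = smooth_tags({"beach"}, vocab, P)
```
]]></preformat>
      <p>An image tagged only with beach thus also receives non-zero entries for sea and sky, which plays the role of a soft mapping for textual BoW features without any external source of similarity.</p>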
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>
        For the detailed results of all submissions we refer to the overview given in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. A
small excerpt can be seen in Table 2.
      </p>
      <p>
        While optimization of complex measures, particularly hierarchically representable
ones is feasible [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we did not do any optimization for the example-based measures as
we were not fully aware of their structure. In particular, each classifier had a different
threshold due to a simple linear mapping of minimal and maximal outputs onto the
boundaries of the required interval [0, 1], which leaves the targeted MAP values unchanged.
This across-concept variation in the classifier threshold explains the limited results for
the example-based measures.
      </p>
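      <p>The effect of the linear output mapping can be checked on a toy example. The sketch assumes that the classifier's decision boundary lies at an output of 0 before the mapping and that at least one positive label is present; the function names and scores are hypothetical.</p>
      <preformat><![CDATA[
```python
def minmax(scores):
    """Map the minimal and maximal output linearly onto 0 and 1."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def average_precision(scores, labels):
    """Average precision of a ranking by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / k
    return total / hits


def mapped_zero(scores):
    """Position of the original decision boundary 0 after min-max mapping."""
    lo, hi = min(scores), max(scores)
    return (0.0 - lo) / (hi - lo)


scores, labels = [2.3, -0.7, 0.4, 1.1], [1, 0, 1, 0]
ap_raw = average_precision(scores, labels)
ap_mapped = average_precision(minmax(scores), labels)
thresholds = [mapped_zero([-1.0, 1.0]), mapped_zero([-5.0, 1.0])]
```
]]></preformat>
      <p>The mapping is strictly increasing, so the ranking and hence the average precision are identical before and after it. The decision boundary, however, lands at a different position in [0, 1] for classifiers with different output ranges, which affects measures that rely on a common threshold.</p>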
      <p>Considering the targeted MAP score, we can see in Table 2 that the purely textual
runs perform worst, although one can expect them to be very efficient when weighing time
consumption against the difference in ranking performance to visual ones. Multi-modal
approaches perform best with a considerable margin of 0.06 MAP (16% over the TUBFI
1 baseline) over visual ones, which indicates that the information in tags and images
is fairly non-redundant. The improvement over purely textual runs is substantial. When
considered as an absolute number, an MAP of 0.44 leaves much room for improvement.
When looking at AUC values, which allow better comparison between concepts, only
31 out of 99 classes had AUC values above 0.9 in our best submission.</p>
      <p>The first purely visual submission (TUBFI 1 in Table 2) was an average kernel SVM
over all sets but the second BoW feature set from Table 1. Its performance was almost
identical to the best submission of the CAEN group which, however, used a completely
different methodology, namely Fisher kernels. For all other submissions we selected
for each class separately the best classifier from a set of classifiers by MAP values
obtained in 12-fold cross-validation on the training data. The idea was to counter the
heterogeneity of concept classes in the ImageCLEF data with a bag of methods. However,
this mixture does not allow the impact of the separate methods to be judged precisely. Table
3 shows, for each submission, the number of classes using a particular method.
Selection was based on cross-validated MAP.</p>
      <p>The pool for the second purely visual submission (TUBFI 2 in Table 2) consisted of
average kernels computed over several combinations of the sets from Table 1.
Hypothesis testing using Wilcoxon's signed rank test on the cross-validation results showed no
improvement in MAP over the first purely visual submission. Nevertheless we submitted
it for the sake of scientific curiosity, since using average kernel SVMs over varying sets of
kernels is a computationally very efficient method. On the test data we observed a drop
in performance.</p>
      <p>The pool for the fourth purely visual submission (TUBFI 5 in Table 2) consisted of
the classifiers from the second submission combined with an MKL heuristic (see Section
3). Statistical testing revealed that a small improvement in MAP could be expected.
Indeed, the result on test data was marginally better than the first and the second
submission, even though it contained some classifiers from the flawed second submission and
used a heuristically down-scaled variant of MKL.</p>
      <p>The pool for the third and a posteriori best purely visual submission (TUBFI 4 in
Table 2) consisted of the classifiers from the second submission, the MKL heuristic and
the output kernel MKL/MTL (see Section 4). Statistical testing revealed more classes
with significant gains over the baseline (TUBFI 1 in Table 2). In 42 categories the
chosen classifier belonged to the output kernel MKL/MTL. This was
the only purely visual run which showed a larger improvement on test data over the baseline TUBFI
1. Therefore we attribute its gains to the influence of the output kernel MKL procedure.</p>
      <p>The only multi-modal submission (TUBFI 3 in Table 2) used kernels from the
baseline (TUBFI 1 in Table 2) combined with smoothed textual Bag-of-Words (see Section
5).</p>
      <p><bold>Acknowledgments</bold> We would like to thank Shinichi Nakajima, Roger Holst, Dominik Kuehne, Malte
Danzmann, Stefanie Nowak, Volker Tresp and Klaus-Robert Müller. This work was supported in
part by the Federal Ministry of Economics and Technology of Germany (BMWi) under the project
THESEUS (01MQ07018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Binder</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Müller,
          <string-name>
            <given-names>K.R.</given-names>
            ,
            <surname>Kawanabe</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>On taxonomies for multi-class image categorization</article-title>
          .
          <source>International Journal of Computer</source>
          Vision pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          (
          <year>January 2011</year>
          ), http://dx.doi.org/10.1007/s11263-010-0417-8
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Braschler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pianta</surname>
          </string-name>
          , E. (eds.):
          <article-title>CLEF 2010 LABs and Workshops</article-title>
          , Notebook Papers,
          <fpage>22</fpage>
          -
          <lpage>23</lpage>
          September 2010, Padua, Italy (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Csurka</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dance</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Visual categorization with bags of keypoints</article-title>
          .
          <source>In: Workshop on Statistical Learning in Computer Vision</source>
          , ECCV. pp.
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          . Prague, Czech Republic (May
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Evgeniou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Micchelli</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning multiple tasks with kernel methods</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>6</volume>
          ,
          <fpage>615</fpage>
          -
          <lpage>637</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. van Gemert,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Geusebroek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Veenman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Kernel codebooks for scene categorization</article-title>
          .
          <source>In: ECCV</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Guillaumin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Multimodal semi-supervised learning for image classification</article-title>
          .
          <source>In: Proc. of IEEE Int. Conf. on Comp. Vis. &amp; Pat. Rec. (CVPR '10)</source>
          . San Francisco, CA, USA (
          <year>2010</year>
          ), http://lear.inrialpes.fr/pubs/2010/GVS10
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Itti</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niebur</surname>
          </string-name>
          , E.:
          <article-title>A model of saliency-based visual attention for rapid scene analysis</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>20</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1254</fpage>
          -
          <lpage>1259</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kawanabe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Binder</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Müller,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Wojcikiewicz</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.:</surname>
          </string-name>
          <article-title>Multi-modal visual concept classification of images via markov random walk over tags</article-title>
          .
          <source>In: Applications of Computer Vision (WACV)</source>
          ,
          <source>2011 IEEE Workshop on</source>
          . pp.
          <fpage>396</fpage>
          -
          <lpage>401</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kloft</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brefeld</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sonnenburg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zien</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Lp-norm multiple kernel learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>953</fpage>
          -
          <lpage>997</lpage>
          (
          <year>Mar 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Marszalek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning representations for visual object class recognition</article-title>
          , http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/workshop/marszalek.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Csurka</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>LEAR and XRCE's participation to visual concept detection task - ImageCLEF 2010</article-title>
          . In: Braschler et al. [2]
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurie</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Triggs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Sampling strategies for bag-of-features image classification</article-title>
          . In:
          <string-name>
            <surname>Leonardis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bischof</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>ECCV (4). Lecture Notes in Computer Science</source>
          , vol.
          <volume>3954</volume>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>503</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liebetrau</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The CLEF 2011 photo annotation and concept-based retrieval tasks</article-title>
          . In:
          <source>CLEF 2011 working notes</source>
          . The Netherlands
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving the Fisher kernel for large-scale image classification</article-title>
          . In:
          <string-name>
            <surname>Daniilidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maragos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paragios</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>ECCV (4). Lecture Notes in Computer Science</source>
          , vol.
          <volume>6314</volume>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>156</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Samek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Binder</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawanabe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multi-task learning via non-sparse multiple kernel learning</article-title>
          . In:
          <source>CAIP</source>
          (
          <year>2011</year>
          ), accepted
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Trans. Pat. Anal. &amp; Mach. Intel</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The University of Amsterdam's concept detection system at ImageCLEF 2010</article-title>
          . In: Braschler et al. [2]
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sheldon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Graphical multi-task learning</article-title>
          . http://agbs.kyb.tuebingen.mpg.de/wikis/bg/siso2008/Sheldon.pdf (
          <year>2008</year>
          ), NIPS Workshop on Structured Input - Structured Output
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tahir</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uijlings</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kittler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SurreyUVA SRKDA method</article-title>
          . http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/workshop/tahir.pdf
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A biased sampling strategy for object categorization</article-title>
          . In:
          <source>ICCV</source>
          . pp.
          <fpage>1141</fpage>
          -
          <lpage>1148</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>