<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LEAR and XRCE's participation to Visual Concept Detection Task - ImageCLEF 2010</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Mensink</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Csurka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florent Perronnin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Sánchez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakob Verbeek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LEAR, INRIA Rhône-Alpes</institution>
          ,
          <addr-line>Montbonnot</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Xerox Research Centre Europe</institution>
          ,
          <addr-line>Meylan</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>In this paper we present the joint effort of LEAR and XRCE for the ImageCLEF Visual Concept Detection and Annotation Task. We first sought to combine our individual state-of-the-art approaches: the Fisher vector image representation and the TagProp method for image auto-annotation. Our second goal was to investigate the annotation performance obtained by using extra information in the form of the provided Flickr-tags. The results show that using the Flickr-tags in combination with visual features improves over any method using only visual features. Our winning system, an early-fusion linear-SVM classifier trained on visual and Flickr-tag features, obtains 45.5% mean Average Precision (mAP), almost a 5% absolute improvement over the best visual-only system. Our best visual-only system, a late-fusion linear-SVM classifier trained on two types of visual features (SIFT and colour), obtains 39.0% mAP and is close to the best visual-only system in the challenge. The performance of TagProp is close to that of our SVM classifiers. The methods presented in this paper all scale to large datasets and/or many concepts, thanks to the fast Fisher kernel (FK) framework for the image representation and to the choice of classifiers: the linear SVM has proven to scale well to large datasets, and the k-NN-based TagProp is interesting in this respect since it requires only two parameters per concept.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Classification</kwd>
        <kwd>Auto Annotation</kwd>
        <kwd>Multi-Modal</kwd>
        <kwd>Linear SVM</kwd>
        <kwd>Fisher Vectors</kwd>
        <kwd>TagProp</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In our participation to the ImageCLEF Visual Concept Detection and Annotation Task
(VCDT) we focused on two main aspects. First, we wanted to investigate the effect of
using the available modalities, visual (image) and textual (Flickr-tags), both at train and
test time. Our second goal was to compare some of our recent techniques that potentially
scale to large data sets with many concepts on the proposed task.</p>
      <p>
        The VCDT is a multi-label classification challenge on the MIR Flickr dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
It aims at automatic annotation of 10,000 test images with multiple concepts, learned
from 8,000 training images. The 93 concepts include abstract categories (like Partylife),
the time of day (like day or night), persons (like no person visible, small or big group)
and quality (like blurred or underexposed). For a complete overview of the challenge
see [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>This year’s challenge allowed the use of ‘multi-modal approaches that consider
visual information and/or Flickr user tags and/or EXIF information’. For all images in the
train and test set the original tag data of the Flickr users (further denoted as Flickr-tags)
was provided. The set of Flickr-tags contains over 53,000 different tags, from which
we use a subset of the most frequently occurring tags. For most of the photos the EXIF data was also
provided; however, we did not use this information in our experiments.</p>
      <p>In Fig. 1 an image from the database is shown, together with the Flickr-tags and the
annotation concepts. We see that the tags and annotation concepts are quite
complementary. While the Flickr-tags of an image may correspond to concepts which are not necessarily
visually perceptible (e.g. Australia), the image annotation system is interested in the
visual concepts (e.g. sky and clouds).</p>
      <p>Although the objective of a user tagging his images differs from that of a (visual)
keyword-based retrieval system, the Flickr-tags might offer useful information for the
annotation task. To analyse our first aspect we have used the Flickr-tags as the textual
representation of an image, and conducted experiments with systems using either both
modalities, or using only the visual modality. The results (see Section 4) show that
indeed the Flickr-tags are complementary to the visual information. All our systems using
both modalities outperform any of the visual only systems.</p>
      <p>Concerning the second aspect, even though the task was relatively small,
especially in the number of images, we tested methods that potentially scale to large
annotated data sets, e.g. up to hundreds of thousands of labelled images, and/or many
concepts. Hence, we used image representations and classifiers which are efficient both
in learning and in classifying. Efficiency includes (1) the cost of computing the
representations, (2) the cost of learning classifiers on these representations, and (3) the cost
of classifying a new image.</p>
      <p>
        As our image representation we use the Improved Fisher vectors [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], which are
based on the Fisher Kernel (FK) framework [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The Fisher vector extends the popular
bag-of-visual-words (BOV) histograms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], by not only including word counts, but also additional information about the
distribution of the descriptors. Due to the use of this additional information, the visual
code book in a FK approach can be much smaller than in the BOV approach. We use a
code book of only 256 words, while a size of several thousand is common in BOV
approaches. Since the size of the visual code book largely determines the computational
cost of the descriptor, this makes the FK a very fast descriptor.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Flickr-tags</title>
      <p>Boardwalk, Sunset, Wilsonsprom, Wilsonspromontory,
Victoria, Australia, 3kmseoftidalriver, Explore</p>
    </sec>
    <sec id="sec-3">
      <title>Annotation-concepts</title>
        <p>Landscape Nature, No Visual Season, Outdoor, Plants, Trees,
Sky, Clouds, Day, Neutral Illumination, No Blur, No Persons,
Overall Quality, Park Garden, Visual Arts, Natural.</p>
      <p>Fig. 1. Example image with Flickr user tags, and with the ground truth annotation concepts.</p>
      <p>
        On the classifier part, we compare a per-keyword-trained linear
Support Vector Machine (SVM) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to TagProp, a k-NN classifier with learned neighbourhood weights
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The training cost of a linear SVM is linear in the number of images [
        <xref ref-type="bibr" rid="ref15 ref7">7, 15</xref>
         ], so
SVMs can be efficiently learned from large quantities of images [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The advantage of
the k-NN classifier is that it requires only two parameters per keyword, so additional training
for a new keyword is very fast. For both classifiers we have used the same
image and text representations, which allows a fair comparison of the results of the two
methods.
      </p>
      <p>
        Note that these representations and methods have shown state-of-the-art
performance [
        <xref ref-type="bibr" rid="ref13 ref14 ref4">4, 13, 14</xref>
         ] on different tasks and several publicly available databases. However,
they were not necessarily compared or combined. The ImageCLEF VCDT challenge
gave us a good opportunity to do this.
      </p>
      <p>The rest of the paper is organized as follows. In Section 2 we describe the FK
framework and the recent improvements on Fisher vectors. In Section 3 we give an
overview of our TagProp method. Then in Section 4 we present in more detail the
experiments we did, the submitted runs and the obtained results. Finally, we conclude
the paper in Section 5.</p>
      <p>
        2 Visual Features - the Improved Fisher Vector</p>
      <p>
        As image representation, we use the Improved Fisher vector [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. The Fisher vector
is an extension of the bag-of-visual-words (BOV) representation: instead of
characterizing an image with the number of occurrences of each visual word, it characterizes the
image with a gradient vector derived from a generative probabilistic model. The
gradient of the log-likelihood describes the contribution of the parameters to the generation
process.
      </p>
      <p>
        We assume that the local descriptors $X = \{x_t,\ t = 1, \ldots, T\}$ of an image are
generated by a Gaussian mixture model (GMM) $u_\lambda$ with parameters $\lambda$. $X$ can be described
by the gradient vector [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
      </p>
      <p>
        $G^X_\lambda = \frac{1}{T} \nabla_\lambda \log u_\lambda(X). \qquad (1)$
A natural kernel on these gradients uses the Fisher information matrix $F_\lambda$ [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
$K(X, Y) = {G^X_\lambda}' F_\lambda^{-1} G^Y_\lambda, \qquad (2)$
with $F_\lambda = E_{x \sim u_\lambda}\left[\nabla_\lambda \log u_\lambda(x)\, \nabla_\lambda \log u_\lambda(x)'\right]$.
As $F_\lambda$ is symmetric and positive definite, $F_\lambda^{-1}$ has a Cholesky decomposition $F_\lambda^{-1} = L_\lambda' L_\lambda$.
Therefore $K(X, Y)$ can be rewritten as a dot-product between normalized
vectors $\mathcal{G}^X_\lambda = L_\lambda G^X_\lambda$. We will refer to $\mathcal{G}^X_\lambda$ as the Fisher vector of $X$.
      </p>
      <p>As generative model we use a GMM: $u_\lambda(x) = \sum_{i=1}^{M} w_i u_i(x)$, with parameters
$\lambda = \{w_i, \mu_i, \Sigma_i,\ i = 1, \ldots, M\}$. Gaussian $u_i$ has mixture weight $w_i$, mean vector $\mu_i$,
and covariance matrix $\Sigma_i$. We assume diagonal covariance matrices and denote the
variance vector by $\sigma_i^2$. Let $\mathcal{G}^X_{\mu,i}$ (resp. $\mathcal{G}^X_{\sigma,i}$) be the normalized gradient vector with
respect to $\mu_i$ (resp. $\sigma_i$) of Gaussian $i$. The final gradient vector $\mathcal{G}^X_\lambda$ is the concatenation
of the $\mathcal{G}^X_{\mu,i}$ and $\mathcal{G}^X_{\sigma,i}$ vectors for $i = 1, \ldots, M$, and is therefore $2MD$-dimensional,
with $D$ the dimensionality of the local descriptors.</p>
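      <p>As an illustration only (a simplified Python/NumPy sketch, not the implementation used for our runs; the per-component scaling constants follow our reading of [13, 14]), the mean- and variance-gradients can be computed as follows:</p>
      <preformat>
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Sketch: Fisher vector of T local descriptors X (T x D) under a
    diagonal-covariance GMM with M components (weights, means, variances)."""
    T, D = X.shape
    M = len(weights)
    # Posterior responsibility gamma(t, i) of Gaussian i for descriptor x_t.
    log_prob = np.zeros((T, M))
    for i in range(M):
        diff = X - means[i]
        log_prob[:, i] = (np.log(weights[i])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[i]))
                          - 0.5 * np.sum(diff ** 2 / variances[i], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Normalized gradients w.r.t. the means and the diagonal variances.
    parts = []
    for i in range(M):
        diff = (X - means[i]) / np.sqrt(variances[i])
        g_mu = (gamma[:, i:i + 1] * diff).sum(axis=0) / (T * np.sqrt(weights[i]))
        g_sig = (gamma[:, i:i + 1] * (diff ** 2 - 1.0)).sum(axis=0) / (T * np.sqrt(2.0 * weights[i]))
        parts.extend([g_mu, g_sig])
    return np.concatenate(parts)      # 2 * M * D dimensional
      </preformat>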
      <p>
        The Improved Fisher vector [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] takes the Fisher vector as described above and
adds L2 normalization and power normalization, both described in detail below.
      </p>
      <p>
        2.1 L2 Normalization</p>
      <p>It has been shown that the Fisher vector approximately discards image-independent (i.e.
background) information [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, the vector depends on the proportion of
image-specific information w.r.t. the proportion of background information. We use the L2
norm to cancel this effect.
      </p>
      <p>According to the law of large numbers, Eq. 1 can be approximated as
$G^X_\lambda \approx \nabla_\lambda \int_x p(x) \log u_\lambda(x)\, dx$. Assume that $p$ is a mixture containing a background
component ($u_\lambda$) and an image-specific component (with image-specific distribution $q$), and
let $\omega$ denote the mixing weight:</p>
      <p>$G^X_\lambda \approx \omega \nabla_\lambda \int_x q(x) \log u_\lambda(x)\, dx + (1 - \omega) \nabla_\lambda \int_x u_\lambda(x) \log u_\lambda(x)\, dx. \qquad (3)$</p>
      <p>Since the parameters $\lambda$ are estimated with a Maximum Likelihood approach (i.e. to
maximize $E_{x \sim u_\lambda} \log u_\lambda(x)$), the derivative of the background component is approximately
zero. Consequently, $G^X_\lambda \approx \omega \nabla_\lambda \int_x q(x) \log u_\lambda(x)\, dx$: the Fisher vector focuses on the
image-specific content, but depends on the proportion $\omega$ of the image-specific component.</p>
      <p>Therefore, two images containing the same object but at different scales will have
different signatures. To remove the dependence on $\omega$, we L2-normalize the vector $G^X_\lambda$,
or equivalently $\mathcal{G}^X_\lambda$.</p>
      <p>2.2 Power Normalization</p>
      <p>The power normalization is motivated by an empirical observation: Fisher vectors
become sparser as the number of Gaussians increases, because fewer descriptors $x_t$ are
assigned (with a significant probability) to each Gaussian, and the derivative with respect to the
parameters of a Gaussian without assigned descriptors is zero. Hence, the distribution of features in a given
dimension becomes more peaked around zero, as shown in Fig. 2.</p>
      <p>Linear classification requires a dot-product kernel; however, the L2 distance is a
poor measure of similarity on sparse vectors. Therefore we “unsparsify” the vector $z$ by
using
$f(z) = \mathrm{sign}(z)\, |z|^{\alpha}, \qquad (4)$
where $0 \le \alpha \le 1$ is a parameter of the normalization. The optimal value of $\alpha$ may vary
with the number $M$ of Gaussians in the GMM. Earlier experiments have shown that
$\alpha = 0.5$ is a reasonable value for $16 \le M \le 512$, so this value is used throughout the
experiments. In Fig. 2 the effect of this normalization is shown.
</p>
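      <p>A minimal sketch of the two normalisations (assuming $\alpha = 0.5$, the value used in our experiments); the power normalisation of Eq. 4 is applied first, followed by the L2 normalisation of Section 2.1:</p>
      <preformat>
import numpy as np

def power_normalize(z, alpha=0.5):
    # Eq. 4: f(z) = sign(z) * |z|**alpha, applied element-wise.
    return np.sign(z) * np.abs(z) ** alpha

def l2_normalize(z, eps=1e-12):
    # Section 2.1: divide by the L2 norm to cancel the dependence on omega.
    return z / (np.linalg.norm(z) + eps)

fv_raw = np.random.randn(2 * 256 * 64)    # placeholder Fisher vector (2MD dims)
fv = l2_normalize(power_normalize(fv_raw))
      </preformat>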
      <p>
        When combining the power normalization and the L2 normalization, we apply the
power normalization first and then the L2 normalization. We note that this does not
affect the analysis of the previous section: the L2 normalization on the power-normalized
vectors still removes the influence of the mixing coefficient $\omega$.</p>
      <p>2.3 Spatial Pyramids</p>
      <p>Spatial pyramid matching was introduced by Lazebnik et al. to take into account the
rough geometry of a scene [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It consists in repeatedly subdividing an image and
computing histograms of local features at increasingly fine resolutions by pooling
descriptor-level statistics. We follow the splitting strategy adopted by the winning systems of
PASCAL VOC 2008 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and extract 8 Fisher vectors per image: one for the whole image,
three for the top, middle and bottom regions and four for each of the four quadrants.
      </p>
      <p>In the case where Fisher vectors are extracted from sub-regions, the “peakiness”
effect will be even more exaggerated as fewer descriptors are pooled at a region-level
compared to the image-level. Hence, the power normalization is likely to be even more
beneficial in this case. When combining power normalization and L2 normalization
with spatial pyramids, we normalize each of the 8 Fisher vectors independently.</p>
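      <p>The splitting strategy can be sketched as follows (whole image, three horizontal bands, four quadrants, each Fisher vector normalised independently); the encode argument stands for any Fisher-vector extractor and is purely an assumption of this illustration:</p>
      <preformat>
import numpy as np

def spatial_pyramid_fv(descriptors, positions, width, height, encode):
    """Sketch: 8 Fisher vectors per image, each power- and L2-normalised
    independently, then concatenated."""
    regions = [(0.0, 0.0, width, height)]                       # 1x1 layout
    for k in range(3):                                          # 1x3 layout (bands)
        regions.append((0.0, k * height / 3.0, width, (k + 1) * height / 3.0))
    for qx in range(2):                                         # 2x2 layout (quadrants)
        for qy in range(2):
            regions.append((qx * width / 2.0, qy * height / 2.0,
                            (qx + 1) * width / 2.0, (qy + 1) * height / 2.0))
    fvs = []
    for (x0, y0, x1, y1) in regions:
        in_x = np.logical_and(np.greater_equal(positions[:, 0], x0), np.less(positions[:, 0], x1))
        in_y = np.logical_and(np.greater_equal(positions[:, 1], y0), np.less(positions[:, 1], y1))
        fv = encode(descriptors[np.logical_and(in_x, in_y)])
        fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalisation
        fv = fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalisation
        fvs.append(fv)
    return np.concatenate(fvs)
      </preformat>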
      <sec id="sec-3-1">
        <title>3 Image Annotation with TagProp</title>
        <p>
          In this section we present TagProp [
          <xref ref-type="bibr" rid="ref17 ref4">4, 17</xref>
          ], our weighted nearest neighbour annotation
model. We assume that some visual similarity or distance measures between images are
given, abstracting away from their precise definition. We discuss how to
use rank-based weights with multiple distances in Section 3.2, and we extend the model
by adding a per-word sigmoid function that can compensate for the different frequencies
of annotation terms in the database in Section 3.3.
        </p>
        <p>3.1 A Weighted Nearest Neighbour Model</p>
        <p>In the following we use $y_{iw} \in \{-1, +1\}$ to denote whether concept $w$ is relevant for
image $i$ or not. The probability that concept $w$ is relevant for image $i$, i.e. $p(y_{iw} = +1)$,
is obtained by taking a weighted sum of the relevance values for $w$ of neighbouring
training images $j$. Formally, we define:</p>
        <p>$p(y_{iw} = +1) = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j), \qquad (5)$</p>
        <p>$p(y_{iw} = +1 \mid j) = \begin{cases} 1 - \epsilon &amp; \text{for } y_{jw} = +1, \\ \epsilon &amp; \text{otherwise.} \end{cases} \qquad (6)$
The $\pi_{ij}$ denote the weight of training image $j$ when predicting the annotation for image
$i$. To ensure proper distributions, we require that $\pi_{ij} \ge 0$ and $\sum_j \pi_{ij} = 1$. The
introduction of $\epsilon$ is a technicality to avoid zero prediction probabilities when none of the
neighbours $j$ has the correct relevance value. In practice we fix $\epsilon = 10^{-5}$, although
the exact value has little impact on performance.</p>
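        <p>As a minimal sketch of Eqs. 5 and 6 (assuming the neighbour weights $\pi_{ij}$ are already given):</p>
        <preformat>
import numpy as np

EPS = 1e-5   # the epsilon of Eq. 6

def tagprop_predict(pi_i, y_neigh):
    """Sketch: probability that a concept is relevant for image i.
    pi_i    - weights pi_ij of the neighbours j (non-negative, summing to one)
    y_neigh - +1/-1 relevance of the concept for each neighbour j"""
    p_given_j = np.where(y_neigh == 1, 1.0 - EPS, EPS)   # Eq. 6
    return float(np.dot(pi_i, p_given_j))                # Eq. 5

# Example: 4 neighbours with uniform weights, concept present in 3 of them.
print(tagprop_predict(np.full(4, 0.25), np.array([+1, +1, -1, +1])))   # about 0.75
      </preformat>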
        <p>The parameters of the model control the weights $\pi_{ij}$. To estimate these parameters
we maximize the log-likelihood of predicting the correct annotations for training images
in a leave-one-out manner. Taking care to exclude each training image as a neighbour
of itself, i.e. by setting $\pi_{ii} = 0$, our objective is to maximize the log-likelihood:</p>
        <p>$\mathcal{L} = \sum_{i,w} \ln p(y_{iw}). \qquad (7)$</p>
        <p>
          3.2 Rank-based Weighting</p>
        <p>In our experiments we use rank-based TagProp, which has shown good performance on
the MIR Flickr database [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. When using rank-based weights we set $\pi_{ij} = \gamma_k$ if $j$ is
the $k$-th nearest neighbour of $i$. This directly generalizes a simple $K$ nearest neighbour
approach, where the $K$ nearest neighbours receive an equal weight of $1/K$. The data
log-likelihood (7) is concave in the parameters $\gamma_k$, and can be maximised using an
EM algorithm or a projected-gradient algorithm. In our implementation we use the latter
because of its speed. To limit the computational cost of the learning algorithm we only
allow non-zero weights for the first $K$ neighbours; typically $K$ is in the order of 100
to 1000. The number of parameters of the model then equals $K$. By pre-computing the
$K$ nearest neighbours of each training image, the run-time of the learning algorithm is
$O(NK)$ with $N$ the number of training images.
        </p>
        <p>In order to make use of several different distance measures between images we can
extend the model by introducing a weight for each combination of rank and distance
measure. For each distance measure $d$ we define a weight $\pi^d_{ij}$ that is equal to $\gamma_{dk}$ if
$j$ is the $k$-th neighbour of $i$ according to the $d$-th distance measure. The total weight
for an image $j$ is then given by the sum of weights $\pi_{ij} = \sum_d \pi^d_{ij}$ obtained using
the different distance measures. Again we require all weights to be non-negative and to
sum to unity: $\sum_{j,d} \pi^d_{ij} = 1$. In this manner we effectively learn rank-based weights
per distance measure, and at the same time learn how much to rely on the rank-based
weights provided by each distance measure.</p>
        <p>In the experiments we use a fixed $K = 1000$, independently of the number of
distance measures used, so the effective number of nearest neighbours per distance measure varies.
E.g. when two distance measures are used, we take the 500 nearest neighbours per distance measure.
An image might then occur twice, as a neighbour according to both distance measures.</p>
        <p>3.3 Word-specific Logistic Discriminants</p>
        <p>The weighted nearest neighbour model introduced above tends to have relatively low
recall scores for rare annotation terms. This effect is easy to understand as in order to
receive a high probability for the presence of a term, it needs to be present among most
neighbours with a significant weight. This, however, is unlikely to be the case for rare
annotation terms.</p>
        <p>To overcome this, we introduce a word-specific logistic discriminant model that can
boost the probability for rare terms and possibly decrease it for frequent ones. The
logistic model uses the weighted neighbour predictions by defining
$p(y_{iw} = +1) = \sigma(\alpha_w x_{iw} + \beta_w),$
where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the sigmoid function, and $x_{iw}$ is the weighted nearest
neighbour prediction for term $w$ and image $i$, cf. Eq. 5. The word-specific model adds
two parameters per annotation term.</p>
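        <p>A small sketch of this calibration step; writing the prediction as $\sigma(\alpha_w x_{iw} + \beta_w)$ with two parameters per term is our reading of the description above:</p>
        <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_specific_prediction(x_iw, alpha_w, beta_w):
    # Word-specific logistic discriminant applied to the weighted
    # nearest-neighbour prediction x_iw (cf. Eq. 5); alpha_w and beta_w
    # are the two parameters learned per annotation term.
    return sigmoid(alpha_w * x_iw + beta_w)
      </preformat>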
        <p>In practice we estimate the parameters $\{\alpha_w, \beta_w\}$ and $\pi_{ij}$ in an alternating
fashion. For fixed $\pi_{ij}$ the model is a logistic discriminant model, the log-likelihood
is concave in $\{\alpha_w, \beta_w\}$, and the model can be trained per term. In the other step we optimize
the parameters that control the weights $\pi_{ij}$ using gradient descent. We observe rapid
convergence, typically after alternating the optimization three times.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4 ImageCLEF Experiments</title>
        <p>In this section we describe the experiments for the VCDT. We evaluate the performance
of systems using the textual and visual modality and compare them to visual-only
systems. Also, we investigate the performance of per-keyword-trained SVMs compared to
TagProp. See Table 1 for an overview of our submitted runs.</p>
        <p>
          4.1 Dataset and Features</p>
        <p>The dataset of this year’s ImageCLEF VCDT was the MIR Flickr dataset [
          <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
          ]. In
contrast to last year, there were more concept classes (93) and the training set was
extended to 8,000 images. Also, in the ‘multi-modal’ approach it was allowed to use
the provided textual ‘Flickr-tag’ information during both the training and the test
phase.
        </p>
        <p>
          Features We extract our low-level visual features from 32×32 pixel patches on
regular grids (every 16 pixels) at five scales. Besides using 128-D SIFT-like Orientation
Histograms (ORH) descriptors [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we also use simple 96-D colour features (COL) in
the experiments. To obtain the latter, a patch is subdivided into 4×4 sub-regions (as
for the SIFT descriptor) and in each sub-region the mean and standard deviation for the
three R, G and B channels are computed. Both SIFT and colour features are reduced to
64 dimensions using Principal Component Analysis (PCA).
        </p>
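        <p>For illustration, the PCA projection could be computed as follows (scikit-learn is used here only as an example; the paper does not prescribe a particular implementation):</p>
        <preformat>
import numpy as np
from sklearn.decomposition import PCA

# Rows are 128-D ORH (SIFT-like) or 96-D colour patch descriptors
# pooled over the training images (placeholder data here).
descriptors = np.random.rand(10000, 128)
pca = PCA(n_components=64).fit(descriptors)
reduced = pca.transform(descriptors)       # 10000 x 64, input to the GMM / FV step
      </preformat>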
        <p>Table 1. Overview of our submitted runs (Name / Modality): SVM / Mixed, SVM / Mixed,
SVM / Visual, SVM / Visual, SLR / Visual, TagProp / Mixed, TagProp / Mixed,
TagProp / Visual, TagProp / Visual.</p>
        <p>In all our experiments, we use GMMs with $M = 256$ Gaussians to compute the
Fisher vectors (also referred to as FV in what follows). The GMMs are trained
using the Maximum Likelihood (ML) criterion and a standard Expectation-Maximization
(EM) algorithm.</p>
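        <p>A corresponding sketch of the GMM training step (again, scikit-learn is used purely for illustration):</p>
        <preformat>
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the PCA-projected 64-D local descriptors of the training set.
local_descriptors = np.random.rand(20000, 64)

# ML-trained GMM with M = 256 diagonal-covariance Gaussians, fit with EM.
gmm = GaussianMixture(n_components=256, covariance_type='diag', max_iter=100)
gmm.fit(local_descriptors)
weights, means, variances = gmm.weights_, gmm.means_, gmm.covariances_
      </preformat>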
        <p>We extracted visual features using three spatial layouts (1×1, 2×2, and 1×3) as
described in Section 2.3. The dimensionality of each FV is $M \times (2 \times 64)$, since we take
the derivative w.r.t. the mean and the (diagonal) covariance. For each layout the component
Fisher vectors were simply concatenated (e.g. 3 FVs in the 1×3 layout).</p>
        <p>In some of the experiments we also use textual information (here the Flickr-tags).
As textual representation for an image we use the binary absence/presence vector of
the 698 most common tags among the over 53,000 provided Flickr-tags. We required
each tag to be present in both the train set and the test set, and to occur at least
25 times. This binary feature vector for each image $i$ is L2-normalized (denoted by
$t_i$). The tag-similarity $s^T_{ij}$ between the tags of image $i$ and image $j$ is the dot-product
$s^T_{ij} = t_i \cdot t_j$.</p>
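        <p>A minimal sketch of the tag representation and similarity (shown with a toy vocabulary; the real vocabulary contains the 698 most common tags):</p>
        <preformat>
import numpy as np

def tag_features(image_tags, vocabulary):
    """Sketch: binary presence vector over the tag vocabulary, L2-normalised
    (the t_i of the text)."""
    t = np.array([1.0 if w in image_tags else 0.0 for w in vocabulary])
    return t / (np.linalg.norm(t) + 1e-12)

vocabulary = ['sky', 'sunset', 'australia', 'beach']     # toy vocabulary
t_i = tag_features({'sunset', 'australia'}, vocabulary)
t_j = tag_features({'sunset', 'beach'}, vocabulary)
similarity = float(np.dot(t_i, t_j))                     # s_ij = t_i . t_j
      </preformat>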
        <p>
          4.2 SVM Experiments</p>
        <p>In these experiments we wanted to investigate, on the one hand, the effect of using both
visual and textual modalities, and on the other hand the different fusion techniques
(early and late) in this context. Since we use the FV representation, with the
corresponding dot-product similarities, we use linear SVMs for all experiments. In all our
experiments, we used the LIBSVM package [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] with $C = 1$ (preliminary
cross-validation results have shown this is a reasonable choice for this task).
Late Fusion For the late fusion experiments we have learned, for each concept, a
classifier per low-level feature (FV-ORH, FV-COL) and per spatial layout (1×1, 2×2, 1×3),
leading to 6 visual classifiers per concept. In addition, we trained a classifier per
concept on the textual features ($t_i$). The scores of the Late Fusion SVM are obtained by
averaging the scores of the individual classifiers with equal weights. For the mixed
modality we average over 7 scores, and for the visual-only over 6 scores.
        </p>
        <p>
          We have also included a visual-only late fusion experiment using Linear Sparse
Logistic Regression (SLR) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], instead of SVM. SLR is a logistic regression classifier
with a Laplacian prior. It uses the log-loss (instead of the hinge loss), and the
probabilistic output might be more interpretable. Nevertheless, on all the measurements the
corresponding SVM outperformed the SLR run (see Table 2).
        </p>
        <p>Early Fusion For the early fusion experiments we have to concatenate the feature
vectors. Since we use dot-product kernels $K_d(i, j)$, concatenation of the feature vectors is
equivalent to the Early Fusion kernel $K_{\mathrm{EF}}(i, j) = \sum_d K_d(i, j)$. We learn one SVM
per concept using this kernel. We have experimented with visual-only ($d \in \{1, \ldots, 6\}$)
and mixed-modality ($d \in \{1, \ldots, 7\}$) classifiers.</p>
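        <p>The equivalence between summing dot-product kernels and concatenating the feature vectors can be checked in a few lines:</p>
        <preformat>
import numpy as np

# Two feature types for a pair of images (e.g. a visual FV and the tag vector).
x_vis, y_vis = np.random.rand(100), np.random.rand(100)
x_tag, y_tag = np.random.rand(50), np.random.rand(50)

k_sum = np.dot(x_vis, y_vis) + np.dot(x_tag, y_tag)                  # sum of kernels
k_cat = np.dot(np.concatenate([x_vis, x_tag]), np.concatenate([y_vis, y_tag]))
assert np.isclose(k_sum, k_cat)    # early-fusion kernel = concatenated features
      </preformat>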
        <p>Scoring Note that only the final scores (after either late or early fusion) were normalized
to be between 0 and 1, as required. We defined our confidence score as $x' = (x -
\min(X)) / (\max(X) - \min(X))$. This normalization does not affect the ordering, and
therefore does not influence the per-concept evaluation. The threshold for the binary
decision (for the per-image evaluation) was set to 0 on the original scoring function $x$.</p>
        <p>
          4.3 TagProp Experiments</p>
        <p>Concerning TagProp, we wanted to investigate on the one hand the performance
improvement from using the textual modality, and on the other hand the performance difference
between SVM and TagProp using the FV representation. Therefore we have used
exactly the same features and distance measures in these experiments as in the previous
section. We used the word-specific rank-based TagProp, as described in
Section 3. For all experiments we used $K = 1000$, which is a good choice on this
dataset as shown in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>We ran two different sets of experiments using TagProp, one with and one
without combining the spatial layouts. When combining the different spatial layouts
we sum the kernels $K_d(i, j)$ of the three spatial layouts to compute a single
FV-ORH and a single FV-COL kernel. This is equivalent to early fusion of the spatial layout
vectors. Using these combined visual kernels reduces the number of similarities used
in TagProp; therefore effectively more neighbours per similarity are used, which might
result in a better set of nearest neighbours. The 3rd (resp. 7th) feature (see Table 1) is
the textual kernel based on $t_i$. To obtain $K = 1000$ nearest neighbours from $D$
different similarity measures, we select from each similarity measure the $K_d = \lceil K/D \rceil$
nearest neighbours, and concatenate those into $K = \{K_1, \ldots, K_D\}$.</p>
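        <p>A sketch of this neighbour selection over D similarity measures (the ranked_lists argument, one list of neighbour indices per similarity measure sorted by decreasing similarity, is an assumption of this illustration):</p>
        <preformat>
import math

def select_neighbours(ranked_lists, K=1000):
    """Sketch: take ceil(K/D) nearest neighbours from each of the D
    similarity measures and concatenate them (an image may appear twice)."""
    D = len(ranked_lists)
    per_measure = math.ceil(K / D)
    selected = []
    for ranked in ranked_lists:
        selected.extend(ranked[:per_measure])
    return selected

# Example with D = 2 similarity measures: 500 neighbours taken from each.
neighbours = select_neighbours([list(range(2000)), list(range(2000))], K=1000)
print(len(neighbours))   # 1000
      </preformat>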
        <p>The output of TagProp is a probability value; therefore we use it directly as the
confidence score. For the binary decision scores we use a threshold of 0.5.</p>
        <p>
          4.4 Analysis of the Results</p>
        <p>Performance evaluation To determine the quality of the annotations, five measures were
used, three for the evaluation per concept and two for the evaluation per photo. For the
evaluation per concept the mean Average Precision (mAP), the equal error rate (EER),
and the area under the curve (AUC) are used, computed on the confidence scores. For the
evaluation per photo the example-based F-Measure (F-ex) and the Ontology Score with
Flickr Context Similarity cost map (OS) are used, which use the binary annotation
scores. More details on these measures can be found in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Overview of Results In Table 2 we list the performance of our submitted runs and
the highest scoring competitors, sorted by mAP value. In Fig. 3 we show individual
concept-based comparisons of different algorithms; see also the caption for more details.
From these results we can deduce that:
– All our approaches using visual and tag features outperform any of the visual-only
approaches. The performance is increased on the order of 5-8% in mAP.
– While early fusion outperforms late fusion when we use the textual feature, there is
no clear winner for the visual-only classifiers. The reason might be that the textual
information is more complementary, while there is more redundancy between the
different visual features.
– Combining the spatial-layout features into a single similarity (used in TagProp)
gives slightly better results. This might be due to the fact that effectively more
neighbours per similarity measure are used.
– While linear-SVM classifiers outperform TagProp, the performance is quite similar,
especially for the mixed modality approach. The latter might be due to the weights
TagProp learns for the two modalities, while the SVM uses an equal weighting.
This conclusion confirms the observations made in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] using a different set of features.</p>
        <p>Fig. 3. Comparisons of different submissions; in each plot the AP of each concept is plotted.
Plot (a) shows the performance of the best scoring SVM classifier (V&amp;T) versus the visual-only
SVM. Plots (b) and (c) compare the early versus late fusion SVMs. Plots (d) and (e) compare
TagProp versus the early fusion SVMs. Plot (f) shows the performance of the best visual
submission (UvA-MKL) versus our best visual-only SVM.</p>
        <p>– Finally, the performance of our best visual-only classifier is close to the best scoring
visual-only UvA-MKL classifier, while we use a fast image representation with
linear SVMs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5 Conclusions</title>
        <p>Our goal for the ImageCLEF VCDT 2010 challenge was to take advantage of the
available textual information. The experiments have shown that all our methods combining
visual and textual modalities outperform the best visual-only classifiers. Our best
scoring classifier obtains 45.5% mAP, about 5% (absolute) higher than the best visual-only system.</p>
        <p>In addition, we have compared two different approaches: linear SVM classifiers versus
TagProp (a k-NN classifier). The results show that the SVM approach is superior to
TagProp, but TagProp is able to compete. We believe that both methods allow
for learning from datasets with large numbers of images and/or concepts. Linear SVMs
have proven to scale to very large quantities of images. TagProp is especially interesting
for cases with many concepts and partially labelled datasets.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBSVM: a library for support vector machines (</article-title>
          <year>2001</year>
          ), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Csurka</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dance</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willamowski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Visual categorization with bags of keypoints</article-title>
          .
          <source>In: ECCV SLCV Workshop</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gool</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results</article-title>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Guillaumin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Tagprop:
          <article-title>Discriminative metric learning in nearest neighbor models for image auto-annotation</article-title>
          . In: ICCV (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The MIR Flickr retrieval evaluation</article-title>
          .
          <source>In: ACM MIR</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jaakkola</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haussler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploiting generative models in discriminative classifiers</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Training linear svms in linear time</article-title>
          .
          <source>In: KDD</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Krishnapuram</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Figueiredo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartemink</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Sparse multinomial logistic regression: Fast algorithms and generalization bounds</article-title>
          .
          <source>PAMI</source>
          <volume>27</volume>
          (
          <issue>6</issue>
          ),
          <fpage>957</fpage>
          -
          <lpage>968</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
          </string-name>
          , J.:
          <article-title>Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories</article-title>
          .
          <source>In: CVPR</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crandall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huttenlocher</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Landmark classification in large-scale image collections</article-title>
          . In: ICCV (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>IJCV</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ) (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010</article-title>
          . In: Working Notes of CLEF (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dance</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Fisher kernels on visual vocabularies for image categorization</article-title>
          .
          <source>In: CVPR</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving the fisher kernel for large-scale image classification</article-title>
          .
          <source>In: ECCV</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Shalev-Shwartz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srebro</surname>
          </string-name>
          , N.:
          <article-title>Pegasos: Primal estimated sub-gradient solver for SVM</article-title>
          . In: ICML (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>The Nature of Statistical Learning Theory</article-title>
          . Springer Verlag (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guillaumin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Image annotation with tagprop on the mirflickr set</article-title>
          .
          <source>In: ACM Multimedia Information Retrieval</source>
          (mar
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>