               ISI at ImageCLEF 2012:
        Scalable System for Image Annotation

       Yoshitaka Ushiku, Hiroshi Muraoka, Sho Inaba, Teppei Fujisawa,
        Koki Yasumoto, Naoyuki Gunji, Takayuki Higuchi, Yuko Hara,
                   Tatsuya Harada, and Yasuo Kuniyoshi

         Intelligent Systems and Informatics Lab., the University of Tokyo
                         {ushiku,muraoka,inaba,fujisawa,
                        k-yasumoto,gunji,higuchi,y-hara,
                   harada,kuniyosh}@isi.imi.i.u-tokyo.ac.jp
                       http://www.isi.imi.i.u-tokyo.ac.jp



      Abstract. We participated in the ImageCLEF 2012 Photo Annotation
      tasks. Our focus is on making the system scalable with respect to the
      amount of data, so we train linear classifiers with our online multilabel
      learning method. For the Flickr Photo task, we extract visual Fisher
      Vectors (FVs) from several kinds of local descriptors and use the provided
      Flickr tags as textual features. For the Web Photo tasks, we use only the
      provided Bag-of-Visual-Words (BoVW) computed from several kinds of SIFT
      descriptors. A linear classifier for each label is obtained with an online
      multilabel learning algorithm, Passive-Aggressive with Averaged Pairwise
      Loss (PAAPL). The results show that our scalable system achieves good
      performance in all the tasks we participated in.


1   Introduction
In this paper, we describe our method for the ImageCLEF 2012 Photo Annotation
tasks. In particular, we address three tasks: concept annotation using Flickr
photos [8], improving performance in the Flickr concept annotation task using
Web photos [10], and scalable concept image annotation using Web photos [10].
    We pay particular attention to scalability with respect to the amount of data.
In the literature, many techniques have been developed to improve the performance
of object recognition. Although some of them have succeeded by introducing a
complicated classifier such as the multiple kernel SVM, the complexity of learning
and annotation is a problem. Because many kinds of labels require a large amount of
training data, scalability with respect to the data amount is important for generic
object recognition.
    Consequently, our objective is to investigate scalable methods for feature
extraction, learning, and annotation. Recent studies on large-scale image
classification adopt online learning of linear classifiers. In [7, 4], the high-
dimensional features of [6, 11] are used to learn 1,000 classes from over a
million images. In fact, Fisher Vectors (FVs) and a linear SVM won ImageCLEF
2010, as described in [5].

    Our main contribution is the investigation of our novel online learning method
for the multilabel problem. Because batch learning that loads all training samples
at once is impossible at this scale, online learning is a promising way to realize
scalability. In [7, 4], online SVM learning with Stochastic Gradient Descent [1]
(SGD-SVM) is applied in a one-vs.-the-rest manner. The classifier for a label
is obtained by regarding images associated with the label as positive samples and
the remaining images as negative samples. Labels are then output according
to the scores of the binary classifiers. Nevertheless, there is no guarantee that
the outputs of SVMs trained for different labels have comparable scales. Thus we
investigate the multiclass Passive-Aggressive (PA) algorithm [2] to solve this
problem. In [9], we proposed Passive-Aggressive with Averaged Pairwise Loss
(PAAPL) for samples to which multiple labels are attached. First, we use an
averaged pairwise loss instead of the hinge loss of PA. Second, we randomly
select label pairs at every update. These two improvements allow PAAPL to
converge faster than PA.


2     Feature Extraction
In this section, we describe the features used in the three tasks. For the Flickr
Photo task, we extract FVs as visual features and several kinds of BoW features
computed from Flickr tags as textual features. For the Web Photo tasks, we use
only the provided Bag-of-Visual-Words.

2.1   Visual Features
Bag-of-Visual-Words. Bag-of-Visual-Words (BoVW) is quite a popular ap-
proach to image classification, because it achieves good performance despite its
simplicity. The main idea is that images are treated as loose collections of
K codewords (representative local descriptors), and that each key-point patch,
from which a local descriptor is extracted, is sampled independently. The BoVW
feature is obtained by building a histogram of the number of local descriptors
assigned to each codeword. The dictionary, which consists of K codewords, has
to be generated in advance by unsupervised clustering of training samples, and
each local descriptor is assigned to the nearest codeword in the dictionary. The
BoVW vector is therefore K-dimensional.
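
As an illustration, a minimal NumPy sketch of this hard-assignment encoding is
given below; the codebook is assumed to have been learned beforehand (e.g., with
k-means), and all names are illustrative rather than taken from the authors' code.

    import numpy as np

    def bovw_encode(descriptors, codebook):
        """Encode the local descriptors of one image as a K-dimensional BoVW histogram.

        descriptors: (T, D) array of local descriptors.
        codebook:    (K, D) array of codewords learned beforehand.
        """
        # Squared Euclidean distance from every descriptor to every codeword.
        d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)                               # hard assignment
        return np.bincount(nearest, minlength=codebook.shape[0])  # K-dim histogram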

Fisher Vectors. The Fisher Vector (FV), which can be regarded as an extension of
the BoVW representation, is a standard approach to large-scale image recognition.
BoVW utilizes zeroth-order statistics of the distribution of local descriptors,
whereas FV also utilizes first- and second-order statistics. The distribution of
local descriptors is fitted with a mixture of K Gaussians, and the gradients of the
log-likelihood with respect to the model parameters are computed. The gradients
describe in which direction the model parameters should be modified to obtain a
better description of the image. The dimensions are then whitened by multiplying
by the inverse square root of the Fisher information matrix.
                                                           ISI at ImageCLEF 2012       3

   We denote T local descriptors by X = {x_1, x_2, ..., x_T}, and the mixture
weight, mean vector, and standard deviation vector of the i-th Gaussian by w_i,
µ_i, and σ_i, respectively. Since the covariance matrices are assumed to be
diagonal, we denote the variance vector by σ_i^2. The FV representation is then
given as

\[
\mathcal{G}^{X}_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i)\,\frac{x_t - \mu_i}{\sigma_i}, \tag{1}
\]
\[
\mathcal{G}^{X}_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i)\left[\frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1\right], \tag{2}
\]
where γ_t(i) is the soft assignment of x_t to the i-th Gaussian:
\[
\gamma_t(i) = \frac{w_i\, u_i(x_t)}{\sum_{j=1}^{K} w_j\, u_j(x_t)}. \tag{3}
\]

    We then obtain a 2KD-dimensional vector, where D is the dimensionality of the
local descriptors, by concatenating G^X_{µ,i} and G^X_{σ,i} over all i. To enhance
the performance, power normalization has been shown to be effective in [6]. The
vectors are L2-normalized after the power normalization.
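
A minimal NumPy sketch of Eqs. (1)-(3), including the power and L2 normalizations,
is given below; the GMM parameters are assumed to have been fitted beforehand with
EM, and all names are illustrative rather than taken from the authors' code.

    import numpy as np

    def fisher_vector(X, w, mu, sigma2, alpha=0.5):
        """X: (T, D) local descriptors; w: (K,), mu, sigma2: (K, D) GMM parameters."""
        T = X.shape[0]
        diff = X[:, None, :] - mu[None, :, :]                     # (T, K, D)
        # Soft assignments gamma_t(i) of Eq. (3), computed in the log domain.
        log_p = (-0.5 * (diff ** 2 / sigma2[None]).sum(2)
                 - 0.5 * np.log(2.0 * np.pi * sigma2).sum(1)[None]
                 + np.log(w)[None])                               # (T, K)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        sigma = np.sqrt(sigma2)
        # Gradients with respect to the means and variances, Eqs. (1) and (2).
        G_mu = (gamma[:, :, None] * diff / sigma[None]).sum(0) / (T * np.sqrt(w))[:, None]
        G_sig = (gamma[:, :, None] * (diff ** 2 / sigma2[None] - 1.0)).sum(0) \
                / (T * np.sqrt(2.0 * w))[:, None]
        fv = np.concatenate([G_mu.ravel(), G_sig.ravel()])        # 2KD dimensions
        fv = np.sign(fv) * np.abs(fv) ** alpha                    # power normalization
        return fv / max(np.linalg.norm(fv), 1e-12)                # L2 normalization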

2.2     Text Features
We use the Bag-of-Words (BoW) representation, which is based on the idea that each
word in a text appears independently. A BoW vector is obtained by counting the
occurrences of words in a text. In our method, the feature is additionally
transformed in the following two ways: TF-IDF weighting and L2-normalization.

TF-IDF weight We regard the typicality of each Flickr tag as a clue to how
strongly the tag relates to the image's content. Therefore, we use the TF-IDF
value for each element of a BoW vector.

L2-normalization To reduce the effect of differing numbers of tags among
images, we simply L2-normalize the BoW vectors.
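
A minimal sketch of how the 2 x 2 text-feature variants could be produced follows;
the exact TF-IDF formula used here is not specified in this note, so the standard
log(N/df) weighting shown below is an assumption, and all names are illustrative.

    import numpy as np

    def bow_features(tag_lists, vocab):
        """tag_lists: list of tag lists, one per image; vocab: {word: index}."""
        X = np.zeros((len(tag_lists), len(vocab)))
        for i, tags in enumerate(tag_lists):
            for t in tags:
                if t in vocab:
                    X[i, vocab[t]] += 1.0                  # raw BoW counts
        # TF-IDF: weight each word by log(N / document frequency).
        df = (X > 0).sum(axis=0)
        idf = np.log(len(tag_lists) / np.maximum(df, 1.0))
        X_tfidf = X * idf
        # L2 normalization to reduce the effect of different tag counts.
        def l2(M):
            return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
        return X, X_tfidf, l2(X), l2(X_tfidf)              # the 2 x 2 variants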

3     Online Multilabel Learning
To learn a model for each label from a wide variety of images, the learning
algorithm must combine scalability with respect to the data amount and accuracy
of label estimation, and must also be tolerant of label noise.
    Given the t-th training sample x_t ∈ R^d associated with a label set Y_t, a
subset of Y = {1, ..., n_y}, it is classified with the current weight vectors
µ^{y_i}_t (i = 1, ..., n_y)^1 as
\[
\hat{y}_t = \arg\max_{y_i} \mu^{y_i}_t \cdot x_t. \tag{4}
\]
^1 Here, the bias b is included in µ_t as µ_t ← [µ_t^⊤, b]^⊤ by redefining x_t ← [x_t^⊤, 1]^⊤.

If necessary, multiple labels are estimated in score order.
    Multilabeling for one sample is applicable by defining n_y > 1. Here, the
hinge loss ℓ is given as
\[
\ell(\mu^{r_t}_t, \mu^{s_t}_t; (x_t, Y_t)) =
\begin{cases}
0 & \mu^{r_t}_t \cdot x_t - \mu^{s_t}_t \cdot x_t \ge 1 \\
1 - (\mu^{r_t}_t \cdot x_t - \mu^{s_t}_t \cdot x_t) & \text{otherwise,}
\end{cases} \tag{5}
\]
where $r_t = \arg\min_{r \in Y_t} \mu^{r}_t \cdot x_t$ and $s_t = \arg\max_{s \notin Y_t} \mu^{s}_t \cdot x_t$.
   PA is an online learning framework for binary and multiclass classification,
regression, uniclass estimation, and structured prediction. The biggest benefit of
PA is that the update coefficient is calculated analytically according to the loss.
In contrast, SGD-based methods and the traditional perceptron require the
coefficient to be designed by hand.
   Here we seek to decrease the hinge loss of the multiclass classification
without changing the weights radically. Consequently, we obtain the following
formulation:
\[
\mu^{r_t}_{t+1}, \mu^{s_t}_{t+1} = \arg\min_{\mu^{r_t},\, \mu^{s_t}} \|\mu^{r_t} - \mu^{r_t}_t\|^2 + \|\mu^{s_t} - \mu^{s_t}_t\|^2 + C\xi^2, \tag{6}
\]
\[
\text{s.t.}\quad \ell(\mu^{r_t}, \mu^{s_t}; (x_t, Y_t)) \le \xi \quad \text{and} \quad \xi \ge 0. \tag{7}
\]
Therein, ξ denotes a slack variable bounding the loss, and C is a parameter that
reduces the negative influence of noisy labels. The solution can be derived using
Lagrange's method of undetermined multipliers, yielding
\[
\mu^{r_t}_{t+1} = \mu^{r_t}_t + \tau_t x_t, \qquad \mu^{s_t}_{t+1} = \mu^{s_t}_t - \tau_t x_t, \tag{8}
\]
\[
\tau_t = \min\{C,\; \ell(\mu^{r_t}_t, \mu^{s_t}_t; (x_t, Y_t)) / (2\|x_t\|^2)\}. \tag{9}
\]

   This variant of PA is called PA-II in [2]. Both PA and SGD-SVM have closed-form
updates; indeed, PA for binary classification and SGD-SVM without L2 regularization
have the same update rule. The differences between SGD-SVM and PA here are (1)
binary vs. multiclass, (2) the form of regularization, and (3) the number of
parameters to be tuned.
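
For concreteness, a minimal NumPy sketch of this update (Eqs. (4)-(9)) for a
single sample follows. It is a sketch of the stated formulation, not the authors'
implementation; W is an (n_y x d) weight matrix and all names are illustrative.

    import numpy as np

    def pa_update(W, x, Y_t, C):
        """One Passive-Aggressive step on sample x (shape (d,)) with label set Y_t."""
        scores = W @ x                                    # Eq. (4): scores of all labels
        irrelevant = np.setdiff1d(np.arange(W.shape[0]), list(Y_t))
        r = min(Y_t, key=lambda y: scores[y])             # lowest-scored relevant label
        s = irrelevant[np.argmax(scores[irrelevant])]     # highest-scored irrelevant label
        loss = max(0.0, 1.0 - (scores[r] - scores[s]))    # hinge loss, Eq. (5)
        if loss > 0.0:
            tau = min(C, loss / (2.0 * np.dot(x, x)))     # Eq. (9)
            W[r] += tau * x                               # Eq. (8)
            W[s] -= tau * x
        return W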

3.1    Passive-Aggressive with Averaged Pairwise Loss
PA is an online learning method for classification, but it can also be applied
when a sample is associated with multiple labels. Indeed, the Passive-Aggressive
Model for Image Retrieval (PAMIR) [3] was proposed by applying PA to image
retrieval.
    However, these methods consider only one relevant label and one irrelevant
label per update. As a result, the models of some labels are rarely updated and
convergence is delayed.
    Therefore, in [9] we proposed a novel online learning algorithm for samples to
which multiple labels are attached. General online learning methods consist of two
steps: classification of the t-th sample, and update of the models.

(Figure 1 consists of two panels, "Hinge-loss" and "Averaged Pairwise Loss", each
plotting the scores of the correct labels before and after an update.)

          Fig. 1. Comparison between hinge-loss and averaged pairwise loss.



Given the d-dimensional weight vectors µ for all n_y labels, the complexity of
classifying one sample is O(dn_y), while that of updating one model is O(d). If we
update all models with the given labels Y_t, the update complexity becomes
O(d|Y_t|). In image annotation, and especially in sentence generation, we can
assume n_y ≫ |Y_t|. Therefore, since classification is the rate-controlling step,
the total computation time remains much the same whether we update one model or
|Y_t| models. Fig. 1 shows the conceptual difference between the hinge loss and the
loss used in the proposed method. Thus the proposed PAAPL achieves efficiency by
averaging all pairwise losses between relevant and irrelevant labels.

 1. Given the t-th image, define a label set Ȳ_t by selecting highly scored
    irrelevant labels from the n_y labels.
 2. Randomly select one relevant label r_t from Y_t and one irrelevant label s_t
    from Ȳ_t.
 3. Based on the hinge loss between r_t and s_t, 1 − (µ^{r_t}_t · x_t − µ^{s_t}_t · x_t),
    update the models according to PA.

    Additionally, we investigate a way to reduce the O(dn_y) complexity of the
classification step. In [12], approximating the loss function by random selection
of labels is an important step for online learning on less powerful computers.
Although random selection may miss incorrectly classified labels at a higher rate,
it was experimentally verified that correct models can eventually be obtained.
Therefore, we also adopted random selection, which leads to the following
procedure (a sketch is given after the list).

 1. Randomly select one relevant label r_t from Y_t.
 2. Define an irrelevant label s_t by random selection from outside Y_t and compute
    the hinge loss 1 − (µ^{r_t}_t · x_t − µ^{s_t}_t · x_t). Continue selecting s_t until
    the loss becomes positive.
 3. If the hinge loss becomes positive, update the models for r_t and s_t according to
    PA; otherwise move on to the next training sample.
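
Below is a minimal sketch of this random-selection PAAPL step, assuming the same
weight-matrix representation as the PA sketch above; the cap on the number of
sampling trials (max_trials) is an illustrative assumption not stated in this note.

    import numpy as np

    def paapl_update(W, x, Y_t, C, rng, max_trials=10):
        """PAAPL step: random relevant/irrelevant pair, PA update when the loss is positive."""
        labels = np.arange(W.shape[0])
        irrelevant = np.setdiff1d(labels, list(Y_t))
        r = rng.choice(list(Y_t))                        # random relevant label
        for _ in range(max_trials):                      # keep sampling irrelevant labels
            s = rng.choice(irrelevant)
            loss = 1.0 - (W[r] @ x - W[s] @ x)           # pairwise hinge loss
            if loss > 0.0:                               # update only on a violating pair
                tau = min(C, loss / (2.0 * np.dot(x, x)))
                W[r] += tau * x
                W[s] -= tau * x
                break
        return W

    # usage sketch: rng = np.random.default_rng(0); W = np.zeros((n_y, d)); call once per sample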

4     Results

In this section, we describe the details of the methods we used for the Flickr
Photo annotation task, Web Photo Subtask 1, and Web Photo Subtask 2, respectively.


4.1   Photo Flickr

In our experiment, we extracted five kinds of visual descriptors from each image:
SIFT and LBP at five patch sizes, and color SIFTs (C-SIFT, RGB-SIFT, Opponent-
SIFT) at three patch sizes. As pre-processing, the images were resized to at most
300 × 300 pixels while maintaining their aspect ratio. To calculate SIFT and LBP,
the images were converted to gray scale, whereas the color SIFTs used the color
information. Each descriptor was then densely sampled on a regular grid (every six
pixels). The dimensionalities of SIFT, LBP, and the color SIFTs were 128, 1024,
and 384, respectively. All of these descriptors were reduced to 64 dimensions with
PCA, and then coded into two state-of-the-art global feature representations
(5 × 2 = 10 visual features in total). One is the FV explained in the previous
section. First, we trained a mixture of 256 Gaussians using the standard EM
algorithm. To embed spatial information, FVs were calculated over 1 × 1, 2 × 2,
and 3 × 1 cells, respectively. In this way, we obtained FVs whose dimensionality
was 64 × 256 × 8 × 2 = 262,144. The other is Locality-constrained Linear Coding
(LLC) [11], which describes each local descriptor by a linear weighted sum of a
few nearest codewords. In our experiment, 4,096 codewords were generated with the
k-means algorithm, and each local descriptor was then approximated using its three
nearest codewords. The images were divided into 1 × 1, 2 × 2, and 3 × 3 spatial
grids, differently from FV, so the dimensionality was 4,096 × 14 = 57,344.
    As text features, BoW vectors were extracted from the Flickr tags, and we also
prepared a variant in which the dimensions corresponding to words appearing 24
times or fewer were removed. Furthermore, all combinations of the two kinds of
processing (TF-IDF, L2-normalization) were applied, so 2 × 2^2 = 8 text features
were generated in total. Finally, the classifiers for the 10 visual features and
8 text features were trained separately using PAAPL. The trade-off parameter was
set to C = 10^5. All the experiments in the following subsection were run on a
workstation with 12 CPU cores and 96 GB of RAM.
    The validation set consisted of one third of the training images, and valida-
tion was done only twice owing to the lack of time.
    The size of a visual feature such as FV or LLC tends to be large. It is known
to be effective to quantize such vectors with Product Quantization (PQ), as
described in [7, 4]. First, we investigated the effect of PQ on performance using
FV-SIFT. The PQ parameters were decided empirically as b = 1, G = 8. We iterated
PAAPL learning 15 times. As a result, FV-SIFT achieved an F1-measure (F1) of
0.5604 with PQ and 0.5632 without PQ. Because the drop in performance was not
significant, we quantized the visual features of the training samples to save RAM.
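
PQ itself is not detailed in this note, so the following is only a generic sketch
of the idea (split each vector into sub-vectors, quantize each sub-vector with its
own small codebook, and store only the code indices); the parameters n_sub and
n_centroids are illustrative and do not necessarily correspond to the b and G above.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def pq_train(X, n_sub, n_centroids):
        """Learn one codebook per sub-vector. X: (N, D) with D divisible by n_sub."""
        sub_dim = X.shape[1] // n_sub
        codebooks = []
        for g in range(n_sub):
            block = X[:, g * sub_dim:(g + 1) * sub_dim]
            centroids, _ = kmeans2(block, n_centroids, minit='points')
            codebooks.append(centroids)
        return codebooks

    def pq_encode(x, codebooks):
        """Replace each sub-vector of x by the index of its nearest centroid."""
        sub_dim = codebooks[0].shape[1]
        codes = []
        for g, cb in enumerate(codebooks):
            sub = x[g * sub_dim:(g + 1) * sub_dim]
            codes.append(np.argmin(((cb - sub) ** 2).sum(axis=1)))
        return np.array(codes, dtype=np.uint8)

    def pq_decode(codes, codebooks):
        """Approximate reconstruction used when a quantized sample is fed to the learner."""
        return np.concatenate([cb[c] for c, cb in zip(codes, codebooks)])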

    Since the number of runs that could be submitted was limited, we first examined
which combinations of visual features were effective, and then investigated which
text features should be added to achieve the best performance in the next
experiment.
    Table 1 shows the top six F1-measures among the 2^10 = 1,024 combinations of
visual features. These features were all quantized with PQ. Note that all LLC
features turn out to be ineffective here. We therefore chose to extract FVs from
SIFT, C-SIFT, and LBP, which achieved the best performance.



                FV-SIFT             ✓      ✓      ✓      ✓        ✓   ✓
                 FV-LBP             ✓      ✓      ✓      ✓        ✓   ✓
            FV-OpponentSIFT         -      -      -      ✓        -   ✓
                FV-cSIFT            ✓      ✓      -      -        ✓   ✓
               FV-rgbSIFT           -      ✓      ✓      -        ✓   -
                LLC-SIFT            -      -      -      -        -   -
                LLC-LBP             -      -      -      -        -   -
            LLC-OpponentSIFT        -      -      -      -        -   -
               LLC-cSIFT            -      -      -      -        -   -
              LLC-rgbSIFT           -      -      -      -        ✓   -
               F1-measure       0.5715 0.5707 0.5703 0.5693 0.5693 0.5688

                    Table 1. Top combinations of visual features.




    Table 2 shows the F1-measures of the eight text features. Thresholding with
respect to the word frequency turned out not to improve performance. Therefore,
we chose only the four text features that were not thresholded.


                                histogram  histogram (L2)  TF-IDF  TF-IDF (L2)
            no threshold          0.5157       0.5156      0.5135     0.5121
            threshold             0.5114       0.5109      0.5020     0.4990

                     Table 2. F1-measures of the eight text features.




    Finally, we present the top six F1-measures (F1) of the combinations of the
three visual features and the four text features in Table 3.^2 Following these
results, we submitted the runs described in Table 4 and obtained the scores also
shown there. Note that none of the visual features extracted from the test images
were quantized.

^2 FV from SIFT is not quantized here.


           Visual (FV)                    Textual (BoW)
                                                                               F1
     FV-SIFT FV-LBP FV-cSIFT histogram TF-IDF histogram (L2) TF-IDF (L2)
       ✓        ✓         ✓        ✓        -           -           -       0.5798
       ✓        ✓         -        ✓        -           -           -       0.5792
       ✓        ✓         -        -        -           ✓           -       0.5770
       ✓        ✓         ✓        -        -           ✓           -       0.5763
       ✓        ✓         -        -        ✓           -           -       0.5760
       ✓        ✓         ✓        ✓        -           ✓           -       0.5759

Table 3. Top combinations of visual and textual features. L2 means the vector is
normalized according to its L2 norm.


      method                                           MiAP   GMiAP    F1
      FV-SIFT + FV-LBP                                0.3243  0.2590  0.5451
      FV-SIFT + FV-LBP + text                         0.4046  0.3436  0.5559
      FV-SIFT + FV-LBP + text(TF-IDF)                 0.4029  0.3462  0.5597
      FV-SIFT + FV-LBP + FV-C-SIFT + text             0.4136  0.3540  0.5574
      FV-SIFT + FV-LBP + FV-C-SIFT + text(TF-IDF)     0.4131  0.3580  0.5583

Table 4. All five submissions and their scores on the Flickr Photo task. MiAP and
GMiAP stand for (Geometric) Mean interpolated Average Precision.



4.2     Photo Web
For the Photo Web tasks, we could not extract FVs because no images were included
in the provided dataset. Hence we use the provided BoVWs computed from several
kinds of SIFT descriptors.
    Moreover, our PAAPL requires labels for each training sample, so we investi-
gated a simple way to define them. In particular, for each image we extracted from
its surrounding text those words that are concept words. Images whose surrounding
text contains no concept word are simply discarded.

Subtask 1: Improving performance in Flickr concept annotation task
For the Flickr and Web image representations, we made use of the four provided
BoVWs, which were computed from SIFT, C-SIFT, OpponentSIFT, and RGB-SIFT,
respectively. Unlike the Flickr data, the Web data was not annotated. Therefore we
needed to estimate labels for the Web data in order to use it as training data for
supervised learning. To do this, we searched for concepts in the surrounding
texts: if a concept word appears in the surrounding text, we consider that the
corresponding image has that label. A sketch of this step is given below.
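
This is a minimal sketch of the label-estimation step, assuming simple lower-case
word matching; the tokenization and the handling of multi-word concepts are
illustrative assumptions, and the names are not taken from the authors' code.

    import re

    def labels_from_text(surrounding_text, concepts):
        """Return the set of concept words that occur in the surrounding text."""
        words = set(re.findall(r"[a-z]+", surrounding_text.lower()))
        return {c for c in concepts if c.lower() in words}

    # Images whose surrounding text contains no concept word are discarded:
    # dataset = [(img, y) for img, y in ((i, labels_from_text(t, concepts))
    #            for i, t in web_items) if y]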
    To annotate images, we use PAAPL with the regularization parameter chosen from
{10^4, 10^5, 10^6}. As a result, C = 10^6 achieves the best performance in almost
all cases in which a classifier is trained on a different type of descriptor. The
number of training iterations is 25, the same as in the Flickr Photo task.

    We have two ideas on how to utilize Web data to improve annotation perfor-
mance. One idea is to use the Web data and the Flickr data independently. First,
we trained eight classifiers using the four types of BoVWs, from either Web data
or Flickr data. Then, we summed the scores from the eight classifiers when
annotating the test images. To find the best combinations of descriptors, we used
10,000 Flickr images or 10,000 Web images for training, and 5,000 Flickr images
for validation. We computed the F1-measure as a measure of the effectiveness of a
combination. The results are shown in Table 5. The best F1-measure is worse than that


                                              Flickr
                                  SIFT   C-SIFT  O-SIFT  RGB-SIFT
                  Web SIFT       0.2086  0.2195  0.2162   0.2119
                  Web C-SIFT     0.2158  0.2207  0.2220   0.2195
                  Web O-SIFT     0.2190  0.2276  0.2252   0.2227
                  Web RGB-SIFT   0.2009  0.2112  0.2062   0.2031

     Table 5. Results of descriptor combinations (rows: Web descriptor, columns:
     Flickr descriptor). O-SIFT means OpponentSIFT.




obtained by the other idea, so we did not adopt this idea.
    The other idea is to merge the 15,000 Flickr data and the 250,000 Web data and
use the unified 265,000 samples for training. A classifier is trained for each
type of BoVW computed from a different descriptor, so we have four classifiers.
The score for each label is computed by summing the scores from the different
classifiers. The number of ways of combining the classifiers is
$\sum_{i=1}^{4} \binom{4}{i} = 15$. To find the best combination of the four
classifiers, we used 10,000 Flickr images and 10,000 Web images for training, and
5,000 Flickr images for validation. The resulting F1-measures are shown in Table 6.
    Therefore we submitted the following combinations of BoVWs.

1. SIFT + C-SIFT
2. SIFT + C-SIFT + Opponent SIFT
3. SIFT + C-SIFT + RGB-SIFT
4. SIFT + C-SIFT + Opponent SIFT + RGB-SIFT

With the combination of Web and Flickr photos, we obtained MiAP 0.264, GMiAP
0.217, and F1 0.182, whereas we obtained MiAP 0.719, GMiAP 0.689, and F1 0.553
when using only Flickr photos. This means that improving the performance of the
Flickr concept annotation task using Web photos was not successful. Although
improving the performance of Web concept annotation using Flickr photos seems
meaningful, the Web images are too noisy for such a simple way of combining them.


Subtask 2: Scalable concept image annotation To obtain the results, we took
the following three steps.


                     SIFT C-SIFT O-SIFT RGB-SIFT       F1
                       ✓     -       -         -      0.4682
                       -     ✓       -         -      0.4792
                       -     -       ✓         -      0.4666
                       -     -       -         ✓      0.4665
                       ✓     ✓       -         -      0.4824
                       ✓     -       ✓         -      0.4747
                       ✓     -       -         ✓      0.4738
                       -     ✓       ✓         -      0.4807
                       -     ✓       -         ✓      0.4817
                       -     -       ✓         ✓      0.4748
                       ✓     ✓       ✓         -      0.4833
                       ✓     ✓       -         ✓      0.4821
                       ✓     -       ✓         ✓      0.4771
                       -     ✓       ✓         ✓      0.4821
                       ✓     ✓       ✓         ✓      0.4822

     Table 6. Results of combination experiments. O-SIFT means OpponentSIFT.



    First, in order to assign concepts to each image, we compared the concepts
with the raw text extracted near each image. If the raw text contains a concept,
that concept is assigned as one of the concepts of the image.
    Second, using 10,000 training samples randomly sampled from the Web set and
the whole development set as test data, we performed a grid search over the two
parameters of PA, namely C ∈ {10^4, 10^5, 10^6} and the number of iterations
N ∈ {10, 15, 20, 25}. These candidates were selected empirically. As a result,
shown in Fig. 2, we chose C_{C-SIFT} = 10^4, C_{OpponentSIFT} = 10^4,
C_{RGB-SIFT} = 10^4, C_{SIFT} = 10^6, and the iteration number N = 25.
    Finally, using the parameters stated above, we trained the weight vectors µ
corresponding to each feature on all 250k Web data. After training, we examined
all combinations (2^4 = 16) obtained by summing up the dot products between the
weight vectors and the BoVW feature vectors. We assigned to each image of the
development set the three concepts with the highest combined scores. We selected
five runs out of the 16 combinations according to the mean F1-measure on the
development set (shown in Table 7) and used the same five combinations to produce
the results on the test set. A sketch of this scoring step is given below.
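
A minimal sketch of this scoring step follows; the dictionaries feats and
W_by_feature, keyed by BoVW type, are illustrative names and not taken from the
authors' code.

    import numpy as np

    def annotate(feats, W_by_feature, selected, top_k=3):
        """feats / W_by_feature: dicts keyed by feature name ('SIFT', 'C-SIFT', ...).

        selected: the subset of features whose classifier scores are summed.
        Returns the indices of the top_k concepts with the highest combined scores.
        """
        combined = sum(W_by_feature[name] @ feats[name] for name in selected)
        return np.argsort(combined)[::-1][:top_k]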
    Therefore we submitted the following combinations of BoVWs.
1. C-SIFT + O-SIFT
2. C-SIFT + RGB-SIFT
3. O-SIFT + RGB-SIFT
4. C-SIFT + O-SIFT + RGB-SIFT
5. SIFT + C-SIFT + Opponent SIFT + RGB-SIFT
As a result, we obtained MiAP 0.332, GMiAP 0.227, and F1 0.254. These scores
are higher than those we obtained with Web images in Subtask 1.


(Figure 2: mean F1-measure (F1) plotted against the iteration number {10, 15, 20,
25} for each BoVW feature, with one curve per value of C_{C-SIFT}, C_{O-SIFT},
C_{R-SIFT}, and C_{SIFT} in {10^4, 10^5, 10^6}.)

Fig. 2. Results of the grid search on 10k training data randomly selected from the
Web set and test data from the development set, as a function of the iteration
number, in terms of mean F1-measure (F1) for the four BoVW features.



5   Conclusions



In this working note, we have described our method for annotating images in the
ImageCLEF 2012 Photo Annotation and Retrieval tasks. We focused on making our
method scalable to a large number of images. Consequently, we used FVs and BoWs
in the Photo Flickr task, and the provided BoVWs in the Photo Web tasks.
Annotation itself is achieved with PAAPL, a novel online learning algorithm for
the multilabel problem that we proposed in [9].
    For the Photo Flickr task, we achieved the top scores among all teams even
though our system is scalable and simple. Not only the FVs from SIFT variants but
also the FV from LBP is shown to be useful for annotation. Moreover, simple tag
information represented with BoW improves the performance. For the Photo Web
tasks, only a few teams submitted at least one run. In Subtask 1, no team improved
performance with Web data. The result of Subtask 2 also indicates that it is
difficult to extract the proper concepts of Web photos from their Web pages.


                      SIFT C-SIFT O-SIFT RGB-SIFT F1
                         ✓      -        -         -     0.237
                         -      ✓        -         -     0.258
                         -      -       ✓          -     0.251
                         -      -        -        ✓      0.244
                         ✓      ✓        -         -     0.257
                         ✓      -       ✓          -     0.256
                         ✓      -        -        ✓      0.247
                         -      ✓       ✓          -     0.267
                         -      ✓        -        ✓      0.260
                         -      -       ✓         ✓      0.262
                         ✓      ✓       ✓          -     0.256
                         ✓      ✓        -        ✓      0.258
                         ✓      -       ✓         ✓      0.257
                         -      ✓       ✓         ✓      0.266
                         ✓      ✓       ✓         ✓      0.264
Table 7. Results of summing the scores from each combination of BoVW features,
with training data from the full Web set and test data from the full development set.




References
 1. Bottou, L.: Large-Scale Machine Learning with Stochastic Gradient Descent. In:
    COMPSTAT (2010)
 2. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online Passive-
    Aggressive Algorithms. JMLR 7, 551–585 (2006)
 3. Grangier, D., Monay, F., Bengio, S.: A Discriminative Approach for the Retrieval
    of Images from Text Queries. In: ECML (2006)
 4. Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., Cao, L., Huang, T.: Large-
    scale Image Classification: Fast Feature Extraction and SVM Training. In: CVPR
    (2011)
 5. Mensink, T., Csurka, G., Perronnin, F., Sánchez, J., Verbeek, J.: LEAR and XRCE's
    Participation to Visual Concept Detection Task. In: CLEF 2010 working notes (2010)
 6. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher Kernel for Large-
    Scale Image Classification. In: ECCV (2010)
 7. Sánchez, J., Perronnin, F.: High-Dimensional Signature Compression for Large-
    Scale Image Classification. In: CVPR (2011)
 8. Thomee, B., Popescu, A.: Overview of the ImageCLEF 2012 Flickr Photo Annotation
    and Retrieval Task. In: CLEF 2012 working notes (2012)
 9. Ushiku, Y., Harada, T., Kuniyoshi, Y.: Efficient Image Annotation for Automatic
    Sentence Generation. In: ACM MM (2012, accepted)
10. Villegas, M., Paredes, R.: Overview of the ImageCLEF 2012 Scalable Web Image An-
    notation Task. In: CLEF 2012 working notes (2012)
11. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained
    Linear Coding for Image Classification. In: CVPR (2010)
12. Weston, J., Bengio, S., Usunier, N.: WSABIE: Scaling Up To Large Vocabulary
    Image Annotation. In: IJCAI (2011)