Renmin University of China at ImageCLEF 2013 Scalable Concept Image Annotation

Xirong Li1,2, Shuai Liao1, Binbin Liu2, Gang Yang2, Qin Jin2, Jieping Xu2, Xiaoyong Du1

1 Key Lab of Data Engineering and Knowledge Engineering, MOE
2 Multimedia Computing Lab, School of Information, Renmin University of China
Zhongguancun Street 59, Beijing 100872, China
xirong@ruc.edu.cn

Abstract. In this paper we describe the image annotation system with which we participated in the ImageCLEF 2013 scalable concept image annotation task. The system leverages multiple base classifiers, including single-feature and multi-feature kNN classifiers and histogram intersection kernel SVMs, all of which are learned from the provided 250K web images and the provided features, with no extra manual verification. These base classifiers are combined into a stacked model, with the combination weights optimized to maximize the geometric mean of the F-samples, F-concepts, and AP-samples metrics on the provided development set. By varying the configuration of the system, we submitted five runs. Evaluation results show that among all of our runs, model stacking with optimized weights performs best. Our system can annotate diverse Internet images purely based on the visual content, at the following accuracy level: F-samples of 0.290, F-concepts of 0.304, and AP-samples of 0.380. Moreover, a system-to-system comparison reveals that our system and the best submission this year are complementary with respect to their best annotated concepts, suggesting the potential for future improvement.

Keywords: Image annotation, learning from web, stacked model

1 Introduction

Annotating unlabeled images by computers is crucial for organizing and retrieving the ever-growing amounts of images on personal devices and the Internet. Due to the semantic gap, i.e., the lack of coincidence between visual features extracted from the visual data and a user's interpretation of the same data, image auto-annotation is challenging.

In the context of annotating Internet images, the semantic gap becomes even bigger, as a specific concept exhibits significant diversity in its visual appearance. The imagery of a concept is not limited to realistic photographs, but also includes artificial depictions such as posters, drawings, and cartoons, as demonstrated in Fig. 1. To annotate this uncontrolled visual content with a large set of concepts, a promising line of research is to learn from web data, which contains many images but with unreliable annotations [1–6]. In these works, k Nearest Neighbors (kNN) and Support Vector Machines (SVMs) are two popular classifiers, used separately in [1–3] and [4–6]. Given the difficulty of the Scalable Concept Image Annotation 2013 task [7], we believe that a single classifier is inadequate. We therefore develop an image annotation system that combines a number of base classifiers into a stacked model. In our 2013 experiments we submitted five runs, with the purpose of verifying the effectiveness of model stacking.

[Fig. 1. Internet images and their ground truth tags (e.g., airplane, cloud, daytime, bird, cartoon, flower, painting, poster, sky, vehicle, plant) from the development set of the Scalable Concept Image Annotation 2013 task.]

The remainder of the paper is organized as follows. We first describe our image annotation system in Section 2. Then we detail our experiments in Section 3, with conclusions given in Section 4.
2 The RUC Image Annotation System

Our image annotation system consists of multiple base classifiers, which are then combined into a stacked model to make the final predictions. The base classifiers are learned from the 250K Internet images provided by the 2013 task, while the stacked model is optimized on the development set of 1,000 ground-truthed images. A conceptual diagram of the system is given in Fig. 2. Next, we describe the stacked model in Section 2.1, followed by the base classifiers in Section 2.2.

[Fig. 2. A conceptual diagram of the RUC image annotation system participating in the Scalable Concept Image Annotation 2013 task. From the Internet images, relevant examples are selected to construct the base classifiers (single/multi-feature kNN and hikSVMs), which are then combined by model stacking using the development set.]

2.1 Stacked Image Annotation Model

For ease of consistent description, let x be a specific image. For a given concept ω, let g(x, ω) be an image annotation model which produces a relevance score of ω with respect to x. A set of t models is denoted by {g_1(x, ω), ..., g_t(x, ω)}. We define our stacked model as a convex combination of the t models:

    g_Λ(x, ω) = \sum_{i=1}^{t} λ_i · g_i(x, ω),        (1)

where λ_i is the nonnegative weight of the i-th classifier, with \sum_{i=1}^{t} λ_i = 1. The subscript in g_Λ(x, ω) indicates that the stacked model is parameterized by Λ = {λ_i}.

We look for the setting of {λ_i} that maximizes the image annotation performance. The 2013 task specifies three performance metrics [7]: F-measure for images (F-samples), F-measure for concepts (F-concepts), and Average Precision for images (AP-samples), as given in the Appendix. To jointly maximize the three metrics, we take their geometric mean as a combined metric. The weights of Eq. (1) are optimized with respect to this combined metric by a coordinate ascent algorithm.
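The paper does not spell out the details of this optimization, so the following is only a minimal sketch, assuming each base classifier has already been run on the development set to produce a score matrix, and assuming a hypothetical evaluate() callback that computes the geometric mean of the three metrics; the grid, initialization, and number of passes are illustrative choices, not the exact implementation.

```python
import numpy as np

def coordinate_ascent_weights(base_scores, evaluate, n_passes=5,
                              grid=np.linspace(0.0, 1.0, 21)):
    """Optimize the convex-combination weights of Eq. (1) on the development set.

    base_scores : list of t arrays, each of shape (n_dev_images, n_concepts),
                  holding one base classifier's scores g_i(x, w) on the dev set.
    evaluate    : callable mapping a combined score matrix to the geometric mean
                  of F-samples, F-concepts and AP-samples (placeholder here).
    """
    t = len(base_scores)
    weights = np.full(t, 1.0 / t)               # start from the uniform combination

    def combine(w):
        return sum(wi * s for wi, s in zip(w, base_scores))

    best = evaluate(combine(weights))
    for _ in range(n_passes):                   # several sweeps over the weights
        for i in range(t):                      # update one weight at a time
            for v in grid:
                cand = weights.copy()
                cand[i] = v
                if cand.sum() == 0.0:
                    continue
                cand /= cand.sum()              # renormalize to keep a convex combination
                score = evaluate(combine(cand))
                if score > best:
                    best, weights = score, cand
    return weights, best
```

Because the three metrics are not differentiable with respect to the weights, a derivative-free coordinate search of this kind is a natural fit; the grid resolution and the number of passes trade accuracy against the cost of repeatedly evaluating on the development set.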
2.2 Base Classifiers

We choose kNN and SVMs as the two types of base classifiers, for their good performance.

Base classifier I: kNN. Given a test image x, we define the kNN classifier as

    g_knn(x, ω) = \sum_{i=1}^{k} rel(x_i, ω) · w_i,        (2)

where rel(x_i, ω) denotes the relevance score between the i-th neighbor and the concept, and w_i is the neighbor weight. In this work, we instantiate rel(x_i, ω) using the provided textual feature, which is derived from the web page of x_i [7]. For w_i, we choose a Bayesian implementation [3], computing w_i as k^{-1} \sum_{j=i}^{k} j^{-1} to give more importance to closer neighbors.

Base classifier II: SVMs. We choose histogram intersection kernel SVMs (hikSVMs), which are known to be effective for bag-of-visual-words features of medium size [8], such as the 5,000-dim features provided by the task. More importantly, the decision function of a hikSVM can be efficiently computed by a few linear interpolations on a set of precomputed points.

To obtain relevant positive training examples for a given concept from the 250K Internet images, we utilize both the textual feature and the query log generated when collecting the images from web image search engines. We describe the connection of an Internet image x to a web image search engine s by a triplet <q, r, s>, where q represents a query keyword and r is the rank of x in the search results of q returned by s. An image can be associated with multiple triplets. To determine the positiveness of the image x with respect to the given concept ω, we propose to compute a search engine based score as

    positiveness(x, ω) = \sum_{i=1}^{l} sim(q_i, ω) · \frac{w(s_i)}{\sqrt{r_i}},        (3)

where l is the number of triplets associated with the image, and sim(q_i, ω) is a tag-wise similarity measure, computed on the basis of tag co-occurrence in 1.2 million Flickr images [4]. The variable w(s_i) indicates the weight of a specific search engine, which we empirically set to 1, 0.5, and 0.5 for Google, Yahoo, and Bing, respectively. The positiveness score in Eq. (3) is further combined with the given textual feature. For each concept, we sort the images labeled with the concept in descending order by their positiveness scores, and select the top 500 ranked images as positive training examples.

For the acquisition of negative training examples, we consider the following two approaches. The first approach is to sample negative examples at random. Given a specific concept and the 500 selected positive examples, we randomly sample a set of 500 negative examples, making the training data perfectly balanced. We repeat the random sampling 10 times, yielding a set of 10 hikSVMs. The second approach is Negative Bootstrap [9]. Different from random sampling, this approach iteratively selects the negative examples that are most misclassified by the present classifiers, and thus most relevant for improving classification. Per iteration, the approach randomly samples 5,000 examples to form a candidate set. The ensemble of classifiers obtained in the previous iterations is used to classify each candidate example. The top 500 most misclassified examples are selected and used together with the 500 positives to train a new hikSVM. We conduct Negative Bootstrap with 10 iterations, producing 10 hikSVMs for each concept. For efficient classification, we leverage a model compression technique [10] to compress the ensemble into a single classifier such that the prediction time complexity is independent of the ensemble size.
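As an illustration only, a minimal sketch of the Negative Bootstrap loop might look as follows. It is not the actual implementation: the real system uses the efficient hikSVM evaluation of [8] and compresses the ensemble with [10], whereas here the histogram intersection kernel is simply passed to a standard SVM solver as a callable, and all function and variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def hik(X, Y):
    """Histogram intersection kernel matrix: K[i, j] = sum_d min(X[i, d], Y[j, d])."""
    return np.array([np.minimum(Y, x).sum(axis=1) for x in X])

def negative_bootstrap(positives, pool, iterations=10, cand_size=5000, n_neg=500, seed=0):
    """Iteratively pick the most misclassified negatives and train one hikSVM per round."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(iterations):
        idx = rng.choice(len(pool), size=min(cand_size, len(pool)), replace=False)
        candidates = pool[idx]
        if ensemble:
            # Candidates scored highest by the current ensemble are the most
            # misclassified negatives, hence the most informative ones to add.
            scores = np.mean([clf.decision_function(candidates) for clf in ensemble], axis=0)
            negatives = candidates[np.argsort(-scores)[:n_neg]]
        else:
            negatives = candidates[:n_neg]      # first round: plain random negatives
        X = np.vstack([positives, negatives])
        y = np.hstack([np.ones(len(positives)), -np.ones(len(negatives))])
        ensemble.append(SVC(kernel=hik, C=1.0).fit(X, y))
    return ensemble
```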
3 Experiments

3.1 Submitted Runs

This year we submitted five runs, listed as follows. For all runs, we empirically preserve for each image the top six ranked concepts as its final annotation.

Run: RUC Crane. This run uses a kNN classifier with early fused multiple features (Multi-Feat-kNN). We use all 7 provided features, i.e., getlf (256-D), colorhist (576-D), gist (480-D), sift (5000-D), csift (5000-D), rgbsift (5000-D), and opponentsift (5000-D). For each feature, we compute the l1 distance. The distance values of the individual features are z-score normalized and then averaged to obtain the k = 256 nearest neighbors.

Run: RUC Snake. This run combines multiple single-feature kNN (Single-Feat-kNN) and SVM classifiers with uniform weights. In particular, five features (colorhist, gist, csift, rgbsift, and opponentsift) are used separately for Single-Feat-kNN, again with the l1 distance and k = 256. For SVMs, we use the three variants of sift, i.e., csift, rgbsift, and opponentsift, with one classifier per feature for each of the two negative sampling approaches described in Section 2.2. In total, this run employs 5 + 3 + 3 = 11 base classifiers.

Run: RUC Monkey. This run uses the same 11 base classifiers as the Snake run, but with optimized weights. Consequently, the effectiveness of the stacked model can be verified by comparing Monkey with Snake.

Run: RUC Mantis. This run combines the Multi-Feat-kNN, the multiple Single-Feat-kNN, and the 6 variants of SVMs with uniform weights, so the run employs 12 base classifiers in total.

Run: RUC Tiger. This run uses the same 12 base classifiers as the Mantis run, but with optimized weights. Tiger is our primary run.

3.2 Results

The performance scores of the five runs are summarized in Table 1. For all performance metrics, the Tiger run, which combines all the base classifiers using the coordinate ascent optimized weights, is the best. On the test set, this run reaches an F-samples score of 0.290, an F-concepts score of 0.304, and an AP-samples score of 0.380. The Monkey run using optimized weights is better than the Snake run using uniform weights. Moreover, by comparing the performance on the development set and on the test set, we find that the system generalizes well to unseen images and concepts. These results justify the importance of model stacking and weight optimization for image annotation.

Table 1. Settings and performance of our submitted runs. The symbol + indicates that a specific base classifier is used in a specific run. Our primary run, Tiger, which combines all base classifiers with optimized weights, performs best.

                      Base classifiers                          F-samples    F-concepts   AP-samples
Run      Multi-Feat-kNN  Single-Feat-kNN  SVMs   Weights       dev   test   dev   test   dev   test
Crane          +               -            -    N.A.          28.8  25.4   26.6  23.9   36.1  32.4
Snake          -               +            +    uniform       28.8  26.5   30.8  28.5   38.2  35.5
Monkey         -               +            +    optimized     31.0  28.3   32.7  29.6   40.5  37.6
Mantis         +               +            +    uniform       29.8  27.8   31.4  29.2   39.4  36.9
Tiger          +               +            +    optimized     31.6  29.0   33.4  30.4   41.2  38.0

[Fig. 3. A system-to-system comparison, plotting the per-concept F-concepts scores of the RUC Tiger run, the best submission of this year, and the median score over all 58 submissions. The concepts have been sorted in terms of their F-concepts scores. The RUC Tiger run is clearly above the average level. What we find more interesting is that our system and the best submission seem complementary, showing the potential for future improvement.]

Fig. 3 shows a system-to-system comparison, measured by F-concepts. The RUC image annotation system is clearly above the average level. Moreover, when compared with the best submission of this year, we find that the two systems are complementary, as their best annotated concepts differ.
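For reference, the following is a minimal sketch of how the per-image measures defined in the Appendix can be computed; it is not the official evaluation tool. F-concepts uses the same F-measure formula, applied per concept to sets of images instead of per image to sets of tags.

```python
def f_measure(relevant, predicted):
    """F-samples for one test image (or F-concepts for one concept, with image sets)."""
    hits = len(relevant & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def ap_samples(relevant, ranked_tags):
    """AP-samples for one test image; ranked_tags lists the predicted tags, best first."""
    hits, ap = 0, 0.0
    for i, tag in enumerate(ranked_tags, start=1):
        if tag in relevant:
            hits += 1
            ap += hits / i            # r_i / i at every relevant position, cf. Eq. (6)
    return ap / len(relevant) if relevant else 0.0
```

Aggregate scores such as those in Table 1 are then obtained by averaging these per-image (or per-concept) values over the test set.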
4 Discussions and Conclusions

This paper documents our experiments in the ImageCLEF 2013 Scalable Concept Image Annotation task, a testbed for developing image annotation systems using generic web data. We have built such a system, which annotates images purely based on the visual content. It combines multiple base classifiers, i.e., variants of kNN and SVMs, into a stacked model. Among all of our five submitted runs, the stacked model with optimized weights performs best. Through a system-to-system comparison, we find that our system and the best submission this year are complementary with respect to their best annotated concepts. Given that we use relatively simple base classifiers, we consider this finding interesting, as it suggests the potential for future improvement.

Acknowledgments. This research was supported by the Basic Research funds of Renmin University of China from the central government (13XNLF05). The authors are grateful to the ImageCLEF coordinators for their benchmark organization efforts.

Appendix

A1. Performance Measures

F-samples. Given a test image x with its relevant tag set R_x and a predicted tag set P_x, its F-samples score is computed as

    F-samples(x) = \frac{2 \cdot precision(x) \cdot recall(x)}{precision(x) + recall(x)},        (4)

where precision(x) is |R_x ∩ P_x| / |P_x|, and recall(x) is |R_x ∩ P_x| / |R_x|.

F-concepts. Given a test concept ω with its relevant image set R_ω and a set of images P_ω labeled with ω by the annotation system, its F-concepts score is computed as

    F-concepts(ω) = \frac{2 \cdot precision(ω) \cdot recall(ω)}{precision(ω) + recall(ω)},        (5)

where precision(ω) is |R_ω ∩ P_ω| / |P_ω|, and recall(ω) is |R_ω ∩ P_ω| / |R_ω|.

AP-samples. Given a test image x with m tags sorted in descending order by relevance scores, its AP-samples score is computed as

    AP-samples(x) = \frac{1}{|R_x|} \sum_{i=1}^{m} \frac{r_i}{i} \cdot δ(i),        (6)

where r_i is the number of relevant tags among the top i tags, and δ(i) is 1 if the i-th tag is in R_x, and 0 otherwise.

References

1. Wang, X.J., Zhang, L., Li, X., Ma, W.Y.: Annotating images by mining image search results. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11) (Nov. 2008) 1919–1932
2. Li, X., Snoek, C., Worring, M.: Annotating images by harnessing worldwide user-tagged photos. In: ICASSP. (2009)
3. Villegas, M., Paredes, R.: A k-nn approach for scalable image annotation using general web data. In: BigVision. (2012)
4. Li, X., Snoek, C., Worring, M., Smeulders, A.: Harvesting social images for bi-concept search. IEEE Transactions on Multimedia 14(4) (Aug. 2012) 1091–1104
5. Zhu, S., Jiang, Y.G., Ngo, C.W.: Sampling and ontologically pooling web images for visual concept learning. IEEE Transactions on Multimedia 14(4) (Aug. 2012) 1068–1078
6. Kordumova, S., Li, X., Snoek, C.: Evaluating sources and strategies for learning video concepts from social media. In: CBMI. (2013)
7. Villegas, M., Paredes, R., Thomee, B.: Overview of the ImageCLEF 2013 scalable concept image annotation subtask. In: CLEF 2013 working notes. (2013)
8. Maji, S., Berg, A., Malik, J.: Classification using intersection kernel support vector machines is efficient. In: CVPR. (2008)
9. Li, X., Snoek, C., Worring, M., Smeulders, A.: Social negative bootstrapping for visual categorization. In: ICMR. (2011)
10. Li, X., Snoek, C., Worring, M., Koelma, D., Smeulders, A.: Bootstrapping visual categorization with relevant negatives. IEEE Transactions on Multimedia 15(4) (Jun. 2013) 933–945