<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Renmin University of China at ImageCLEF 2014 Scalable Concept Image Annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xirong Li</string-name>
          <email>xirong@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xixi He</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gang Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qin Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieping Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Computing Lab, Renmin University of China; Key Lab of DEKE, Renmin University of China</institution>
          <addr-line>No. 59 Zhongguancun Street, Beijing 100872</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>380</fpage>
      <lpage>385</lpage>
      <abstract>
        <p>In this paper we describe the image annotation system we developed for the ImageCLEF 2014 scalable concept image annotation task. The system is fully SVM based. Per concept we learn an ensemble of fast intersection kernel SVMs from three sources of training data, all obtained with manual annotation for free. The focus of our experiments this year is to answer the question of how many tags we should use to annotate a novel image. To that end, we introduce adaptive tag selection. In contrast to the common top-k strategy, which selects a fixed number of top-ranked tags to annotate an unlabeled image, our method estimates the value of k with respect to the image. Given the same concept rankings, the top-5 strategy obtains an MF-samples score of 0.206, while adaptive tag selection reaches 0.311.</p>
      </abstract>
      <kwd-group>
        <kwd>Scalable image annotation</kwd>
        <kwd>SVM</kwd>
        <kwd>adaptive tag selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For automated image annotation, how to adaptively determine a proper number
of tags for a specific image is an open problem. The top-k strategy, which simply
selects the top k ranked tags per image, is probably the most popular solution.
A number of systems in ImageCLEF 2013 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have used this strategy, including
our 2013 system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The fact that the number of relevant tags varies over images
makes the top-k strategy suboptimal. Hence, our focus this year is on studying
and evaluating strategies for adaptive tag selection.
      </p>
      <sec id="sec-1-1">
        <title>Visual features</title>
        <p>
          For each image, we extract bag of visual words (BoW) using the color descriptor
software [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. A precomputed codebook of size 4,000 is used to quantize densely
sampled SIFT descriptors. We further consider 1x1+1x3 spatial pyramids,
resulting in a BoW feature of 16,000 dimensions per image.
        </p>
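<p>The dimensionality bookkeeping above (4,000 codewords times 1x1+1x3 pyramid cells = 16,000 dimensions) can be sketched as follows. This is a hypothetical numpy re-implementation of the pooling step only, not the colorDescriptor software the paper actually uses, and it assumes the 1x3 layer splits the image into three horizontal bands.</p>

```python
import numpy as np

def bow_spatial_pyramid(points, words, height, codebook_size=4000):
    """Pool quantized descriptors into a 1x1 + 1x3 spatial-pyramid BoW vector.

    points: (n, 2) array of (x, y) sample locations; words: (n,) int codeword
    ids. Returns 4 * codebook_size dims (16,000 for a 4,000-word codebook).
    Illustrative sketch; band layout is an assumption.
    """
    full = np.bincount(words, minlength=codebook_size)  # 1x1 layer: whole image
    bands = []
    for b in range(3):  # 1x3 layer: three horizontal bands
        lo, hi = b * height / 3.0, (b + 1) * height / 3.0
        in_band = (points[:, 1] >= lo) & (points[:, 1] < hi)
        bands.append(np.bincount(words[in_band], minlength=codebook_size))
    return np.concatenate([full] + bands)
```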
      </sec>
      <sec id="sec-1-2">
        <title>Training data</title>
        <p>
          In addition to the 250K web images from ImageCLEF 2013, we leverage two
additional two sets, both of which are acquired with manual annotation for free.
The first set consists of one million images with user-clicked count, released
by MSR Bing [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The other set consists of four million user-tagged images
from Flickr. Notice that the data collecting process of the two extra sets were
independent of the ImageCLEF dev/test concept lists. So the use of the extra
data does not affect the scalability of our system.
2.3
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Image annotation models</title>
        <p>
          Different from our 2013 system [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which combines kNN and SVM models, the
2014 edition is fully SVM based. For each dev/test tag ω, we learn an ensemble
of two-class SVM classifiers from the three training sets separately.
        </p>
        <p>Positive example selection. As the training sets come from different
sources with different (noisy) annotation information, we describe how to
select positive training examples for ω from the individual sets.</p>
        <p>For the 250K web images, as they were collected from three web image search
engines, namely Google, Yahoo, and Bing, each image x can be described by a
triplet &lt;q, r, s&gt;, where q represents a query tag, r is the rank of x in the search
results of q returned by a specific search engine s. Because a given image might
be retrieved by different queries, or by the same query but with different search
engines, it can be associated with multiple triplets, denoted as &lt;q_i, r_i, s_i&gt;,
i = 1, ..., l, where l is the number of triplets. To estimate the relevance of x
with respect to ω, we propose to compute a search-engine-based score as
relevance_search(x, ω) = Σ_{i=1}^{l} δ(q_i, ω) · w(s_i) / √r_i,
where δ(q_i, ω) returns 1 if q_i and ω are the same, and 0 otherwise. The variable
w(s_i) indicates the weight of a specific search engine, which we empirically set
to 1, 0.5, and 0.5 for Google, Yahoo, and Bing, respectively.</p>
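<p>The scoring rule above translates directly into code. The function below is an illustrative sketch; the triplet list and the lowercase engine names are assumed input conventions.</p>

```python
import math

def relevance_search(triplets, concept, weights=None):
    """Search-engine-based relevance of one image for concept ω.

    triplets: list of (query, rank, engine) tuples describing how the image
    was retrieved. Computes sum_i δ(q_i, ω) · w(s_i) / sqrt(r_i) with the
    paper's empirical engine weights as defaults.
    """
    if weights is None:
        weights = {"google": 1.0, "yahoo": 0.5, "bing": 0.5}
    score = 0.0
    for query, rank, engine in triplets:
        if query == concept:  # δ(q_i, ω): exact query/concept match
            score += weights[engine] / math.sqrt(rank)
    return score
```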
        <p>
          For the user-clicked set, each image is associated with a textual query and
an accumulated user-click count. A larger click count indicates that the image
is more likely to be relevant to the query [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We thus match ω with the queries and
use the corresponding click count as the relevance score.
        </p>
        <p>
          For the Flickr set, we use the semantic-based relevance measurement
described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which computes tag-wise similarity between ω and the user tags of an
image.
        </p>
        <p>Given a concept ω, we sort images in descending order by their relevance
scores w.r.t. ω, and preserve the top 1,000 ranked images as positive training
examples.</p>
        <p>
          SVM training. As the training data is overwhelmed by negative examples,
we learn SVM classifiers by the Negative Bootstrap algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Rather than
sampling negative examples at random, Negative Bootstrap iteratively selects the
negative examples that are most misclassified by the present classifiers, and thus
most relevant for improving classification. Per iteration, the algorithm randomly
samples 10x1,000=10,000 examples to form a candidate set. The ensemble of
classifiers obtained in the previous iterations is used to classify each candidate
example. The top 1,000 most misclassified examples are selected and used
together with the 1,000 positives to train a new classifier. For
efficiency, we use fast intersection kernel SVMs (fikSVM). For each of the three
sets, we conduct Negative Bootstrap with 10 iterations, producing in total 3x10
fikSVMs per concept. These fikSVMs are further compressed into a single model
such that the prediction time complexity is independent of the ensemble size.
        </p>
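<p>A minimal sketch of the Negative Bootstrap loop described above, with a pluggable train_fn standing in for fikSVM training; the fast intersection kernel and the final model compression from [6] are not reproduced here, and all parameter defaults mirror the numbers in the text.</p>

```python
import numpy as np

def negative_bootstrap(pos, neg_pool, train_fn, iterations=10,
                       per_iter=1000, pool_size=10000, seed=42):
    """Sketch of Negative Bootstrap [6]. Each iteration samples a candidate
    pool, keeps the negatives the current ensemble misclassifies most
    (highest positive score), and trains a new classifier on them together
    with the fixed positives. train_fn(X, y) must return a scorer mapping an
    (n, d) array to n decision values (positive = concept present).
    """
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(iterations):
        idx = rng.choice(len(neg_pool), size=min(pool_size, len(neg_pool)),
                         replace=False)
        cand = neg_pool[idx]
        if ensemble:
            # average ensemble score; higher = more misclassified negative
            scores = np.mean([f(cand) for f in ensemble], axis=0)
            hard = cand[np.argsort(-scores)[:per_iter]]
        else:
            hard = cand[:per_iter]  # first round: no ensemble yet
        X = np.vstack([pos, hard])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(hard))])
        ensemble.append(train_fn(X, y))
    return ensemble
```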
        <p>
          As we focus on adaptive tag selection, we do not submit runs to compare
the effectiveness of the three training sets. Nevertheless, our preliminary
observation from the development set is that models trained on the three sets are
complementary to each other to some extent. We therefore combine the
models in a linear late fusion manner. As shown in previous studies [
          <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
          ], weights
optimized by coordinate ascent are consistently better than averaging. So we
continue this good practice, and learn the fusion weights by coordinate ascent
on the development set.
        </p>
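<p>The fusion-weight learning can be sketched as a simple coordinate ascent on dev-set performance. The metric callback stands in for MF-samples, and the step schedule and stopping rule are assumptions, not the paper's exact procedure.</p>

```python
import numpy as np

def coordinate_ascent_weights(scores, metric, step=0.1, iters=5):
    """Learn linear late-fusion weights on a dev set by coordinate ascent.

    scores: list of m score arrays, one per model; metric(fused) returns the
    dev-set figure to maximize. Starts from uniform averaging and greedily
    adjusts one weight at a time. Illustrative sketch only.
    """
    m = len(scores)
    w = np.ones(m) / m  # start from plain averaging
    best = metric(sum(wi * s for wi, s in zip(w, scores)))
    for _ in range(iters):
        for j in range(m):  # optimize one coordinate at a time
            for cand in (w[j] + step, max(0.0, w[j] - step)):
                w_try = w.copy()
                w_try[j] = cand
                val = metric(sum(wi * s for wi, s in zip(w_try, scores)))
                if val > best:
                    best, w = val, w_try
    return w / max(w.sum(), 1e-12), best
```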
        <p>
          Adaptive tag selection. For the 107 dev concepts, we have access to a
ground truth set of 1,000 images provided by ImageCLEF 2014 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We find that
choosing a cutoff threshold that maximizes the F-concept measure per concept is a
good strategy. However, this strategy is inapplicable to novel concepts that have
no ground truth data available. In the 2014 task, there are 100 novel concepts,
as listed in Table 1. We devise a method that selects the top k ranked tags to
annotate an unlabeled image, whilst the value of k is adaptively determined with
respect to the test image. The method is based on the following hypothesis. We
assume that the dev concept vocabulary and the test concept vocabulary are
independent, such that for a given image the ratio of its relevant tags covered
by the dev vocabulary equals the ratio of its relevant tags covered by the
test vocabulary. In that regard, we can estimate the number of test concepts
according to the number of dev concepts that have been selected. We detail
the method elsewhere [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
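<p>Under one plausible reading of this hypothesis (the exact method is deferred to [9]), k can be estimated by transferring the dev-vocabulary selection rate to the test vocabulary. All names below are illustrative.</p>

```python
def estimate_k(dev_scores, dev_thresholds, n_test_concepts):
    """Hypothetical sketch of adaptive k estimation for one test image.

    dev_scores: {dev_concept: score} for this image; dev_thresholds: the
    per-concept cutoffs tuned on the dev ground truth. Counts the dev
    concepts that pass their cutoff, then applies the same selection rate
    to the test vocabulary size.
    """
    k_dev = sum(1 for c, s in dev_scores.items() if s >= dev_thresholds[c])
    rate = k_dev / max(len(dev_scores), 1)
    return max(1, round(rate * n_test_concepts))  # annotate with at least one tag
```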
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Evaluation</title>
      <sec id="sec-2-1">
        <title>Submitted Runs</title>
        <p>This year we submitted eight runs:
– RUC 01: Adaptive tag selection, with the scores of the test concepts updated
with respect to the scores of the selected dev concepts and their cutoff
thresholds (Adaptive-I).
– RUC 02: Adaptive tag selection, with the scores of the test concepts updated
with respect to the scores of the selected dev concepts (Adaptive-II).
– RUC 03: Adaptive tag selection, with the scores of the test concepts updated
with respect to the rank-normalized scores of the selected dev concepts
(Adaptive-III).
– RUC 04: Selecting the top 5 ranked tags as the final annotations.
– RUC 05: The same as RUC 01 except that the outputs of the SVM classifiers
learned from the individual data sources have been converted to probabilistic
outputs before fusion.
– RUC 06: The same as RUC 02 except that the outputs of the SVM classifiers
learned from the individual data sources have been converted to probabilistic
outputs before fusion.
– RUC 07: The same as RUC 03 except that the outputs of the SVM classifiers
learned from the individual data sources have been converted to probabilistic
outputs before fusion.
– RUC 08: The same as RUC 04 except that the outputs of the SVM classifiers
learned from the individual data sources have been converted to probabilistic
outputs before fusion.
The performance scores of the eight runs are summarized in Table 2. Compared
to the top-5 strategy (RUC 04 and RUC 08), all the other runs are better. The
results clearly show that adaptive tag selection outperforms the top-k strategy.</p>
        <p>Table 1. The 107 dev concepts:
aerial airplane baby beach bicycle bird boat book bottle bridge building bus car cartoon
castle cat chair child church cityscape closeup cloud cloudless coast countryside daytime
desert diagram dog drink drum elder embroidery female fire firework fish flower fog food
footwear forest furniture garden grass guitar harbor hat helicopter highway horse indoor
instrument lake lightning logo male monument moon motorcycle mountain newspaper
nighttime outdoor overcast painting park person phone plant portrait poster protest
rain rainbow reflection river road sand sculpture sea shadow sign silhouette sky smoke
snow soil space spectacle sport sun sunset table teenager toy traffic train tree tricycle
truck underwater unpaved vehicle violin wagon water
The 100 novel concepts:
antelope apple arthropod asparagus avocado banana bear berry blood branch bread
broccoli buffalo butterfly camel canidae captive carrot cauliflower cervidae cheese
cheetah chimpanzee corn crocodile cucumber donkey egg eggplant elephant equidae felidae
flamingo fox fried fruit galaxy giraffe gorilla grape hippopotamus human hunting
kangaroo knife koala leaf leopard lettuce lion mammal marsupial meat monkey mud
mushroom nebula onion orange ostrich pan pasta pear penguin pig pineapple pinniped pool
potato pumpkin rabbit raccoon reptile rhino rice rifle roasted rock rodent sausage soup
spider spoon squirrel strawberry submarine tiger tomato trunk tuber turtle vegetable
walrus warthog watermelon wild wolf yam zebra zoo</p>
        <p>As the only difference between RUC 01, RUC 02, RUC 03, and RUC 04 is
in how to determine the value of k, which does not change the concept ranking,
the MAP-samples scores are the same for the four runs. Similarly, RUC 05,
RUC 06, RUC 07, and RUC 08 have the same MAP-samples scores. Comparing
the two groups of runs, we find that score normalization before fusion is helpful
for improving MF-samples and MF-concepts, but causes a performance drop in
MAP-samples.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>This paper documents our experiments in the ImageCLEF 2014 Scalable
Concept Image Annotation task, a testbed for developing a scalable image annotation
system with manual annotation for free. A novel component of our 2014 system
is adaptive tag selection, which determines the number of tags to be selected with
respect to a given test image. Adaptive tag selection is better than selecting a
fixed number of tags for image annotation.</p>
      <p>
        Acknowledgments. This research was supported by the National Science
Foundation of China (No. 61303184), the Fundamental Research Funds for the
Central Universities and the Research Funds of Renmin University of China
(No. 14XNLQ01), the Specialized Research Fund for the Doctoral Program of
Higher Education (No. 20130004120006), and the Scientific Research
Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. The
authors are grateful to the ImageCLEF coordinators for the benchmark
organization efforts [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the imageclef 2013 scalable concept image annotation subtask</article-title>
          .
          <source>In: CLEF 2013 working notes</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Renmin university of china at imageclef 2013 scalable concept image annotation</article-title>
          . In: ImageCLEF working notes. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>32</volume>
          (
          <year>2010</year>
          )
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hua</surname>
            ,
            <given-names>X.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rui</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines</article-title>
          . In: ACM MM. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Harvesting social images for biconcept search</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>14</volume>
          (
          <issue>4</issue>
          ) (Aug.
          <year>2012</year>
          )
          <fpage>1091</fpage>
          -
          <lpage>1104</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koelma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Bootstrapping visual categorization with relevant negatives</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>15</volume>
          (
          <issue>4</issue>
          ) (Jun.
          <year>2013</year>
          )
          <fpage>933</fpage>
          -
          <lpage>945</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fusing concept detection and geo context for visual search</article-title>
          . In: ICMR. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2014 Scalable Concept Image Annotation Task</article-title>
          . In:
          <article-title>CLEF 2014 Evaluation Labs</article-title>
          and Workshop, Online Working Notes. (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adaptive top-k tag selection for image annotation</article-title>
          .
          <article-title>(2014) submitted</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Müller, H.,
          <string-name>
            <surname>Martinez-Gomez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patricia</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Üsküdarlı</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cazorla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Varea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morell</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>ImageCLEF 2014: Overview and analysis of the results</article-title>
          .
          <source>In: CLEF proceedings. Lecture Notes in Computer Science</source>
          . Springer Berlin Heidelberg (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>