<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BUAA AUDR at ImageCLEF 2012 Photo Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lei Huang</string-name>
          <email>huanglei@nlsde.buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Liu</string-name>
          <email>liuyang@nlsde.buaa.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>State Key Laboratory of Software Development Environment, Beihang University</institution>
          ,
          <addr-line>100191 Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the participation of the BUAA AUDR group in the Photo Annotation and Retrieval task at ImageCLEF 2012. Flickr photos served as the data set for visual concept detection, annotation and retrieval. For this task, we propose multi-modal approaches that consider both visual information and Flickr user tag information. We present our visual-based and tag-based photo annotation methods, and also propose an Annotation Refining Algorithm (ARA), which attempts to use the relations between concepts to improve the annotation result. This was our first participation in the Photo Annotation and Retrieval task. We submitted three runs in total, and the purely visual submission, which used the global visual features and BoW features, performed best among them.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Photo Annotation</kwd>
        <kwd>Flickr</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents the first participation of the BUAA AUDR group in the ImageCLEF
photo annotation and retrieval task.</p>
      <p>ImageCLEF 2012 includes four types of tasks: medical image retrieval, photo
annotation, plant identification and robot vision. In the photo annotation task, the aim
is to analyze a collection of Flickr photos in terms of their visual or textual features in
order to detect the presence of one or more concepts. The detected concepts can then
be used for the purpose of automatically annotating the images or for retrieving the
best matching images to a given concept-oriented query. This task provides 15000
images for training and requires the annotation of 10000 images in the provided test
corpus according to the 94 pre-defined categories.</p>
      <p>We proposed multi-modal approaches that considered visual information and
Flickr user tag information. We presented our visual-based and tag-based photo
annotation methods, and also attempted to make use of the relation between concepts
to improve the annotation result.</p>
      <p>The remainder of this paper is organized as follows. Section 2 describes our
approaches in detail, section 3 discusses our submitted runs, and section 4
concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>Approaches</title>
      <p>For the visual concept annotation task, we propose multi-modal approaches that
consider both visual information and Flickr user tag information. This section presents
our visual-based and tag-based photo annotation methods.</p>
      <p>[Figure: processing pipeline. Train/test images and their tags feed three feature extractors: local visual features (1. extract SIFT, 2. generate codebook, 3. quantize into BoW features), global visual features (RGB, FCTH, CEDD) and textual features (1. filter tags, 2. count tag frequency, 3. generate text vector). Each feature type is passed to its own SVM classifier.]</p>
      <sec id="sec-2-4">
        <title>Refine</title>
        <p>
          Recently, the Bag-of-Words (BoW) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] model has become very popular in image
recognition and retrieval. In this model, the extracted key points are quantized into
visual words, and an image is represented as a frequency histogram of these words.
We followed the BoW model and adopted the implementation described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for
extracting local features. The Harris-Laplace detector was used to find interest points, and
SIFT descriptors were extracted at these points. The descriptors were then quantized into visual words. To
form the codebook, we randomly selected approximately 1.5 million descriptors from
all descriptors extracted from the training images for clustering. We used the k-means
clustering method to group these descriptors into K clusters (K = 200, 1000, 2000).
The codebook was formed from the K cluster centers.
        </p>
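        <p>As a rough illustration of the BoW representation described above, the following Python sketch assigns each descriptor to its nearest codebook center (hard assignment) and builds a normalized word-frequency histogram. The function names, the two-dimensional toy descriptors and the three-word codebook are assumptions made for this example only; the system described in this paper used SIFT descriptors and k-means codebooks with K up to 2000.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the codebook center closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def bow_histogram(descriptors, codebook):
    """Normalized frequency histogram of visual words for one image."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(codebook)
    return [counts[k] / total for k in range(len(codebook))]


# Toy data (hypothetical): 2-D "descriptors" and a codebook of K = 3 centers.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]
histogram = bow_histogram(descriptors, codebook)
```
]]></preformat>
        <p>In practice the codebook would come from clustering a large sample of training descriptors, and the resulting histogram would serve as the input vector of a classifier.</p>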
        <p>
          Soft assignment was used to form the feature vector. We used the mapping from
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Let V be the visual vocabulary, v<sub>d</sub> the visual word corresponding to
dimension d, and l a local feature. Then the BoW mapping m(l) is
defined as:
          <disp-formula><tex-math><![CDATA[m(l) = \begin{cases} \ldots \\ 0 \end{cases}]]></tex-math></disp-formula>
        </p>
        <p>Since image metadata were provided for the annotation task, text-based
image annotation became possible. Compared with the visual features of
images, metadata is usually more semantic, so it can be used to identify
many abstract concepts that are difficult to detect with visual features.</p>
        <p>For the experiments presented in this paper, we implemented a tag-based image annotation method,
which uses the user-supplied tags in the image metadata to determine whether
specific concepts are present, and used it alongside the visual-feature-based annotation
methods. It is a rather simple algorithm, and its procedure is as follows:</p>
        <p>1. Extract the unique tags from the training dataset. Since tags are attached by users,
they follow no common rules, so even tags that share the same meaning may
appear in different forms in the same dataset. As a result, the number of unique tags can be
intolerably large, which makes it difficult to generate a codebook from all of them.</p>
        <p>2. Select the tags with high frequency to form a small tag set. In this experiment,
the 2400 most frequently used tags were selected from the extracted tag set.</p>
        <p>3. Represent each image by a tag-based feature vector.</p>
        <p>4. Train an SVM classifier for each concept on the tag-based feature vectors of the
images.</p>
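        <p>Steps 1 to 3 above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the lower-casing of tags, counting each tag once per image, the binary presence encoding and all function names are assumptions, and the SVM training of step 4 is omitted.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def normalise(tag):
    """Assumed normalisation: trim whitespace and lower-case the tag."""
    return tag.strip().lower()


def build_tag_vocabulary(tag_lists, vocab_size):
    """Return the vocab_size most frequent tags over all training images."""
    counts = Counter()
    for tags in tag_lists:
        # Count each tag at most once per image, keeping first-seen order.
        unique = list(dict.fromkeys(normalise(t) for t in tags))
        counts.update(unique)
    return [tag for tag, _ in counts.most_common(vocab_size)]


def tag_vector(tags, vocabulary):
    """Binary vector: 1 if the image carries the vocabulary tag, else 0."""
    present = {normalise(t) for t in tags}
    return [1 if tag in present else 0 for tag in vocabulary]


# Toy data (hypothetical); the paper keeps the 2400 most frequent tags.
training_tags = [
    ["Night", "moon", "sky"],
    ["night", "city"],
    ["sky", "sun", "beach"],
]
vocabulary = build_tag_vocabulary(training_tags, vocab_size=2)
vector = tag_vector(["NIGHT", "beach"], vocabulary)
```
]]></preformat>
        <p>Each training image would be converted with tag_vector in the same way, and the resulting vectors, together with the concept labels, would form the training set of the per-concept classifiers.</p>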
        <p>Through the above steps, SVM classifiers are obtained, and they can be used to
identify the concepts in the annotation task just like any other kind of classifier.</p>
        <p>2.3 Annotation Refining Algorithm</p>
        <p>The Annotation Refining Algorithm (ARA) runs after all of the preceding annotation steps. Its
input is a complete annotation result that would otherwise be submitted directly. In other words,
annotation refining is an extra step applied after the ordinary annotation process in order to
improve the annotation result.</p>
        <p>Concepts are not fully independent of one another: for example, “night”
usually co-occurs with “moon” and hardly ever with “sun”. If each concept in a
picture is identified separately, the information carried by these relations is
lost. The annotation refining algorithm presented here attempts to use
this usually ignored information to improve the annotation result.</p>
        <p>In short, annotation refining is the minimization of an evaluation
function. In order to learn the parameters of this evaluation function for a specific
dataset, ARA needs a training process. The training process and the evaluation function
form the main part of ARA.</p>
        <p>The ARA training process consists of solving a series of Linear Programming (LP)
problems of the form
          <disp-formula><label>(2.3)</label><tex-math><![CDATA[x_j = b_j^{T} x + a_j .]]></tex-math></disp-formula>
        </p>
        <p>In equation (2.3), x is a vector whose jth element x<sub>j</sub> represents the
likelihood that the jth concept is present in the given picture. In this experiment,
x<sub>j</sub> is the score of the jth concept in the annotations_raw dataset. For training, x is given by
the training dataset. b<sub>j</sub> and a<sub>j</sub> are the parameters to be learned; they are simply
the parameters of a general LP problem. By solving this problem over the whole training
dataset, a pair of parameters is obtained for each concept.</p>
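        <p>The minimization problem needed to be solved in ARA can be presented as below:
          <disp-formula><label>(2.4)</label><tex-math><![CDATA[
\begin{aligned}
\min_{x}\quad & f(x) = a^{T}(B^{T} - I)x + \frac{1}{2}\, x^{T}(B - I)(B^{T} - I)x + \frac{1}{2}\,\alpha\, x^{T}x - \alpha\, {x'}^{T}x \\
\text{s.t.}\quad & x_i \in [-1, 1], \quad i = 1, 2, \ldots, n
\end{aligned}
]]></tex-math></disp-formula>
        </p>
        <p>In the above problem, a and B are the parameters learned in the training stage:
          <disp-formula><label>(2.5)</label><tex-math><![CDATA[
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \qquad B = [\, b_1, b_2, \ldots, b_n \,]
]]></tex-math></disp-formula>
        </p>
        <p>x’ is the input annotation result. α is a tuning parameter used to balance the
importance of the previous annotation process and the refining, i.e. the refining result would
be closer to x’ with a bigger α, and vice versa.</p>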
        <p>This is a convex optimization problem and has only one solution. The refined
annotation result is obtained as the solution x*.</p>
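        <p>The refining step can be sketched with a simple projected gradient method. The sketch assumes that the objective has the form written in the code comments, that is, a squared residual of the learned linear relations plus a term that keeps x close to the input x’, restricted to the box [-1, 1]. The choice of solver, the step size rule, the function names and the two-concept toy parameters are assumptions for illustration; the paper does not state which solver was used.</p>
        <preformat><![CDATA[
```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def residual(x, b, a):
    """r_j = b_j^T x + a_j - x_j, i.e. r = (B^T - I) x + a."""
    return [dot(b[j], x) + a[j] - x[j] for j in range(len(x))]


def objective(x, x_prime, b, a, alpha):
    """f(x) = a^T M x + 0.5 x^T M^T M x + 0.5 alpha x^T x - alpha x'^T x,
    with M = B^T - I (assumed form of the ARA objective)."""
    r = residual(x, b, a)
    m_x = [r[j] - a[j] for j in range(len(x))]  # M x
    return (dot(a, m_x) + 0.5 * dot(m_x, m_x)
            + 0.5 * alpha * dot(x, x) - alpha * dot(x_prime, x))


def gradient(x, x_prime, b, a, alpha):
    """Gradient of f: M^T r + alpha * (x - x')."""
    n = len(x)
    r = residual(x, b, a)
    return [sum(b[j][i] * r[j] for j in range(n)) - r[i]
            + alpha * (x[i] - x_prime[i]) for i in range(n)]


def clip(value):
    """Project a scalar onto the interval [-1, 1]."""
    return max(-1.0, min(1.0, value))


def refine(x_prime, b, a, alpha, iterations=500):
    """Projected gradient descent on f over the box [-1, 1]^n."""
    n = len(x_prime)
    # The squared Frobenius norm of M bounds its largest squared singular
    # value, so 1 / (frob + alpha) is a conservative step size.
    frob = sum((b[j][i] - (1.0 if i == j else 0.0)) ** 2
               for j in range(n) for i in range(n))
    step = 1.0 / (frob + alpha)
    x = [clip(v) for v in x_prime]
    for _ in range(iterations):
        g = gradient(x, x_prime, b, a, alpha)
        x = [clip(xi - step * gi) for xi, gi in zip(x, g)]
    return x


# Toy parameters (hypothetical): concept 0 = "night", concept 1 = "moon",
# each predicted from the other with weight 0.8 and zero offset.
b = [[0.0, 0.8], [0.8, 0.0]]
a = [0.0, 0.0]
x_prime = [0.9, -0.2]
alpha = 1.0
refined = refine(x_prime, b, a, alpha)
```
]]></preformat>
        <p>In this toy case the low “moon” score is pulled toward the high “night” score, while a larger alpha would keep the refined result closer to the input x’.</p>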
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>We submitted three different runs. The purely visual submission (BUAA_AUDR_1 in
Table 1) used the global visual features and the BoW features. For each
concept, we selected the best classifier from a set of candidate classifiers according to the MAP
values obtained by 5-fold cross-validation on the training data. BUAA_AUDR_1
ranked 52nd out of 80 submissions, with a MAP value of 0.142. This value is
0.086 lower than the median value for these runs, 0.228.</p>
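      <p>The per-concept classifier selection can be illustrated with the sketch below, which computes the common non-interpolated average precision (AP) for one concept and picks the candidate with the highest value. It simplifies the procedure to a single validation split instead of 5-fold cross-validation, and the candidate names and scores are hypothetical; whether the official evaluation used exactly this AP variant is an assumption.</p>
      <preformat><![CDATA[
```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def select_best_classifier(candidate_scores, labels):
    """Name of the candidate whose validation scores give the highest AP."""
    return max(candidate_scores,
               key=lambda name: average_precision(candidate_scores[name], labels))


# Hypothetical validation scores of two candidates for a single concept.
labels = [1, 0, 1, 0]
candidate_scores = {
    "bow_k1000": [0.9, 0.8, 0.7, 0.1],
    "global_cedd": [0.2, 0.9, 0.4, 0.6],
}
best = select_best_classifier(candidate_scores, labels)
bow_ap = average_precision(candidate_scores["bow_k1000"], labels)
```
]]></preformat>
      <p>Averaging the AP of the selected classifiers over all concepts gives a MAP value of the kind reported for the runs.</p>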
      <p>The purely text submission (BUAA_AUDR_2) ranked 76th. Although this result
was poor, the tag-based method had outperformed the visual method on the validation set that we
used to compare the methods. This suggests that the tag-based photo annotation method was
not robust.</p>
      <p>BUAA_AUDR_3 was a multi-modal approach with ARA. However,
it did not obtain better results than BUAA_AUDR_1, for two reasons:</p>
      <p>1. We used a linear integration, which may have led to a worse result than the purely
visual submission because our purely text submission performed poorly and was not
robust.</p>
      <p>2. ARA has the shortcoming that poorly recognized concepts can degrade the
classification rates of better-performing classes.</p>
      <p>This article describes the approaches and results of the BUAA AUDR group in the
ImageCLEF 2012 photo annotation task. We submitted three runs in total, and their results
were not competitive among all the submitted runs. As this was our first
participation in this task, we will investigate our methods further to identify their weaknesses and improve
their performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1．
          <string-name>
            <surname>Leung</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Representing and recognizing the visual appearance of materials using three-dimensional textons</article-title>
          .
          <source>IJCV</source>
          <volume>43</volume>
          (
          <year>2001</year>
          )
          <fpage>29</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2．
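          <string-name>
            <given-names>Koen E. A.</given-names>
            <surname>van de Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Theo</given-names>
            <surname>Gevers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Cees G. M.</given-names>
            <surname>Snoek</surname>
          </string-name>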
          ,
          <article-title>Evaluating Color Descriptors for Object and Scene Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          volume
          <volume>32</volume>
          (
          <issue>9</issue>
          ), pages
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3．
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Binder</surname>
          </string-name>
          , Wojciech Samek,
          <string-name>
            <given-names>Marius</given-names>
            <surname>Kloft</surname>
          </string-name>
          .
          <article-title>The joint submission of the TU Berlin and Fraunhofer FIRST(TUBFI) to the ImageCLEF2011 Photo Annotation Task</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4．
          <string-name>
            <given-names>Chih-Chung</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: a library for support vector machines</article-title>
          ,
          <year>2001</year>
          . Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>