<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Random Sampling Image to Class Distance for Photo Annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deyuan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bingquan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengjie Sun</string-name>
          <email>cjsun@insun.hit.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaolong Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITNLP Lab, School of Computer Science and Technology Harbin Institute of Technology</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Image classification or annotation has proved difficult for computer algorithms. The Naive-Bayes Nearest Neighbor (NBNN) method was proposed to tackle this problem, and achieved state-of-the-art results on the Caltech-101 and Caltech-256 image databases. Although the method is simple and fast, in real applications it suffers from imbalance in the training datasets. In this paper, we extend the image to class distance to a more general form, and use a random sampling technique to alleviate the imbalance of the training datasets. We apply our method to the ImageCLEF 2010 Photo Annotation task, and the results (INSUNHIT) show that the algorithm is fast and stable. Although it does not achieve state-of-the-art performance, more image features can be used to improve performance, and dimension reduction techniques can be adopted to reduce space and time complexity.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Annotation</kwd>
        <kwd>Nearest Neighbor Classification</kwd>
        <kwd>Random Sampling</kwd>
        <kwd>ImageCLEF Photo Annotation Task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Although human beings recognize the scenes or objects in an image easily,
automatic image classification and annotation is a challenging task for computer
programs. According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], humans can recognize about 30000 categories, while
discriminating even two categories is difficult for computer vision systems[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In the last thirty years researchers have proposed many image descriptors and
learning algorithms to recognize objects and scenes. The image descriptors in
the literature can be roughly classified into five categories: global descriptors[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
that represent the image with global attributes, block based descriptors[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that
represent the image using image blocks, region based features[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that are
generated by image segmentation algorithms, local patch features that represent
images with descriptors found by interest point or blob operators, and
some other features[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] such as the text labels tagged by internet users on
photo sharing communities. Although these image features have been proposed
to tackle specific image recognition tasks, they usually fail to generalize
when applied to new image concepts and datasets.
      </p>
      <p>
        Learning algorithms have been proposed alongside the development of
image descriptors. Various kinds of learning algorithms have been proposed or adopted in
the literature, including non-parametric methods[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], kernel machines[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
generative models[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], multiple instance learning algorithms[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], distance metric learning
frameworks[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and biologically inspired neural networks[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These models have been
applied to many different tasks, and achieved state-of-the-art results.
      </p>
      <p>Although these image recognition systems succeed on several tasks and
benchmark databases, they remain naive for real-world applications. In real-world
image databases, the visual concepts share large intra-class and inter-class
variability, and the relations between concepts are complicated. In addition, the
images used for training are imbalanced, causing some learning algorithms to fail.
Finally, the databases contain images at large scale, so simplicity
and running time need to be considered when designing image categorization
systems.</p>
      <p>
        In this paper we describe our system (INSUNHIT) for the ImageCLEF 2010 Photo
Annotation task. We use the dense SIFT[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] image descriptor, based on the
observation that it performs well in both object categorization and scene classification.
We extend the Naive-Bayes Nearest Neighbor (NBNN)[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] classification method, and
propose the Random Sampling Image to Class Distance (RS-ICD) for image
classification. RS-ICD can deal with imbalanced datasets while preserving the simplicity
of the NBNN method. The ImageCLEF 2010 challenge results show that the method
is very stable and fast.
      </p>
      <p>
        Overview of the ImageCLEF2010 Photo Annotation Dataset
In this section we briefly discuss some difficulties of the ImageCLEF Photo
Annotation dataset. For a detailed overview of the dataset please refer to [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ];
here we emphasize why this dataset is difficult for computer
vision algorithms. The focus of our system is image content, therefore we do not
discuss the difficulties of the text labels and EXIF information of the photos.
      </p>
      <p>First, compared to the benchmark datasets used in the literature, the visual
concepts (annotation keywords) are more abstract and share more variability.
There are 93 visual concepts, including objects (such as "trees" and "flowers"),
scenes (such as "desert", "sky" and so on), and abstract concepts (such as "macro",
"spring" and "summer"). Some of the concepts have similar visual properties
(for example, "child" and "baby"), and the image features cannot discriminate
them perfectly. Choosing image descriptors is therefore difficult.</p>
      <p>Second, the image dataset is very imbalanced. The number of training
images per concept ranges from 12 ("skateboard") to 7484
("Neutral Illumination"). This imbalance makes the concepts harder for the
classification system to train.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>In this section we introduce our learning algorithm in detail. Our algorithm
extends the NBNN method, and the RS-ICD is stable for measuring the distance
of an image to a query concept.</p>
      <sec id="sec-2-1">
        <title>Overview of NBNN</title>
        <p>The NBNN method defines an image to class distance to cope with the large
intra-class variance of each class. The method is based on the Naive-Bayes and
Kernel Density Estimation framework, and the image to class distance is near
optimal when the number of training images is very large.</p>
        <p>Here we only review the NBNN method. NBNN operates on
local patch based image features. For a given class Cj, j = 1, 2, ..., L, and a
query image Q with corresponding image descriptors d1, d2, ..., dn, the distance
between Q and class Cj is defined as follows:</p>
        <p>d(Q, Cj) = sum_{i=1}^{n} NN_Cj(di) (1)</p>
        <p>where NN_Cj(di) is the nearest distance from descriptor di to class Cj:</p>
        <p>NN_Cj(di) = min distance(di, dCj), dCj in Cj (2)</p>
        <p>The classification decision proceeds as:</p>
        <p>Copt = argmin_Cj d(Q, Cj), Cj in C (3)</p>
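        <p>The distance and decision above can be sketched directly in NumPy. This is an illustrative sketch, not the authors' code: the helper names and the use of squared Euclidean distance with exhaustive (rather than approximate) search are assumptions.</p>

```python
import numpy as np

def nbnn_distance(query_descriptors, class_descriptors):
    """Image-to-class distance: sum over query descriptors of the
    nearest-neighbor distance to the pooled descriptors of the class."""
    total = 0.0
    for d in query_descriptors:
        # squared Euclidean distance from d to every descriptor of the class
        dists = np.sum((class_descriptors - d) ** 2, axis=1)
        total += dists.min()
    return total

def nbnn_classify(query_descriptors, classes):
    """classes: dict mapping label -> (m, dim) array of class descriptors.
    Returns the label with the minimum image-to-class distance."""
    return min(classes, key=lambda c: nbnn_distance(query_descriptors, classes[c]))
```

In practice the inner nearest-neighbor search is done with an approximate index rather than the brute-force loop shown here.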
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Image to Class Distance</title>
        <p>Although NBNN is effective for image classification, which outputs a single
label per decision, it cannot be extended directly to image annotation (multi-label),
because the image to class distances of different images are not comparable. In
order to deal with this problem, the distance should be normalized, yielding a
distance dK(Q, Cj) with parameter K. When K = 1, both the distance function
and the decision function are the same as in the NBNN method. Therefore the
distance defined by NBNN is a special case of our Image to Class Distance (ICD).
Most importantly, ICD is comparable between images, and the distance can be
used for multi-label classification.</p>
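        <p>The normalization equation does not survive in this copy of the paper. A plausible form, assuming dK averages the K nearest neighbor distances of each descriptor and normalizes by the number of descriptors n (so that K = 1 reproduces the NBNN ranking), is:</p>

```latex
d_K(Q, C_j) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K} NN_{C_j}^{k}(d_i)
```

        <p>where NN^k_Cj(di) denotes the k-th smallest distance from descriptor di to the descriptors of class Cj. For K = 1 this equals d(Q, Cj)/n, a monotone rescaling of the NBNN distance for a fixed query, so the argmin decision is unchanged.</p>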
      </sec>
      <sec id="sec-2-3">
        <title>Random Sampling Generalized NBNN</title>
        <p>The Image to Class Distance suffers from imbalanced training datasets. We
use the random sampling technique to tackle this problem. Our algorithm is
described as follows:</p>
        <p>The Random Sampling Image to Class Distance Based Annotation
Begin
  For t = 1, 2, ..., T
    Randomly sample L images from each class Ci, denoted Ci(t)
  End
  Extract the descriptors of query image Q
  For t = 1, 2, ..., T
    Compute dK(Q, Ci(t)) for each class
  End
  The image to class distance dK(Q, Ci) = average over t of dK(Q, Ci(t))
  Compute the probability p(Q, Ci) = exp(-a * dK(Q, Ci))
End</p>
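        <p>The loop above can be sketched as follows. This is a minimal illustration, not the submitted system: the function name, the choice of Euclidean distance, the brute-force K-nearest-neighbor search, and the handling of classes with fewer than L images are all assumptions.</p>

```python
import numpy as np

def rs_icd(query_desc, class_images, T=10, L=12, K=1, a=1.0, rng=None):
    """Random Sampling Image to Class Distance (sketch).

    query_desc:   (n, dim) descriptors of the query image
    class_images: list of (m_i, dim) descriptor arrays, one per training image
    Returns the annotation probability p(Q, Ci) = exp(-a * dK(Q, Ci)).
    """
    if rng is None:
        rng = np.random.default_rng()
    samples = []
    for _ in range(T):
        # randomly sample L training images and pool their descriptors
        idx = rng.choice(len(class_images), size=min(L, len(class_images)),
                         replace=False)
        pool = np.vstack([class_images[i] for i in idx])
        # Euclidean distance from every query descriptor to every pooled one
        dists = np.sqrt(((pool[None, :, :] - query_desc[:, None, :]) ** 2).sum(-1))
        # average of the K nearest distances per descriptor, then over descriptors
        knn = np.sort(dists, axis=1)[:, :K]
        samples.append(knn.mean())
    d_k = np.mean(samples)      # average over the T random samples
    return np.exp(-a * d_k)     # probability for this concept
```

Because each of the T rounds sees the same number L of images per class, a concept with 7484 training images no longer dominates one with 12.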
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Challenge Results</title>
      <sec id="sec-3-1">
        <title>Experimental Setup</title>
        <p>Here we describe the experimental setup of our algorithm. The images are
converted to grayscale, and resized to 300 pixels (keeping the aspect ratio) if
the image's height or width is larger than 300 pixels. 128-bin dense SIFT image
features are extracted with a step of 8 pixels. The parameter T of the random
sampling process is set to 10.</p>
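        <p>The preprocessing step can be sketched in NumPy. The luma weights and nearest-neighbor resampling are assumptions for illustration; the original pipeline's exact conversion and interpolation are not specified.</p>

```python
import numpy as np

def preprocess(rgb, max_side=300):
    """Grayscale conversion and aspect-preserving downscale (nearest neighbor).
    rgb: (h, w, 3) uint8 array. Images already within max_side are untouched."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])  # standard luma weights
    h, w = gray.shape
    if max(h, w) > max_side:
        scale = max_side / max(h, w)
        nh, nw = round(h * scale), round(w * scale)
        # nearest-neighbor index maps for rows and columns
        rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
        cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
        gray = gray[rows][:, cols]
    return gray
```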
        <p>
          We run the algorithm on a computer with an Intel Core 2 Duo Q9400 CPU, 4 GB
memory, and a 32-bit Linux operating system. The SIFT extraction Matlab program is
provided by Lazebnik[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and the classification algorithm is coded in Python with the
NumPy toolkit. For approximate nearest neighbor search, FLANN[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] with
randomized KD-trees is used through its Python interface.
        </p>
        <p>To evaluate the performance of the system, Mean Average Precision (MAP)
is used as the main evaluation measure.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>
          In this section we discuss the results of our system (INSUNHIT). All results
are shown on the challenge website[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. We submitted five runs: the best MAP result is 23.71%,
and the worst is 22.51%. The detailed setups and results are shown
in Table 1.
        </p>
        <p>Our results ranked near the middle of the overall results. We used different
setups and achieved similar results, which indicates that our algorithm is very stable.
Classifying the whole test image dataset takes about 12 hours. Although the
performance of our algorithm is far behind the state of the art, further
improvements can easily be obtained. First, the 128-bin SIFT descriptor is
large; for image classification we can try other image descriptors or
dimension reduction techniques. Second, multiple image descriptors should be
used to improve the annotation results. In recent years, most of the promising
results on benchmark image datasets have been obtained by combining multiple image
features or multiple classifiers computed on these features.
Acknowledgments. This investigation was supported by the project of the
National Natural Science Foundation of China (grant No. 60973076), the Special
Fund Projects for Harbin Science and Technology Innovation Talents (grant No.
2010RFXXG003) and Microsoft Fund Project HIT.KLOF.2009021.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.K.</given-names>
            <surname>Logothetis</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.L.</given-names>
            <surname>Sheinberg</surname>
          </string-name>
          :
          <article-title>Visual object recognition</article-title>
          ,
          <source>Annual Review of Neuroscience</source>
          , vol.
          <volume>19</volume>
          ,
          <year>1996</year>
          , pp.
          <fpage>577</fpage>
          -
          <lpage>621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>N.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.D.</given-names>
            <surname>Cox</surname>
          </string-name>
          , and
          <string-name>
            <surname>J.J.</surname>
          </string-name>
          <article-title>DiCarlo: Why is Real-World Visual Object Recognition Hard?</article-title>
          ,
          <source>PLoS Computational Biology</source>
          , vol.
          <volume>4</volume>
          ,
          <string-name>
            <surname>Jan</surname>
          </string-name>
          .
          <year>2008</year>
          , p.
          <fpage>e27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Swain</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.H.</given-names>
            <surname>Ballard</surname>
          </string-name>
          :
          <article-title>Color indexing</article-title>
          ,
          <source>Int. J. Comput. Vision</source>
          , vol.
          <volume>7</volume>
          ,
          <issue>1991</issue>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona: A Bayesian Hierarchical</surname>
          </string-name>
          <article-title>Model for Learning Natural Scene Categories</article-title>
          ,
          <source>Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          . Vol.
          <volume>2</volume>
          , IEEE Computer Society,
          <year>2005</year>
          , pp.
          <fpage>524</fpage>
          -
          <lpage>531</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          and Y. Han:
          <article-title>Incorporating multiple SVMs for automatic image annotation</article-title>
          ,
          <source>Pattern Recognition</source>
          , vol.
          <volume>40</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>728</fpage>
          -
          <lpage>741</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          <article-title>: Image Categorization by Learning and Reasoning with Regions</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          , vol.
          <volume>5</volume>
          ,
          <issue>2004</issue>
          , pp.
          <fpage>913</fpage>
          -
          <lpage>939</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>O.</given-names>
            <surname>Boiman</surname>
          </string-name>
          , E. Shechtman, and
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Irani: In defense of Nearest-Neighbor based image classification</article-title>
          .
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Wang: Web 2.0 dictionary, Proceedings of the 2008 international conference on Content-based image and video retrieval, Niagara Falls</article-title>
          , Canada: ACM,
          <year>2008</year>
          , pp.
          <fpage>591</fpage>
          -
          <lpage>600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haffner</surname>
          </string-name>
          , and V. Vapnik:
          <article-title>Support vector machines for histogram-based image classification</article-title>
          ,
          <source>Neural Networks, IEEE Transactions on</source>
          , vol.
          <volume>10</volume>
          ,
          <year>1999</year>
          , pp.
          <fpage>1055</fpage>
          -
          <lpage>1064</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ponce</surname>
          </string-name>
          <article-title>: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories</article-title>
          ,
          <source>IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          <article-title>: Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          , vol.
          <volume>42</volume>
          ,
          <string-name>
            <surname>May</surname>
          </string-name>
          .
          <year>2001</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.</given-names>
            <surname>Frome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          :
          <article-title>Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification</article-title>
          ,
          <source>IEEE 11th International Conference on Computer Vision</source>
          ,
          <year>2007</year>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>M. Ranzato</surname>
            , Fu Jie Huang,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Boureau</surname>
          </string-name>
          , and
          <article-title>Yann LeCun: Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition</article-title>
          ,
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2007</year>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <source>IMAGECLEF Visual Concept Detection and Annotation Task</source>
          <year>2010</year>
          , http://www. imageclef.org/2010/PhotoAnnotationMAPResults
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Huiskes</surname>
          </string-name>
          :
          <article-title>New Strategies for Image Annotation: Overview of the Photo Annotation Task at ImageCLEF 2010</article-title>
          <source>In the Working Notes of CLEF</source>
          <year>2010</year>
          , Padova, Italy,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Spatial Pyramid Matching Code: http://www.cs.unc.edu/~lazebnik/research/ spatial_pyramid_code.zip</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. FLANN Source Code: http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>