<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LSIS Scaled Photo Annotations: Discriminant Features SVM versus Visual Dictionary based on Image Frequency</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhong-Qiu ZHAO</string-name>
          <email>zhongqiuzhao@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Herve GLOTIN</string-name>
          <email>glotin@univ-tln.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilie DUMONT</string-name>
          <email>emilie.r.dumont@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database
Management]: Cross-Language Retrieval in Image Collections (ImageCLEF)|ImageCLEFphoto</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LDA, Visual Dictionary, Generalized Descriptor of Fourier</institution>
          ,
          <addr-line>Profile Entropy Feature, SVM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer &amp; Information, Hefei University of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Sud Toulon-Var</institution>
          ,
          <addr-line>USTV</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we use only visual information to address the ImageCLEF2009 Photo Annotation Task. First, we extract various visual features: HSV and EDGE histograms, Gabor, and the recent Descriptor of Fourier and Profile Entropy Features. Then, for each concept and feature, we compute a Linear Discriminant Analysis (LDA) to reduce the impact of high dimensionality. Finally, we train support vector machines (SVMs), whose outputs are taken as the confidences with which the samples belong to the concept. We also propose a second model, an improved version of a Visual Dictionary (VD), which is built from visual words extracted for frequency templates in the training set. We describe the results of these two models topic by topic, and we give perspectives for our VD method, which is faster than SVM and better than SVM for some topics. Our SVM(LDA) run attains an AUC score of 0.721 and ranks 8th in AUC among the 19 teams involved in this campaign, while our VD model would rank 10th.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        We use the various features described in the next section: PEF [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], HSV and EDGE histograms,
a new Descriptor of Fourier [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and Gabor. Then we use LDA to reduce their dimension. Finally,
we use the Least Squares support vector machine (LS-SVM) to produce concept similarity. Another
original method, called Visual Dictionary, is proposed and implemented in Section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Visual Features</title>
      <p>
        We use a new feature, the pixel 'profile' entropy (PEF) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which gives the entropy of pixel profiles
in the horizontal and vertical directions. The advantage of PEF is that it combines raw shape and texture
representations in a feature with a low CPU cost, and it has already given good performance (second best
rank in the official ImagEval 2006 campaign (see www.imageval.org)).
      </p>
      <p>
        Here we use the extended PEF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which uses the harmonic mean of the pixels of each row or column.
The idea is that the object or pixel region distribution, which is lost in the arithmetic mean projection,
could be partly represented by the harmonic mean. These two projections are then expected to
give complementary and/or concept-dependent information. PEF are computed on three equal
horizontal and vertical image slices, yielding a total of 150 dimensions.
      </p>
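      <p>As a rough illustration of the idea only, and not the exact PEF definition of [2, 3], the following Python sketch computes the entropy of the arithmetic-mean and harmonic-mean profiles of a grayscale image along both directions. The number of histogram bins, the small constant guarding against zero pixels, the assumption that pixel values lie in [0, 1], and all function names are choices made for this example.</p>
      <preformat><![CDATA[
```python
import numpy as np

def profile_entropy(profile, bins=16):
    """Shannon entropy (in bits) of the histogram of a 1-D profile in [0, 1]."""
    profile = np.clip(profile, 0.0, 1.0)
    hist, _ = np.histogram(profile, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def harmonic_mean(a, axis, eps=1e-6):
    """Harmonic mean along one axis; eps (an assumption) avoids division by zero."""
    return a.shape[axis] / (1.0 / (a + eps)).sum(axis=axis)

def pef_sketch(img, bins=16):
    """Entropies of the two projections along each image direction (4 values)."""
    img = np.asarray(img, dtype=float)
    feats = []
    for axis in (0, 1):  # axis=0 collapses rows, axis=1 collapses columns
        feats.append(profile_entropy(img.mean(axis=axis), bins))
        feats.append(profile_entropy(harmonic_mean(img, axis), bins))
    return np.array(feats)
```
]]></preformat>
      <p>Applying such a function to each image slice and concatenating the outputs would give a longer descriptor; the 150-dimensional layout used in our runs is not reproduced by this sketch.</p>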
      <p>
        We also use classical features: HSV and EDGE histograms, and Gabor, as well as the recent Descriptor
of Fourier, which is robust to rotation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We train our two models on these features, which represent a total
of 400 dimensions. We use LDA to reduce the feature dimensions, as described in the next section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Linear Discriminant Analysis</title>
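      <p>
        In general, the LDA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is used to find an optimal subspace for classification in which the ratio of
the between-class scatter to the within-class scatter is maximized. Let the between-class scatter
matrix be defined as
      </p>
      <disp-formula><tex-math>S_B = \sum_{i=1}^{c} n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^T</tex-math></disp-formula>
      <p>and the within-class scatter matrix be defined as</p>
      <disp-formula><tex-math>S_W = \sum_{i=1}^{c} \sum_{X_k \in C_i} (X_k - \bar{X}_i)(X_k - \bar{X}_i)^T</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>\bar{X} = (\sum_{j=1}^{n} X_j)/n</tex-math></inline-formula> is the mean image of the ensemble, <inline-formula><tex-math>\bar{X}_i = (\sum_{j=1}^{n_i} X_j^i)/n_i</tex-math></inline-formula> is the mean
image of the ith class, <inline-formula><tex-math>n_i</tex-math></inline-formula> is the number of samples in the ith class, <inline-formula><tex-math>c</tex-math></inline-formula> the number of classes, and
<inline-formula><tex-math>C_i</tex-math></inline-formula> the ith class. As a result, the optimal subspace <inline-formula><tex-math>E_{optimal}</tex-math></inline-formula> given by the LDA can be determined as
follows:</p>
      <disp-formula><tex-math>E_{optimal} = \arg\max_{E} \frac{|E^T S_B E|}{|E^T S_W E|} = [c_1, c_2, \ldots, c_{c-1}]</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>[c_1, c_2, \ldots, c_{c-1}]</tex-math></inline-formula> is the set of generalized eigenvectors of <inline-formula><tex-math>S_B</tex-math></inline-formula> and <inline-formula><tex-math>S_W</tex-math></inline-formula> corresponding to the
largest generalized eigenvalues <inline-formula><tex-math>\lambda_i</tex-math></inline-formula>, <inline-formula><tex-math>i = 1, 2, \ldots, c-1</tex-math></inline-formula>, i.e.,</p>
      <disp-formula><tex-math>S_B E_i = \lambda_i S_W E_i, \quad i = 1, 2, \ldots, c-1</tex-math></disp-formula>
      <p>Thus, the feature vector <inline-formula><tex-math>P</tex-math></inline-formula> for any query image <inline-formula><tex-math>X</tex-math></inline-formula> in the most discriminant sense can be
calculated as follows:</p>
      <disp-formula><tex-math>P = E_{optimal}^T U^T X</tex-math></disp-formula>
      <p>In our image retrieval task, LDA outputs only 1 dimension, since the classification problem for
each concept is 2-class.</p>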
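      <p>For the 2-class case, the generalized eigenproblem reduces to a single direction proportional to the inverse within-class scatter applied to the difference of the class means. The following Python sketch shows this special case under stated assumptions: labels are coded 0 and 1, a small ridge term (not part of the formulation in this section) is added so that the within-class scatter is invertible, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def fisher_direction(X, y, reg=1e-6):
    """Two-class LDA direction w proportional to S_W^{-1} (m1 - m0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum over classes of centered outer products.
    S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Ridge term (an assumption of this sketch) keeps S_W invertible.
    S_W += reg * np.eye(X.shape[1])
    w = np.linalg.solve(S_W, m1 - m0)
    return w / np.linalg.norm(w)

def project(X, w):
    """Project samples onto the single discriminant dimension."""
    return np.asarray(X, dtype=float) @ w
```
]]></preformat>
      <p>The projected values form the 1-dimensional feature that a classifier can then be trained on.</p>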
    </sec>
    <sec id="sec-4">
      <title>Fast classification using Least Squares Support Vector Machines</title>
      <p>
        In order to design fast image retrieval systems, we use the Least Squares Support Vector Machine
(LS-SVM). The SVM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] first maps the data into a higher dimensional feature space by some kernel
functions, and then learns a separating hyperplane that maximizes the margin. Because
of its good generalization capability, this technique has been widely applied in many areas such
as face detection and image retrieval [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The SVM is typically based on an ε-insensitive
cost function, meaning that approximation errors smaller than ε will not increase the cost function
value. This results in a quadratic convex optimization problem. Instead of using an ε-insensitive
cost function, a quadratic cost function can be used. The least squares support vector machines
(LS-SVM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are reformulations of the standard SVMs that lead to solving linear KKT systems
instead, which is computationally attractive. Thus, in all our experiments, we use
LS-SVMlab1.5 (http://www.esat.kuleuven.ac.be/sista/lssvmlab/).
      </p>
      <p>In our experiments, the RBF kernel</p>
      <disp-formula><tex-math>K(x_1, x_2) = \exp\left(-\|x_1 - x_2\|^2 / \sigma^2\right)</tex-math></disp-formula>
      <p>is selected as the kernel function of our LS-SVM. There is thus a corresponding parameter, σ, to be
tuned. A large value of σ<sup>2</sup> indicates stronger smoothing. Moreover, there is another parameter,
γ, which must be tuned to find the tradeoff between minimizing the complexity of the model
and fitting the training data points well.</p>
      <p>We set these two parameters over grids of candidate values; for σ<sup>2</sup> the grid was
σ<sup>2</sup> = [4 25 100 400 600 800 1000 2000].
A total of a hundred SVMs were thus constructed for each SVM model, and we then
selected the best SVM using the validation set.</p>
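      <p>To make the "linear KKT system" concrete, here is a minimal NumPy sketch of an LS-SVM classifier in the standard formulation of [6], using the RBF kernel of this section. It is an illustration and not the LS-SVMlab code used in our experiments; labels are assumed to be coded as -1 and +1, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K(a, b) = exp(-||a - b||^2 / sigma2) for all rows of A and B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def lssvm_train(X, y, gamma, sigma2):
    """Solve the LS-SVM KKT linear system; returns (alpha, b)."""
    n = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma2)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y                       # constraint: sum_i alpha_i y_i = 0
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]

def lssvm_score(X_train, y, alpha, b, X_new, sigma2):
    """Real-valued output; its sign is the class, its value a confidence."""
    return rbf_kernel(X_new, X_train, sigma2) @ (alpha * y) + b
```
]]></preformat>
      <p>Model selection then amounts to calling the training function for each (γ, σ<sup>2</sup>) pair of the grid and keeping the pair that scores best on the validation set.</p>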
    </sec>
    <sec id="sec-5">
      <title>Visual Dictionary Method</title>
      <p>
        The visual dictionary is an original method for annotating images, which improves on the
method proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We construct a Concept Visual Dictionary composed of visual words
intended to represent a semantic concept. The construction consists of five steps:
      </p>
      <p>Visual elements. Images are decomposed into visual elements where a visual element is an
image area, i.e. images are split into a regular grid.</p>
      <p>
        Representation of visual elements. We use the most classical and intuitive approach
consisting in representing a visual word by usual features HSV, GABOR, EDGE, and also PEF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
and DF [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Global Visual Dictionary. For each feature, we cluster visual elements using the K-Means
algorithm with a predefined number of clusters and using the Euclidean distance in order to
group visual elements and to smooth some visual artifacts. Then, for each cluster, we
select the medoid to be a visual word and to compose the visual dictionary of a feature.</p>
      <p>Image transcription. Based on the Global Visual Dictionary, we replace visual elements by
the nearest visual word in the visual dictionary. The image representation is then based
on the frequency of the visual words within the image for each feature.</p>
      <p>
        Concept Visual Dictionary. We select the most discriminative visual words for a given
concept to compose a Concept Visual Dictionary. To filter the words, we use an entropy-based
reduction, which is developed from work carried out in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
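      <p>The "Global Visual Dictionary" and "Image transcription" steps above can be sketched as follows. This is a simplified illustration with a plain Lloyd-style K-Means written in NumPy; grid splitting, feature extraction and the entropy-based filtering of the Concept Visual Dictionary are left out, and the function names, iteration count and seed are assumptions of the example.</p>
      <preformat><![CDATA[
```python
import numpy as np

def nearest(X, words):
    """Index of the nearest word (Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Basic Lloyd iterations; returns centroids and cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, nearest(X, centers)

def medoids(X, labels, k):
    """Visual words: in each cluster, the member with the smallest
    summed distance to the other members. Empty clusters are skipped."""
    words = []
    for j in range(k):
        members = X[labels == j]
        if len(members) == 0:
            continue
        diff = members[:, None, :] - members[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=2))
        words.append(members[d.sum(axis=1).argmin()])
    return np.array(words)

def transcribe(elements, words):
    """Frequency of each visual word among the elements of one image."""
    counts = np.bincount(nearest(elements, words), minlength=len(words))
    return counts / counts.sum()
```
]]></preformat>
      <p>One such dictionary would be built per feature, and each image would be described by one frequency vector per feature.</p>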
      <p>
        In a second step, we propose an adaptation of the common text-based paradigm to annotate
images. We use the tf-idf weighting scheme [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in the vector space model, together with cosine
similarity, to determine the similarity between a visual document and a concept. To use this
scheme, we represent an image by the frequency of the visual words within the image for different
features: HSV, GABOR, PEF, EDGE and DF.
      </p>
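      <p>A minimal sketch of tf-idf weighting with cosine similarity on visual-word counts is given below. How a concept is turned into a vector is not detailed in this section; the usage note assumes, for illustration only, that a concept is represented by the mean tf-idf vector of its positive training images. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def tfidf(counts):
    """counts: (n_images, n_words) visual-word counts.
    Returns the tf-idf matrix and the idf vector."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)          # number of images containing each word
    idf = np.log(len(counts) / np.maximum(df, 1))
    return tf * idf, idf

def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))
```
]]></preformat>
      <p>For example, with <monospace>W, idf = tfidf(counts)</monospace>, a concept vector can be taken as <monospace>W[positives].mean(axis=0)</monospace> and each image scored with <monospace>cosine(W[i], concept)</monospace>. A visual word that occurs in every image receives an idf of zero and therefore does not influence the similarity.</p>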
    </sec>
    <sec id="sec-6">
      <title>Experimental Results</title>
      <p>The SVM-based model used to implement image retrieval in the task is shown in Figure 1 and
contains the following steps:</p>
      <p>Step 1) Split the VCDT labeled image dataset into 2 sets, namely a training set and a
validation set.</p>
      <p>Step 2) Extract the visual features from the training image data using our extraction method;
learn and perform LDA reduction on these features; train a large number of SVMs with different
parameters.</p>
      <p>Step 3) Use the validation set to select the best model.</p>
      <p>Step 4) Extract the visual features from the VCDT test image database using our extraction
method; perform LDA reduction on these features; and then use the best model to find the best
discriminant feature.</p>
      <p>Step 5) Sort the test images by their distances from the positive training images and produce
the final rank result.</p>
      <p>The same training and development sets have been used for the VD and SVM training. We
submitted five runs to the official evaluation, of which the two best are described here:</p>
      <p>Run SVM(LDA).
It consists in performing SVM on the LDA of the [ PEF150 + HSV + EDGE + DF ] features to
reduce the impact of the curse of dimensionality. The test on 10K images and 50 topics
took 2 minutes on a standard Pentium IV PC.</p>
      <p>Run VD.
It is a vector search system, using small icons from the images. The visual features are
HSV, edge, and Gabor. This model needs only 2 hours of training on a Pentium IV at 3 GHz with
4 GB of RAM, and testing is faster than with SVM.</p>
      <p>The Area Under the Curve (the integral of the Receiver Operating Characteristic) for each topic and
method is depicted in Figure 2.</p>
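      <p>For reference, the AUC of one topic can be computed directly from the ranking scores as the fraction of (positive, negative) image pairs that are ordered correctly, with ties counted as one half. The sketch below is a generic illustration of this definition, not the official evaluation script; the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.
    labels: 1 (or True) for relevant images, 0 (or False) otherwise."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]      # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)
```
]]></preformat>
      <p>The pairwise matrix uses memory proportional to the number of positives times the number of negatives, which is acceptable for a test set of about 10K images; a rank-based formula would be preferable for much larger sets.</p>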
      <p>
        We show in Table 2 that the SVM(LDA([PEF150+HSV+EDGE+DF])) is better than VD, with
AUC = 0.72. It is our best run and occupies the 8th rank among the 19 participating teams. The
same SVM(LDA) strategy has been applied to another set of features (the AVEIR group features
described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), but it results in AUC = 0.50.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>The SVM model has a higher average AUC than VD (0.722 against 0.682), but VD is lighter
than SVM and is, for some topics, better. Table 3 lists the worst and best
topics for VD compared to SVM. The worst are, for example, "Snow, Winter, Sky, Desert, Beach
...", which may be topics with one clear visual representation; for example, we can imagine a
dominant color and texture for snow or sky. On the contrary, VD is better than SVM for
"Flower, Vehicle, Food, Autumn,...", which may be concepts with higher visual variations in
color and texture. This suggests that the statistics of simpler visual concepts may be better
modeled by SVM, while more complex visual concepts may be better represented by our Visual
Dictionary model. The respective performances of these two models should also be related to the
number of training samples. We are currently investigating this promising improved VD,
and we propose an optimal fusion with SVM in order to benefit from the properties of both.</p>
      <p>[Figure legend: SVM (−) and Visual Dico (o−−)]</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>This work was supported by French National Agency of Research (ANR-06-MDCA-002) and
Research Fund for the Doctoral Program of Higher Education of China (200803591024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Statistical learning theory</article-title>
          . John Wiley, New York (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Information retrieval and robust perception for a scaled multi-structuration, Thesis for habilitation of research direction</article-title>
          , University Sud Toulon-Var,
          <source>Toulon</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ayache</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Efficient Image Concept Indexing by Harmonic &amp; Arithmetic Profiles Entropy</article-title>
          ,
          <source>IEEE International Conference on Image Processing, Cairo, Egypt, November 7-11</source>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Waring</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Face detection using spectral histograms and SVMs</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part B</source>
          ,
          <volume>35</volume>
          (
          <issue>3</issue>
          ),
          <fpage>467</fpage>
          –
          <lpage>476</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Support Vector Machine active learning for image retrieval</article-title>
          ,
          <source>In Proceedings of the ninth ACM international conference on Multimedia, Canada</source>
          , pp.
          <fpage>107</fpage>
          –
          <lpage>118</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Suykens</surname>
            ,
            <given-names>J.A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandewalle</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Least Squares Support Vector Machine Classifiers</article-title>
          ,
          <source>In Neural Processing Letters</source>
          ,
          <volume>9</volume>
          ,
          <fpage>293</fpage>
          –
          <lpage>300</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Smach</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lemaitre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gauthier</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miteran</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Generalized Fourier Descriptors with Applications to Objects Recognition in SVM Context</article-title>
          ,
          <source>J. Math Imaging Vis</source>
          <volume>30</volume>
          ,
          <fpage>43</fpage>
          –
          <lpage>71</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Dumont</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merialdo</surname>
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Video search using a visual dictionary</article-title>
          .
          <source>In CBMI 2007, 5th International Workshop on Content-Based Multimedia Indexing</source>
          , June 25-27,
          <year>2007</year>
          , Bordeaux, France (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McGill</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          :
          <source>Introduction to Modern Information Retrieval</source>
          ,
          <publisher-name>McGraw-Hill, Inc.</publisher-name>
          , New York, NY, USA (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jensen</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shen</surname>
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Fuzzy-Rough Data Reduction with Ant Colony Optimization</article-title>
          ,
          <source>In Fuzzy Sets and Systems</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Belhumeur</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hespanha</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegman</surname>
            ,
            <given-names>D.J.:</given-names>
          </string-name>
          <article-title>Eigenfaces versus Fisherfaces</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Machine Intell</source>
          .
          <volume>19</volume>
          ,
          <fpage>711</fpage>
          –
          <lpage>720</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Glotin</surname>
            <given-names>H.</given-names>
          </string-name>
          et al.:
          <article-title>Comparison of Various AVEIR Visual Concept Detectors with an Index of Carefulness</article-title>
          , In ImageClef09 proceedings (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Nowak</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunker</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF 2009 Large Scale - Visual Concept Detection and Annotation Task</article-title>
          ,
          <source>CLEF working notes 2009</source>
          , Corfu, Greece, (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>