<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel Structural-Description Approach For Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christoph Rasche</string-name>
          <email>rasche15@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitatea Politehnica din Bucuresti Bucuresti 061071, RO</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>We tested our image-classification methodology in the photo-annotation task of the ImageCLEF competition [Nowak, 2010] using a visual-only approach that performs automated labeling. Our labeling process consisted of three phases: 1) feature extraction using color histogramming and a novel method of structural description, which was exploited in a statistical manner only; 2) classification using Linear Discriminant (LD) or Average-Retrieval-Rank (ARR) methods, which provided the (scalar) confidence values that were then thresholded to obtain the binary values; 3) elimination of labels (setting binary values to 0) on the testing set, exploiting the joint probabilities calculated for pairs of concepts on the training set. The results show that our present system performs better on 'whole-image' labels than on object labels.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The main novelty of the presented approach is the use of a decomposition of
structure as introduced in Rasche [Rasche, 2010]. The decomposition output is
particularly suited to represent the geometry of contours and the geometry of
their relations (pairs or clusters of contours), but it is applied here only in a
statistical form for reasons of simplicity [Rasche, 2011], together with a color
histogramming approach as described in Vertan et al. [Vertan and Boujemaa, 2000b].
This statistical classification has already been shown to be useful for video
indexing [Ionescu et al., 2010].</p>
      <p>Looking at the provided photo annotations, we realized that the spatial size
of the annotated object or scene can vary substantially relative to the image
size: an annotation can describe the image content as a whole, and is thus
suitable for (semantic) image classification, or it can describe a part of a scene
(e.g. isolated objects), and is thus rather suited for object-detection systems. A
clear distinction between whole and part annotations is of course difficult, but in
our opinion desirable to better exploit the annotations, e.g. by providing a
scalar value denoting the size of the object (1 = whole image, 0.3 = part/object
covering ca. one third of the image). A typical recognition system is specialized
for one process, either image classification or object detection. Our methodology
is geared toward image classification and is therefore of limited use for 'part'
annotations.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Feature Extraction</title>
        <p>Color and texture characterization: The classical histogram approach to
image-content description was refined by classifying the image pixels into
several classes according to a local attribute (such as the edge strength). One
can, for instance, use three classes, consisting of pixels characterized by small,
medium and high edge strength. The number of classes is thus related to the
number of quantization levels of the pixel attribute. At the limit, since every
pixel then acquires a supplementary, highly relevant characteristic, one could
use a one-pixel-per-class approach, which would certainly provide a very
accurate description of the image, but would require a very large representation.</p>
        <p>In order to keep the balance between the histogram size and the
discrimination between pixels we propose to adaptively weight the contribution of each
pixel of the image into the color distribution [Vertan and Boujemaa, 2000b]. This
individual weighting allows a finer distinction between pixels having the same
color and the construction of a weighted histogram that accounts for both the
color distribution and statistical non-uniformity measures. Thus, we will use a
modified histogram, defined as:
h(c) = (1/(MN)) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} w(i, j) δ(f(i, j) − c), ∀c ∈ C    (1)</p>
        <p>In the equation above, w(i, j) is the weighting coefficient of the color at spatial
position (i, j). We may notice that, since w(i, j) must be a scalar, we cannot use
any color statistics (which are necessarily vector triples).</p>
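        <p>As a concrete illustration, the weighted histogram of equation (1) can be sketched in numpy; this is our own minimal example (not the authors' code), assuming uniform quantization with 6 bins per RGB channel as used for the 216-dimensional color vector in this paper.</p>
        <preformat>
```python
import numpy as np

def weighted_histogram(img, weights, bins_per_channel=6):
    """Weighted color histogram: h(c) = (1/MN) * sum_ij w(i,j) * delta(f(i,j) - c).

    img: (M, N, 3) uint8 RGB image; weights: (M, N) scalar weights w(i, j).
    Returns a histogram with bins_per_channel**3 entries (216 for 6 bins).
    """
    M, N, _ = img.shape
    # Uniformly quantize each RGB channel into bins_per_channel levels (0..5).
    q = (img.astype(np.int64) * bins_per_channel) // 256
    # Map each quantized (r, g, b) triple to a single bin index c.
    c = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    # Accumulate the per-pixel weights into the color bins.
    h = np.bincount(c.ravel(), weights=weights.ravel(),
                    minlength=bins_per_channel ** 3)
    return h / (M * N)
```
        </preformat>
        <p>Setting all weights to 1 recovers the classical normalized histogram; the Laplacian-based weights described next plug in directly.</p>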
        <p>Intuitively, accounting for local measures of each pixel within the color
distribution can be considered a way of integrating both color and texture,
provided that the local measure has a textural background. The
Laplacian-weighted histograms [Vertan and Boujemaa, 2000b], [Vertan and Boujemaa, 2000a]
are defined as:</p>
        <p>h̃(c) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} δ(f(i, j) − c) · 1/(1 + Δ²(i, j)), ∀c ∈ C, or    (2)</p>
        <p>h̃(c) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} δ(f(i, j) − c) Δ²(i, j), ∀c ∈ C.    (3)</p>
        <p>
The relation (2) emphasizes the weight of pixels that belong to constant
(uniform) regions: their Laplacian is very small, so they are summed with a
near-unitary weight; pixels placed on edges are characterized by a large Laplacian
and thus their contribution to the corresponding c bin is very small. This behavior
is meant to reduce the influence of the uncertain colors situated at the border
between different objects, and is derived from the gray-scale practice of
choosing segmentation thresholds at the minima of the histogram. The relation
(3) corresponds to a dual behavior, counting the colors proportionally to
their edge strength.</p>
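        <p>A minimal numpy sketch of these weights (our own, assuming a 4-neighbor discrete Laplacian with replicated borders; the paper does not specify the exact discretization):</p>
        <preformat>
```python
import numpy as np

def laplacian_weights(gray, emphasize_edges=False):
    """Per-pixel weights from the Laplacian, as in relations (2) and (3).

    gray: (M, N) float grayscale image. Returns (M, N) weights:
    1 / (1 + lap**2) to favor uniform regions (relation 2), or lap**2
    to favor edge pixels (relation 3).
    """
    # 4-neighbor discrete Laplacian, with edge replication at the border.
    p = np.pad(gray, 1, mode='edge')
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * gray
    return lap ** 2 if emphasize_edges else 1.0 / (1.0 + lap ** 2)
```
        </preformat>
        <p>With emphasize_edges=False the function implements the damping factor of relation (2); with True it implements relation (3).</p>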
        <p>Colors are uniformly quantized with 6 bins per RGB color component,
yielding a 216-component feature vector per image.</p>
        <p>
          Structure characterization Images were downsampled to a maximum size of
300 pixels for any side length (width or height) to decrease computation time.
The structural processing started with contour extraction
          <xref ref-type="bibr" rid="ref1">([Canny, 1986])</xref>
          at 4
different scales (sigma=1,2,3 and 5). Contours were then partitioned and
represented as described in
          <xref ref-type="bibr" rid="ref5">(Rasche 2010)</xref>
          leading to 7 geometric and 5 appearance
parameters for each contour segment (arc, ’wiggliness’, curvature, circularity,
edginess, symmetry, contrast, ’fuzziness’). Contour segments are then paired and
clustered, leading to another 58 parameters describing various distance
measurements (between segment end and center points) and structural biases (degree
of parallelism, T feature, L feature,...), see [Rasche, 2010] for details. For each
parameter a 10-bin histogram is generated; the histograms are then
concatenated to form a single vector of 700 dimensions. The average processing time for
structural processing is ca. 40 seconds per image on a 2.6 GHz machine.
        </p>
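        <p>The histogram-concatenation step can be sketched as follows; the helper and its defaults are our own illustration (10 bins per parameter, and 70 parameters giving the 700-dimensional structural vector):</p>
        <preformat>
```python
import numpy as np

def descriptor_histograms(param_matrix, n_bins=10):
    """Concatenate one normalized n_bins histogram per parameter column.

    param_matrix: (n_segments, n_params) array of structural parameters;
    70 parameters with 10 bins each give the 700-dimensional vector.
    """
    feats = []
    for k in range(param_matrix.shape[1]):
        col = param_matrix[:, k]
        # Histogram over the observed range of this parameter.
        h, _ = np.histogram(col, bins=n_bins,
                            range=(col.min(), col.max() + 1e-12))
        feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)
```
        </preformat>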
        <p>- Integration: The color and structural parameters are then concatenated to
a single image vector with 916 dimensions (700 structural and 216 color
parameters).
</p>
        <p>Classification
- LDA: A Linear Discriminant Analysis was applied to train a one-versus-all
classifier for each of the 93 concepts (on the 8000 training images). This resulted
in an average number of 24.9 labels per photo, more than twice the
average number of labels per training image (12.0). The posterior values of the
classifier are provided as confidence values.</p>
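        <p>A minimal two-class Gaussian LDA with shared covariance, returning the posterior of the positive class as the confidence value, can be sketched as follows (our own numpy version, not the authors' implementation; one such model would be trained per concept):</p>
        <preformat>
```python
import numpy as np

def fit_binary_lda(X, y):
    """Two-class LDA with shared covariance; returns a function mapping
    samples to the posterior P(y=1 | x), used as the confidence value."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance, regularized for numerical stability.
    S = (np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1))
    S = S / (len(X) - 2) + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    b = -0.5 * (m0 + m1) @ w + np.log(len(X1) / len(X0))
    # Sigmoid of the linear discriminant gives the posterior.
    return lambda Xq: 1.0 / (1.0 + np.exp(-np.clip(Xq @ w + b, -50, 50)))
```
        </preformat>
        <p>Thresholding these posteriors yields the binary labels.</p>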
        <p>- ARR: The concepts for any test image are assigned based on a weighted
average retrieval rank (ARR) of all training images retrieved following the query by
example with the said test image. The binary concepts are obtained by a
conceptadaptive threshold; the concept thresholds are computed based on the training
image set annotations under the assumption that the test image database is
statistically similar to the training image database.</p>
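        <p>One way to realize such a concept-adaptive threshold is to match each concept's firing rate on the confidence scores to its frequency in the training annotations; the following sketch is our own reading of the description above:</p>
        <preformat>
```python
import numpy as np

def concept_thresholds(conf_train, labels_train):
    """Per-concept thresholds, assuming the test set is statistically
    similar to the training set: each concept should fire at the same
    rate as it occurs in the training annotations.

    conf_train: (n_images, n_concepts) confidence scores;
    labels_train: same shape, binary ground-truth annotations.
    """
    rates = labels_train.mean(axis=0)     # occurrence rate per concept
    thr = np.array([np.quantile(conf_train[:, c], 1.0 - rates[c])
                    for c in range(conf_train.shape[1])])
    return thr
```
        </preformat>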
      </sec>
      <sec id="sec-2-2">
        <title>Label Elimination</title>
        <p>Because the LDA method (see above) returned a much larger proportion of
labels for the testing set (24.9 labels/image) than for the training set (12.0), we
attempted to reduce the number of labels by eliminating unlikely labels based on
the joint-probabilities observed in the training set. Within the training set, we
determined which pairs appeared as mutual exclusive (joint probability equal
0). If a testing image contained a pair of labels that are mutual exclusive in
the training set, then the one label (of the pair) was eliminated that showed
a lower posterior value (obtained from the LDA classifier) in reference to the
entire distribution of posterior values for each concept. After label elimination,
the average number of labels was lower by ca. 6 labels (see last column in
Table 1).</p>
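        <p>The elimination rule can be sketched as follows (a hypothetical implementation; ranking each posterior within its concept's score distribution is our reading of "in reference to the entire distribution of posterior values"):</p>
        <preformat>
```python
import numpy as np

def mutually_exclusive_pairs(Y_train):
    """Boolean matrix of concept pairs that never co-occur in training."""
    co = Y_train.T @ Y_train          # co-occurrence counts per concept pair
    excl = (co == 0)
    np.fill_diagonal(excl, False)
    return excl

def eliminate_labels(Y_pred, conf, excl):
    """For each mutually exclusive pair present on a test image, drop the
    label whose posterior ranks lower within its own concept's scores."""
    # Percentile rank of each confidence within its concept column.
    ranks = conf.argsort(axis=0).argsort(axis=0) / max(len(conf) - 1, 1)
    Y = Y_pred.copy()
    n_concepts = Y.shape[1]
    for i in range(len(Y)):
        for a in range(n_concepts):
            for b in range(a + 1, n_concepts):
                if excl[a, b] and Y[i, a] and Y[i, b]:
                    # Drop the lower-ranked label of the pair.
                    drop = min((a, b), key=lambda k: ranks[i, k])
                    Y[i, drop] = 0
    return Y
```
        </preformat>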
      </sec>
      <sec id="sec-2-3">
        <title>Runs</title>
        <p>All runs contained the same structural preprocessing, but differed in their choice
of color processing; thus, each run was tested with 916 parameters. The runs
differed also in the choice of the classifier method and whether label elimination
was used.
Runs 1 and 2 performed substantially better than the other three runs for
the MAP (Mean Average Precision; concept-based) and the F-ex measure
(F-measure; example-based). For the OS-fcs (Ontology Score with Flickr Context
Similarity costmap; example-based), runs 1 and 2 also performed better, but not
as distinctly as for the other two measures.</p>
        <p>Thus, ARR classification without label elimination outperforms LDA
classification with label elimination, but whether the performance difference is due
to the choice of classifier (LDA) or to the attempt to eliminate labels cannot be
determined.</p>
        <p>The detailed results per concept are shown in figure 1. Concepts with low
average precision are skateboard, horse, cat, fish, etc. and tend to be objects,
some of them likely part of the image only. Concepts with high precision are
neutral illumination, no visual season, no blur, no persons etc. and tend to be
whole-image annotations. This is what we roughly expected.</p>
        <p>[Figure 1: average precision per concept, concepts sorted along the x-axis; y-axis: average precision (0.1 to 1.0).]</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.2 Comparison to other groups</title>
        <p>In comparison to other classification systems, our best result (run no. 1) ranks
22nd out of 46 for the first two measures (MAP and F-ex) and 15th for the
ontology-score measure (OS-fcs).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>Although we applied our structural decomposition in a statistical manner only and
on down-scaled image resolutions, it already achieved a performance comparable
to other approaches. We do not expect a much better performance if the full
image resolution were employed; rather, the long-term improvement lies in exploiting
the individual contours and their relations, for which a proper learning algorithm
needs to be developed [Rasche, 2011]. The fact that image size is not crucial for
the extraction of semantic meaning - at least not for a human observer - was well
pointed out by Torralba and co-workers [Torralba et al., 2008]. A quick way
to improve the present performance of our system could be to merge it with one
of the appearance-based methods [Shotton et al., 2008, Heitz et al., 2009].</p>
      <p>That label elimination did not lead to a significant improvement was
unexpected; indeed, it may even have been detrimental. But we still think that a
proper exploitation of the joint probabilities can lead to a better performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Canny</source>
          , 1986] Canny,
          <string-name>
            <surname>J.</surname>
          </string-name>
          (
          <year>1986</year>
          ).
          <article-title>A computational approach to edge-detection</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>679</fpage>
          -
          <lpage>698</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Heitz et al.,
          <year>2009</year>
          ] Heitz,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Elidan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Packer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , and
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Shapebased object localization for descriptive classification</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>84</volume>
          :
          <fpage>40</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Ionescu et al.,
          <year>2010</year>
          ] Ionescu,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Rasche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Vertan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            , and
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>A contour-color-action approach to automatic classification of several common video genres</article-title>
          .
          <source>In AMR 8th International Workshop on Adaptive Multimedia Retrieval. Linz</source>
          , Austria.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Nowak</source>
          , 2010] Nowak,
          <string-name>
            <given-names>S.</given-names>
          </string-name>
          and Huiskes,
          <string-name>
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>New strategies for image annotation: Overview of the photo annotation task at imageclef 2010</article-title>
          .
          <source>In the Working Notes of CLEF</source>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Rasche</source>
          , 2010] Rasche,
          <string-name>
            <surname>C.</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>An approach to the parameterization of structure for fast categorization</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>87</volume>
          :
          <fpage>337</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Rasche</source>
          , 2011] Rasche,
          <string-name>
            <surname>C.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Contour groupings and their description for structural recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , Under Review.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Shotton et al.,
          <year>2008</year>
          ] Shotton,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , and
            <surname>Cipolla</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Multi-scale categorical object recognition using contour fragments</article-title>
          .
          <source>IEEE Transactions of Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>30</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1270</fpage>
          -
          <lpage>1281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Torralba et al.,
          <year>2008</year>
          ] Torralba,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            , and
            <surname>Freeman</surname>
          </string-name>
          , W. T. (
          <year>2008</year>
          ).
          <article-title>80 million tiny images: a large dataset for non-parametric object and scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)</source>
          ,
          <volume>30</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1958</fpage>
          -
          <lpage>1970</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Vertan and Boujemaa</source>
          , 2000a]
          <string-name>
            <surname>Vertan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2000a</year>
          ).
          <article-title>Spatially constrained color distributions for image indexing</article-title>
          .
          <source>In Proc. of CGIP</source>
          <year>2000</year>
          , pages
          <fpage>261</fpage>
          -
          <lpage>265</lpage>
          ,
          Saint-Etienne, France.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Vertan and Boujemaa</source>
          , 2000b]
          <string-name>
            <surname>Vertan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2000b</year>
          ).
          <article-title>Upgrading color distributions for image retrieval: Can we do better</article-title>
          ? In Laurini, R., editor,
          <source>Advances in Visual Information Systems</source>
          , volume
          <volume>1929</volume>
          of Lecture Notes in Computer Science (LNCS), pages
          <fpage>178</fpage>
          -
          <lpage>188</lpage>
          . Springer Verlag, Berlin, Germany.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>