<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Ob ject Recognition with Segment Faces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Falk Schmidsberger</string-name>
          <email>fschmidsberger@hs-harz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frieder Stolzenburg</string-name>
          <email>fstolzenburg@hs-harz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Harz University of Applied Sciences, Automation and Computer Sciences Department</institution>
          ,
          <addr-line>Friedrichstr. 57-59, 38855 Wernigerode</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>21</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>Recognizing objects from images becomes a more and more important research and application topic. There are diverse applications such as face recognition, analysis of aerial images from multicopters, object tracking, image-based web search, etc. Many existing approaches focus on shape retrieval from a single polygon of contour points, or they try to compare clouds of interesting points of an object. However, human object recognition concentrates on few points of the segments forming the object. Clearly, complex objects, strictly speaking the projections of their shape on the image plain, consist of several (polygonal) segments. Therefore, the procedure presented in this paper takes this into account, by composing objects hierarchically into a group of segments. We briefly introduce our procedure for semantic object recognition based on clusters of image segment contours and discuss the problem of recognizing objects from different perspectives.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Already in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], it has been stated, that the contour points of an object, for
instance of a cat, are of particular importance for human semantic object
recognition. By semantic, we mean in this context that we do not only want to recognize
abstract geometric forms but real complex objects, given by (non-preprocessed)
example images, which are not characterized by a single contour, but by a group
thereof. The surface of a complex object can be considered as consisting of
several polygons. When such an object is viewed from a certain viewpoint, its image
may be perspectively distorted. Nevertheless, several features remain invariant,
for instance segment neighborhood relations, among others. Therefore, we focus
in our approach on segment contours and their adjacency relations. Our overall
procedure of object recognition roughly works as follows (cf. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]):
      </p>
      <p>
        Each object in an image is decomposed into segments with different shapes
and colors. In order to recognize an object, e.g. a house, it is necessary to find out
which segments are typical for this object and in which neighborhood of other
segments they occur. A group of typical and adjacent segments for a certain
object defines the whole object in the image. Similar segments are clustered.
A hierarchical composition of these segment clusters enables model building,
taking into account the spatial relations of the segments in the image. The
procedure employs methods from machine learning, namely k-means clustering
and decision trees with boosting [
        <xref ref-type="bibr" rid="ref3 ref5 ref9">3, 5, 9</xref>
        ], and from computer vision, e.g. image
pyramid segmentation and contour signatures [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The rest of this paper is organized as follows: First, we introduce our
procedure for semantic object recognition in some more detail (Sect. 2). After that, we
consider the influence of perspective projection on object recognition (Sect. 3).
Next, we provide a brief evaluation of the approach (Sect. 4) and discuss some
other approaches on object recognition (Sect. 5). Finally, we summarize and
conclude the paper (Sect. 6).
2</p>
    </sec>
    <sec id="sec-2">
      <title>Semantic Object Recognition</title>
      <p>In our procedure for semantic object recognition, at first, we train models with
sample images for each object category. After training, the models are used to
recognize objects in other images. This is done as sketched in Fig. 1. Most of
the steps and data in this process (marked in black in the figure) are identical
for the training and the recognition phase. The steps and data that are relevant
only for the training phase are marked with blue boxes and arrows, whereas the
data and steps that are relevant only for the recognition phase are marked in
green.
1. Image optimization and segmentation:</p>
      <p>
        For each pixel in the image, similar neighboring pixels are colored with a
uniform color by a flood fill algorithm. With an image pyramid segmentation
algorithm, the shapes of the resulting blobs of uniform color are extracted
as image segments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
2. Segment feature vector extraction and normalization:
      </p>
      <p>
        A feature vector is computed for each segment, using the data of four
normalized distance histograms, computed from the segment contour. A
distance histogram consists of a vector of distances computed with several
related methods: polar distance, contour signature, and ray distance [
        <xref ref-type="bibr" rid="ref1 ref10 ref15">1, 10, 15</xref>
        ]
(see Fig. 2).
3. Compute/use cluster models over the feature vectors:
During training, a cluster model with all feature vectors from all images of
one category is created. Each cluster represents a familiar segment of the
actual object category. During recognition, the cluster model of a category
is used to select the familiar segments in the actual image.
4. Segment tree creation, using only familiar segments of a category:
A segment tree comprises typical adjacent segments and the hierarchical
composition of segment clusters for a certain object. It defines the whole
object in the image. This histogram method is invariant against translation,
rotation, scaling, and partially also to perspective distortions.
5. Compute/use the decision tree models:
      </p>
      <p>During training, a decision tree model from the data of the segment tree for
each object category is created. During recognition, the decision tree model
of each category is used to recognize objects in the actual image.
image, category data base
copter camera
actual camera image
images, categories 1 .. n
actual camera image
optimized image
image segments
feature vectors
(*) (*) (*)
... cm_n
cm_2
s_2</p>
      <p>...
dt_2
cluster
analysis
category</p>
      <p>2
st_2
s_2
category
affiliation
prediction
marking
category
object
...
...
s_n
...
...</p>
      <p>ca_n
1
3
2</p>
      <p>4
1 – minimal tangential contour signatur distance
2 – maximal polar distance
3 – maximal ray distance through centroid
4 – maximal tangential contour signatur distance
Let us now consider the problem of perspective distortion in more detail, where
objects are viewed from different viewpoints. In this context, the shape of an
object is more or less defined by its surface. For the sake of simplicity, we
assume that the surface is given as a polygon mesh. One question then is, what
happens with the contour of a polygonal segment on the object surface, when
it is viewed from different perspectives. Here, we restrict attention to the case
where the segment is completely visible and not partially hidden. Let us consider
the projective image of a cube as an example (Fig. 3). Clearly, by perspective
projection, angles between lines and lengths of lines including their ratios may
deviate significantly from their original values. In addition, parallelism of lines
is not preserved, too.</p>
      <p>Formally, the image of an object can be roughly described by central
projection. Central projection (planar 3D projection) is determined in essence by
the point of the observer, called central or focal point c and the position of the
image plane, in particular its normal vector v, both consisting of 3D coordinates.</p>
      <p>This gives us approximately 6 parameters. In contrast, a polygonal surface with
n vertices is determined by n 2D coordinates, giving us 2n values. Since 2n &gt; 6
for n ≥ 4, it follows that, although the shape of a polygon may vary widely by
perspective distortion (cf. Fig. 4 and 5), for polygons with 4 or more vertices,
there are clear limits for the distortion.</p>
      <p>
        In fact, there are several invariants during perspective projection: First of all,
straight lines remain straight lines. This implies that polygons remain polygons
with the same number and order of vertices. Furthermore, left-right relations,
which are important for localization, navigation and exploration [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], stay
invariant, provided that the polygons are always viewed from the same side, which is
usually the outside of the object. From this it follows in particular, that convex
edges remain convex and never become concave by perspective projection, and
vice versa. Therefore, the image in Fig. 6 definitely cannot be the projection of a
cube, because the leftmost polygon in the image is concave, whereas a cube has
only squares on its surface that are convex. Last but not least, adjacent segments
remain adjacent. Hence, the image segment tree remains the same, although of
course not all polygonal segment faces may be visible from all viewpoints.
      </p>
      <p>In principle, full perspective projection can be taken into account: For this,
a point x of a segment contour, given as 3D column vector in homogeneous
coordinates, i.e. with an additional component (cf. [7, Sect. 5.6]), is projected
onto the 2D plane, at position y, that is a 2D column vector (also in homogeneous
coordinates). The corresponding mapping M0 : x → y consists first of a rotation
R, that is 3 × 3 orthonormal matrix, and a translation T , a 3D column vector,
and then the actual projection P from distance −d, as follows (cf. [7, Sect. 6.4]):
 1 0 0 0 </p>
      <p>P</p>
      <p>
        Let now y1 and y2 be the projections of a given object point x, whose position
usually is not known, on two images. Therefore, we have yi = Mi · x for i = 1, 2,
where M1 and M2 are mappings as above, in general different. From this, we
obtain (1) λ y2 = M · y1 with M = M2 · M1†, where † denotes the Moore-Penrose
pseudoinverse operator and λ a scale factor, which is needed because the 3 × 3
matrix M , mapping the homogeneous 2D coordinates, is not affine in general, i.e.,
its last row may be different from [0 · · · 0 1]. Eq. (1) can be expressed equivalently
without reference to λ by the cross product y2 × (M · y1) = 0. This leads to three
linear equations in the 9 components of the matrix M , of which only two are
linearly independent however, for each projection point pair (y1, y2). Given n
such pairs, we arrive at the matrix equation A · m = 0, where A is a (2n) × 9
matrix and m is the matrix M reshaped as column vector (cf. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]).
      </p>
      <p>Since A is overdetermined in general and it must be m = 0, the solution for
A · m = 0 can be found by minimizing the squared (L2) vector norm (2) ε =
|A·m|2 with respect to the condition |m| = 1. Eq. (2) is equivalent to ε = (A·m) ·
(A·m) = m ·(A ·A)·m, where denotes transposition, thus (A ·A)·m = ε m
(after multiplication with m). Hence, the problem of determining whether two
segment contours stem from the same object, taking perspective distortion into
account, can be reduced to an eigenvalue problem. As distance measure in the
clustering procedure, the smallest eigenvalue ε of the 9 × 9 matrix (A · A) can
be used. Nevertheless, a normalization with respect to the starting point of the
segment contour has to be done.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation of the Approach</title>
      <p>
        The object recognition method with distance histograms (as described in Sect. 2)
has been implemented in C++/OpenCV [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by the first author. For the full
treatment of perspective distortion (Sect. 3), so far only an implementation in
Matlab/Octave [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by the second author is available. All in all, our object
recognition procedure works in practice: The segment neighborhood relations in the
image segment tree remain invariant, even after perspective distortion. Rotation
and translation of polygons is treated by length normalization. Although
projections may vary a lot, they often show still strong similarity to the original
image. The clustering of different segments takes this into account. The
experiments with our C++ implementation for object recognition with recognition
rates usually between 50 and 100 % – far above chance – are encouraging in this
direction.
      </p>
      <p>
        To test and improve the first implemented algorithm in a controlled
environment, it was used to classify images from the butterfly image dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For all
seven categories, the right category of an image is predicted with a success rate
of 99.5 % if the image is from the training set and 27.14 % if the image is from the
test set. A random guess would give us only a success rate of 1/7 = 14.28 %. On
images made by the first author the success rates were 100.00 % and 46.00 % (5
categories). Here, a random guess would have a success rate of 1/5 = 20 % only. It
takes about 0.7 seconds to classify a live image, which need not be pre-segmented
into foreground and background.
5
      </p>
    </sec>
    <sec id="sec-4">
      <title>Related Works</title>
      <p>
        The problem of recognizing and locating objects is very important in applications
such as robotics and navigation. Therefore, there are numerous related works.
The survey [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] reviews literature on both the 3D model building process and
techniques used to match and identify free-form objects from imagery, including
recognition from 2D silhouettes.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] presents an object recognition system that uses local image features,
which are invariant to image scaling, translation, and rotation, and partially
invariant to illumination changes and affine or 3D projection. This proposed
model shares properties with the object recognition in primate vision. A
nearestneighbor indexing method is employed that identifies candidate object matches.
This approach is very successful in practice. However, as already said in the
introduction, here more or less only points of clouds and not groups of segment
faces are considered, which appears to be more cognitively adequate.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] performs shape retrieval by considering qualitative relations. This means
the qualitative relations of the line segments forming the contour of the polygon
are considered during the object recognition phase. The approach is eventually
based on the so-called double-cross calculus [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Each object is identified by
exactly one contour built from more and more polygon vertices. The approach is
successful, however it does not reflect the fact, that complex objects may consist
of several segment contours.
6
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        With our approach, we can recognize objects in digital images, independent
of scaling, translation, rotation, and perspective distortions of the object. To
do this, we train and use models based on the segment shapes of the objects
and the topological and spatial relations of these segments. The next step is to
implement the approach as a real-time object recognition process on autonomous
multicopters (cf. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alegre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Alaiz-Rodr´ıguez, R.,
          <string-name>
            <surname>Barreiro</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz</surname>
          </string-name>
          , J.:
          <article-title>Use of contour signatures and classification methods to optimize the tool life in metal machining</article-title>
          .
          <source>Estonian Journal of Engineering</source>
          <volume>1</volume>
          (
          <year>2009</year>
          )
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Attneave</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Some informational aspects of visual perception</article-title>
          .
          <source>Psychological Review</source>
          <volume>61</volume>
          (
          <issue>3</issue>
          ) (
          <year>1954</year>
          )
          <fpage>183</fpage>
          -
          <lpage>193</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Berry</surname>
            ,
            <given-names>M.J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Linoff</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 3rd edn</article-title>
          . John Wiley &amp; Sons Inc. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bradski</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaehler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning OpenCV - computer vision with the OpenCV library: software that sees</article-title>
          .
          <source>O'Reilly</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olshen</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stone</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          : Classification and
          <string-name>
            <given-names>Regression</given-names>
            <surname>Trees</surname>
          </string-name>
          .
          <source>The Wadsworth Statistics/Probability Series. Wadsworth Publishing</source>
          (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Campbell</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flynn</surname>
            ,
            <given-names>P.J.:</given-names>
          </string-name>
          <article-title>A survey of free-form object representation and recognition techniques</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          <volume>81</volume>
          (
          <issue>2</issue>
          ) (
          <year>2001</year>
          )
          <fpage>166</fpage>
          -
          <lpage>210</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Foley</surname>
          </string-name>
          , J.D., van
          <string-name>
            <surname>Dam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feiner</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          :
          <source>Computer Graphics: Principles and Practice. 2nd edn. The Systems Programming Series. Addison-Wesley Publishing Company</source>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gilat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>MATLAB: An Introduction with Applications. 2nd edn</article-title>
          . John Wiley &amp; Sons (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</article-title>
          .
          <source>2nd edn</source>
          . Springer Series in Statistics. Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Ja¨hne, B.:
          <source>Digital Image Processing. 6th revised and extended edn</source>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
          </string-name>
          , J.:
          <article-title>Semi-local affine parts for object recognition</article-title>
          .
          <source>In: Proceedings of the British Machine Vision Conference</source>
          . Volume
          <volume>2</volume>
          . (
          <year>2004</year>
          )
          <fpage>959</fpage>
          -
          <lpage>968</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>Computer Vision</source>
          , IEEE International Conference on
          <volume>2</volume>
          (
          <year>1999</year>
          )
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Schmidsberger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stolzenburg</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Object recognition with multicopters</article-title>
          . In W¨olfl, S., ed.:
          <source>Poster and Demo Track of the 35th German Conference on Artificial Intelligence (KI-2012)</source>
          , Saarbru¨cken (
          <year>2012</year>
          )
          <fpage>83</fpage>
          -
          <lpage>87</lpage>
          Obtained Best Poster and Demo Presentation Award.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Schuldt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Shape retrieval with qualitative relations: The influence of part-order and approximation precision on retrieval performance and computational effort</article-title>
          . In Bach, J.,
          <string-name>
            <surname>Edelkamp</surname>
          </string-name>
          , S., eds.
          <source>: KI 2011: Advances in Artificial Intelligence - Proceedings of the 34th Annual German Conference on Artificial Intelligence. LNAI 7006</source>
          , Berlin, Springer (
          <year>2011</year>
          )
          <fpage>301</fpage>
          -
          <lpage>312</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Shuang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Shape representation and retrieval using distance histograms</article-title>
          .
          <source>Technical report</source>
          , Dept. of Computing Science, University of Alberta (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Stolzenburg</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Localization, exploration, and navigation based on qualitative angle information</article-title>
          .
          <source>Spatial Cognition and Computation: An Interdisciplinary Journal</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          ) (
          <year>2010</year>
          )
          <fpage>28</fpage>
          -
          <lpage>52</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. Wikipedia: Projektionsmatrix - Wikipedia:
          <article-title>Die freie Enzyklop¨adie (</article-title>
          <year>2013</year>
          )
          <article-title>[Online; Stand 14</article-title>
          .
          <year>August 2013</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freksa</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Qualitative spatial reasoning using orientation, distance, and path knowledge</article-title>
          .
          <source>Applied Intelligence</source>
          <volume>6</volume>
          (
          <year>1996</year>
          )
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>