<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Software Architecture for Object Perception and Semantic Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Buoncompagni</string-name>
          <email>luca.buoncompagni@edu.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fulvio Mastrogiovanni</string-name>
          <email>fulvio.mastrogiovanni@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Genoa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the near future, robots are expected to exhibit advanced capabilities when interacting with humans. In order to purposely understand humans and frame their requests in the right context, one of the major requirements for robot design is a knowledge representation structure able to provide sensory data with a proper semantic description. This paper describes a software architecture aimed at detecting the geometrical properties of a scene using an RGB-D sensor, and then at categorising the objects within it so as to associate them with a proper semantic annotation. Preliminary experiments are reported using a Baxter robot endowed with a Kinect RGB-D sensor.</p>
      </abstract>
      <kwd-group>
        <kwd>Perception</kwd>
        <kwd>Semantic knowledge</kwd>
        <kwd>RGB-D sensor</kwd>
        <kwd>Software architecture</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Advanced human-robot interaction processes in everyday environments are
expected to pose a number of challenges to robot design, specifically as far as
perception, knowledge representation and action are concerned. Examples where
advanced capabilities in robot cognition play a central role include robot
companions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and robot co-workers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], just to name a few.
      </p>
      <p>
        It is expected that a major role in robot cognitive capabilities for
human-robot interaction will be played by a tight connection between robot perception
processes and their semantic representation. The latter is expected to provide
robot percepts with explicit contextual knowledge, which is implicitly assumed
to be present when two humans interact and, after an adaptation process, reach
a so-called mutual understanding state [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>In the long term, we consider human-robot interaction processes where a
robot and a human share a workspace, and have to interact (physically, verbally
or by means of gestures) in order to perform a certain joint operation. Examples
of these processes include object handovers, joint manufacturing tasks, or typical
household activities. In this paper, we focus on robot perception capabilities: a
robot is able to detect and track the objects present in the shared workspace and,
if they belong to known categories, to provide them with semantic meaning. To
this aim, the proposed software architecture provides two main functionalities:
      </p>
      <p>[Figure 1: the proposed pipeline. The Kinect [A] acquires RGB-D raw data from the environment. Preprocessing [B] (1. downsampling; 2. depth filtering; 3. arm filtering) produces a filtered point cloud. Clusterisation [C] (4. supports segmentation; 5. objects segmentation) yields the objects' point clouds. Tracking [D] (6. centroids evaluation; memory management) maintains object positions. Shape detection [E] (7. plane detection; 8. sphere detection; 9. cone detection; 10. cylinder detection; 11. shape evaluator) outputs the geometric object description.]</p>
      <p>
        – Clustering, tracking and categorisation. Starting from RGB-D data, the scene
is processed to detect individual clusters in the point cloud. The position
of each cluster, independently of its shape (in terms of the configuration
of constituent points) is tracked over time. If a cluster can be mapped to a
known basic geometric class (i.e., plane, cone, cylinder or sphere, a cube being
given by six planes in a specific configuration), it is labelled accordingly
using the Random Sample Consensus (RANSAC) algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
– Semantic description. When object categories are determined, an ontology is
dynamically updated to associate objects in the workspace (i.e., instances)
with their semantic description (i.e., concepts). To this purpose, Description
Logics (DLs) are used [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Once objects are classified, it is possible to tune
robot behaviours accordingly, for example associating them with specific
grasping behaviours, grounding them with proper verbal tags, using them in
action planning processes.
      </p>
      <p>The paper is organised as follows: Section 2 describes the proposed
architecture; Section 3 discusses preliminary results; Conclusions follow.
</p>
    </sec>
    <sec id="sec-2">
      <title>System’s Architecture</title>
      <p>
        The proposed software architecture is based on the computational design
pattern [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], i.e., a sequence of basic computational steps carried out
according to a pipeline structure (Figure 1). First, a raw point cloud is acquired
by the Kinect component ([A] in the Figure) and, for each scan, depth data
are preprocessed ([B]). Then, the Clusterisation component ([C]) detects object
supports (e.g., tables or shelves), and segments the objects above them, by
generating a point cluster for each object. Such a mid-level representation is used
by the Tracking component ([D]), which compares each cluster in the current
point cloud with clusters already present in memory, updates their position and,
if a new cluster is detected, registers it in memory. Finally, the Shape detection
component ([E]) provides an estimate of the object basic primitive shape, as well
as its parameters.
      </p>
      <preformat>
Algorithm 1: The Tracking component

Input: a vector C of n clusters belonging to the same support; a vector T of m
    tracked clusters; a matrix D of n × m distances between current and tracked
    clusters; a vector U of m counters storing for how many scans each tracked
    cluster has not been updated; a vector F of m Boolean flags marking which
    tracked clusters have been updated.
Parameters: the radius ϵ ∈ ℝ and the threshold τ ∈ ℕ.

 1  for each f_k ∈ F, f_k ← false
 2  foreach c_i ∈ C do
 3      for each i = 1,…,n and j = 1,…,m, D_{i,j} ← ∞
 4      foreach t_j ∈ T do
 5          d_{i,j} ← dist(c_i, t_j)
 6          if d_{i,j} &lt; ϵ then
 7              D_{i,j} ← d_{i,j}
 8      if ∄ D_{i,j} such that D_{i,j} ≠ ∞ then
 9          create t_k ∈ T using c_i
10          add and initialise u_k ∈ U such that u_k ← 0, f_k ∈ F such that f_k ← false
11      else
12          d_o = d_{i,j} ← argmin_{i,j}(D)
13          if f_o = false then
14              update centroid and point cloud of t_o using a weighted average of c_i and t_j
15              f_o ← true
16              u_o ← 0
17  foreach f_k ∈ F such that f_k = false do
18      u_k ← u_k + 1
19      if u_k &gt; τ then
20          delete t_k ∈ T, u_k ∈ U, f_k ∈ F
      </preformat>
      <p>
        The Preprocessing component performs a sequence of steps.
First, a downsampling step is carried out to decrease the number of points in the
point cloud provided by the RGB-D sensor [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Then, vectors normal to all the
surfaces are computed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The data is limited to all the points belonging to a
semi-sphere around the Kinect-centred reference frame, which allows for focusing
on a particular area of the workspace thereby drastically improving the overall
system performance. Finally, the point cloud is also filtered to remove the points
related to the robot’s arms. This is done by enveloping the links related to robot
arms in bounding boxes, and checking for a point-in-parallelepiped inclusion.
      </p>
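      <p>As an illustration, the first two preprocessing steps can be sketched with a few PCL
calls. The following fragment is a minimal sketch rather than the actual implementation:
the leaf size, the hemisphere radius and the assumption of a Kinect-centred frame with
the z axis pointing forward are ours.</p>
      <preformat><![CDATA[
// Sketch of preprocessing steps 1) and 2): voxel-grid downsampling followed
// by a hemisphere depth filter. Parameter values are illustrative only.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>

using Cloud = pcl::PointCloud<pcl::PointXYZ>;

Cloud::Ptr preprocess(const Cloud::Ptr& raw)
{
  // 1) Downsampling: replace the points in each 1 cm^3 voxel with their centroid.
  Cloud::Ptr down(new Cloud);
  pcl::VoxelGrid<pcl::PointXYZ> grid;
  grid.setInputCloud(raw);
  grid.setLeafSize(0.01f, 0.01f, 0.01f);
  grid.filter(*down);

  // 2) Depth filtering: keep only the points inside a hemisphere of radius
  //    r_max in front of the sensor, to focus on the relevant workspace area.
  const float r_max = 1.5f;  // metres, illustrative
  Cloud::Ptr out(new Cloud);
  for (const auto& p : down->points)
    if (p.z > 0.0f && p.x * p.x + p.y * p.y + p.z * p.z < r_max * r_max)
      out->points.push_back(p);
  out->width = static_cast<uint32_t>(out->points.size());
  out->height = 1;
  return out;
}
]]></preformat>
      <p>Arm filtering (step 3) would follow the same pattern, testing each point against the
bounding boxes of the robot links obtained from the kinematic model.</p>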
      <p>
        The Clusterisation component recursively applies RANSAC to find all the
horizontal planes in the scene. This information is used to determine the points
belonging to the objects located on planes (i.e., acting as supports). Finally, a
Euclidean clustering algorithm is applied to segment the objects in the scene [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
As a result, the component generates for each support a set of clusters related
to objects located above it. Each cluster i is represented by its visual centroid
c_v^i = (x_v^i, y_v^i, z_v^i), computed as the mean of all the points in i.
      </p>
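      <p>A possible PCL-based sketch of this component is reported below (reusing the Cloud
alias from the preprocessing sketch). For brevity it extracts a single horizontal support,
whereas the component described above applies RANSAC recursively to find all supports;
every threshold value is an illustrative assumption.</p>
      <preformat><![CDATA[
// Sketch of the Clusterisation component: RANSAC support extraction,
// removal of the support points, Euclidean clustering, centroid computation.
#include <pcl/ModelCoefficients.h>
#include <pcl/common/centroid.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/search/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/segmentation/sac_segmentation.h>

std::vector<pcl::PointIndices> clusterObjects(const Cloud::Ptr& scene)
{
  // 4) Supports segmentation: find a horizontal plane (normal parallel to z).
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setModelType(pcl::SACMODEL_PERPENDICULAR_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setAxis(Eigen::Vector3f::UnitZ());  // assumes a z-up reference frame
  seg.setEpsAngle(0.1);                   // ~6 degrees of tolerance
  seg.setDistanceThreshold(0.01);
  pcl::PointIndices::Ptr support(new pcl::PointIndices);
  pcl::ModelCoefficients::Ptr coeffs(new pcl::ModelCoefficients);
  seg.setInputCloud(scene);
  seg.segment(*support, *coeffs);

  // Remove the support points, keeping the objects located above it.
  Cloud::Ptr objects(new Cloud);
  pcl::ExtractIndices<pcl::PointXYZ> extract;
  extract.setInputCloud(scene);
  extract.setIndices(support);
  extract.setNegative(true);
  extract.filter(*objects);

  // 5) Objects segmentation: one Euclidean cluster per object.
  pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
  tree->setInputCloud(objects);
  pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
  ec.setClusterTolerance(0.02);  // 2 cm
  ec.setMinClusterSize(100);
  ec.setSearchMethod(tree);
  ec.setInputCloud(objects);
  std::vector<pcl::PointIndices> clusters;
  ec.extract(clusters);

  // The visual centroid c_v^i of each cluster is the mean of its points.
  for (const auto& c : clusters) {
    Eigen::Vector4f centroid;
    pcl::compute3DCentroid(*objects, c, centroid);
  }
  return clusters;
}
]]></preformat>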
      <p>
        Although many approaches to obtain a robust tracking are available (see the
work in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the references therein), in this preliminary work we adopted
a simple geometrical approach. Our aim is to obtain and evaluate a hybrid
geometric/symbolic representation of objects. Previously detected objects are
stored in memory, specifically using their visual centroid and an associated list of
cloud points. Our current implementation of the Tracking component is depicted
in Algorithm 1. After an initialisation phase (lines 1-10), first an association
between current and tracked clusters is performed (lines 11-16), then old clusters
are removed (lines 17-20). Given two clusters i and j detected at time instants
t_1 and t_2, we refer to their visual centroids as c_v^i(t_1) and c_v^j(t_2). We assume that
j is an updated representation of i if c_v^j(t_2) is located within a sphere of radius ϵ
centred on c_v^i(t_1). A tracked cluster is removed from memory if it is not updated
for τ consecutive scans.
      </p>
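      <p>The following self-contained C++ sketch mirrors Algorithm 1; the Track structure
and the greedy per-cluster association are illustrative simplifications of the matrix-based
formulation above, in which the argmin is taken over the whole matrix D.</p>
      <preformat><![CDATA[
// Sketch of the Tracking component (cf. Algorithm 1). For each current
// centroid, the closest tracked cluster within radius eps is updated;
// unmatched centroids start new tracks; stale tracks are deleted.
#include <array>
#include <cmath>
#include <limits>
#include <vector>

struct Track {
  float x, y, z;         // tracked visual centroid
  int   notUpdated = 0;  // u_k: scans since the last update
  bool  updated = false; // f_k
};

void track(std::vector<Track>& tracks,
           const std::vector<std::array<float, 3>>& centroids,
           float eps, int tau)
{
  for (auto& t : tracks) t.updated = false;               // line 1
  for (const auto& c : centroids) {                       // lines 2-7
    int best = -1;
    float bestD = std::numeric_limits<float>::infinity();
    for (std::size_t j = 0; j < tracks.size(); ++j) {
      const float d = std::hypot(std::hypot(tracks[j].x - c[0],
                                            tracks[j].y - c[1]),
                                 tracks[j].z - c[2]);
      if (d < eps && d < bestD) { bestD = d; best = static_cast<int>(j); }
    }
    if (best < 0) {                                       // lines 8-10: new track
      tracks.push_back({c[0], c[1], c[2]});
    } else if (!tracks[best].updated) {                   // lines 11-16: update
      Track& t = tracks[best];
      t.x = 0.5f * (t.x + c[0]);  // weighted average (equal weights here)
      t.y = 0.5f * (t.y + c[1]);
      t.z = 0.5f * (t.z + c[2]);
      t.updated = true;
      t.notUpdated = 0;
    }
  }
  for (auto it = tracks.begin(); it != tracks.end();) {   // lines 17-20: cleanup
    if (!it->updated && ++it->notUpdated > tau) it = tracks.erase(it);
    else ++it;
  }
}
]]></preformat>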
      <p>Finally, the Shape detection component associates each cluster with a possible
primitive shape (i.e., plane, cone, cylinder or sphere, a cube being given by six
planes in a specific configuration), as well as its geometrical coefficients. To this
aim, we employ RANSAC to find the best fitting shape based on the relative
number of points belonging to those primitives. Once a cluster i is associated with
a primitive shape, its representation can be augmented with a shape tag (i.e., a
category), its coefficients (e.g., the axis for cones or the radius for cylinders), and
the geometrical centroid c_g^i, which is computed using the primitive shape rather
than the point cloud. It is noteworthy that, in principle, c_g is more accurate
than c_v, since it considers not only the visible part of the object but also its full
reconstruction provided by RANSAC.</p>
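      <p>A sketch of the shape evaluator is given below: each candidate primitive is fitted
with RANSAC and the model with the largest relative number of inliers wins. The
normal-based model variants are used because the cylinder and cone models require
surface normals; the thresholds are again illustrative assumptions.</p>
      <preformat><![CDATA[
// Sketch of the Shape detection component: fit plane, sphere, cylinder and
// cone models with RANSAC, return the SACMODEL_* id fitting the cluster best.
#include <pcl/features/normal_3d.h>
#include <pcl/segmentation/sac_segmentation.h>

int detectPrimitive(const Cloud::Ptr& cluster)
{
  // Surface normals are needed by the normal-based sample consensus models.
  pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
  pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
  ne.setInputCloud(cluster);
  ne.setKSearch(20);
  ne.compute(*normals);

  const int models[] = {pcl::SACMODEL_NORMAL_PLANE, pcl::SACMODEL_NORMAL_SPHERE,
                        pcl::SACMODEL_CYLINDER, pcl::SACMODEL_CONE};
  int best = -1;
  std::size_t bestInliers = 0;
  for (int model : models) {
    pcl::SACSegmentationFromNormals<pcl::PointXYZ, pcl::Normal> seg;
    seg.setModelType(model);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.01);   // illustrative tolerance
    seg.setNormalDistanceWeight(0.1);
    seg.setInputCloud(cluster);
    seg.setInputNormals(normals);
    pcl::PointIndices inliers;
    pcl::ModelCoefficients coeffs;    // axis, radius, ... of the fitted shape
    seg.segment(inliers, coeffs);
    if (inliers.indices.size() > bestInliers) {
      bestInliers = inliers.indices.size();
      best = model;
    }
  }
  // The winning model's coefficients would yield c_g and the shape parameters.
  return best;
}
]]></preformat>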
      <p>
        Currently, knowledge about primitive shapes is maintained within an
OWL-based ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where all the geometric object properties can be described.
Two classes are used to model objects, namely VisualObj and GeomObj. The
former models objects in the form of clusters, whereas the latter represents the
associated primitive shape. GeomObj has a number of disjoint subclasses related
to primitive shapes, including SphereObj, ConeObj, CylinderObj, and PlaneObj.
Two data properties are used to describe visual and geometric centroids, namely
hasVisualCen and hasGeomCen, as well as properties to describe shape-specific
coefficients, e.g., a SphereObj has a radius specified using hasGeomRadius. As a
consequence, a description corresponding to a cluster i is an instance of VisualObj
if its property hasVisualCen contains a valid visual centroid c_v^i, and it does not
contain any valid description related to hasGeomCen. In formulas: VisualObj ⊑
∃hasVisualCen.Centroid ⊓ ¬∃hasGeomCen.Centroid. A similar description holds for
GeomObj.
      </p>
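      <p>For illustration, companion axioms for the geometric classes, which the text implies
but does not state explicitly, could read as follows (the Radius range is our assumption):</p>
      <preformat>
GeomObj   ⊑ ∃hasGeomCen.Centroid
SphereObj ⊑ GeomObj ⊓ ∃hasGeomRadius.Radius
SphereObj ⊓ ConeObj ⊑ ⊥   (and similarly for the other disjoint shape subclasses)
      </preformat>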
    </sec>
    <sec id="sec-3">
      <title>Preliminary Results</title>
      <p>
        Two experiments have been set up: the first is aimed at evaluating the performance
of the architecture in static conditions, the second involves a set-up with a Baxter
robot. The system has been implemented in ROS and the Point Cloud Library
(PCL) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The first experiment is aimed at estimating the errors in shape detection in
a static environment with multiple supports (Figure 2). Acquisition is performed 500
times for every shape. Results are reported in the confusion matrix shown in
Table 1. It is possible to see that the system's performance is reliable, specifically for
planes and spheres. Slightly lower recognition scores are obtained for cones and
cylinders.</p>
      <p>[Figure 4: plots of the visual centroid and of the geometric centroid over time.]</p>
      <p>The second experiment focuses on the performance of the tracker as well as
the visual and geometrical centroids (Figure 3 on the left-hand side). A cone
has been fixed to the Baxter's end-effector through a wire, in order to mimic a
pendulum-like behaviour. When the robot moves the arm, the cone oscillates.
The wire's length and the cone's mass are unknown to the robot. Figure 3 on
the right-hand side shows the tracked point cloud. It is noteworthy that the
cluster does not represent the object completely, which affects the visual centroid,
whereas the geometrical representation of the object allows for computing a more
accurate centroid. Figure 4 shows the tracking of the two centroids. Intuitively,
it can be noticed that the variance associated with c_v is higher than that of c_g,
since the visible part of the object changes while the object oscillates. Moreover,
it can be observed that there is an offset between the two plots, due to the
geometric properties of the real object and its visible part.
</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>This paper describes an architecture to model and track a few (both geometrical
and semantic) properties of objects located on a table. The system is still
work in progress. On the one hand, we are interested in exploring the possibility
of using symbolic-level information to model high-level object features, such as
affordances. On the other hand, we believe that the interplay between the two
representation levels can be exploited to increase the overall system's capabilities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          :
          <article-title>The Description Logic handbook: theory, implementation, and applications</article-title>
          . Cambridge University Press, Cambridge, UK (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brugali</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Software Engineering for Experimental Robotics</article-title>
          . Springer, Heidelberg, Germany (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Endres</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sturm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgard</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>3-D mapping with an RGB-D camera</article-title>
          .
          <source>IEEE Transactions on Robotics</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>177</fpage>
          -
          <lpage>187</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Haddadin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suppa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuchs</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenmuller</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albu-Schäffer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirzinger</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Towards the robotic co-worker</article-title>
          .
          <source>In: Proceedings of the 2009 International Symposium on Robotics Research (ISRR</source>
          <year>2009</year>
          ). Lucerne, Switzerland (
          <year>September 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>OWL web ontology language overview</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Semantic 3D object maps for everyday robot manipulation</article-title>
          . Springer, Heidelberg, Germany (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cousins</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>3D is here: Point Cloud Library (PCL)</article-title>
          .
          <source>In: Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA</source>
          <year>2011</year>
          ). Shanghai, China (May
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Schnabel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wahl</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Efficient RANSAC for point-cloud shape detection</article-title>
          .
          <source>Computer Graphics Forum</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ),
          <fpage>214</fpage>
          -
          <lpage>226</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Walters</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Syrdal</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dautenhahn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>te Boekhorst</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koay</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          :
          <article-title>Avoiding the uncanny valley: robot appearance, personality and consistency of behaviours in an attention-seeking home scenario for a robot companion</article-title>
          .
          <source>Autonomous Robots</source>
          <volume>24</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          -
          <lpage>178</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Whelan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaess</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fallon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johannsson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leonard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Kintinuous: spatially extended KinectFusion</article-title>
          .
          <source>In: Proceedings of the 2012 RSS Workshop</source>
          on RGB-D:
          <article-title>Advanced Reasoning with Depth Cameras</article-title>
          . Sydney, Australia
          (
          <year>July 2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zlotowski</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sumioka</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nishio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glas</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartneck</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishiguro</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Persistence of the uncanny valley: the influence of repeated interactions and a robot's attitude on its perception</article-title>
          .
          <source>Frontiers in Psychology</source>
          <volume>6</volume>
          ,
          <fpage>883</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>