<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Software Architecture for Object Perception and Semantic Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Buoncompagni</string-name>
          <email>luca.buoncompagni@edu.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fulvio Mastrogiovanni</string-name>
          <email>fulvio.mastrogiovanni@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Genoa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the near future, robots are expected to exhibit advanced capabilities when interacting with humans. In order to purposely understand humans and frame their requests in the right context, one of the major requirements for robot design is a knowledge representation structure able to provide sensory data with a proper semantic description. This paper describes a software architecture aimed at detecting the geometrical properties of a scene using an RGB-D sensor, and then at categorising the objects within it so as to associate them with a proper semantic annotation. Preliminary experiments are reported using a Baxter robot endowed with a Kinect RGB-D sensor.</p>
      </abstract>
      <kwd-group>
        <kwd>Perception</kwd>
        <kwd>Semantic knowledge</kwd>
        <kwd>RGB-D sensor</kwd>
        <kwd>Software architecture</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Advanced human-robot interaction processes in everyday environments are
expected to pose a number of challenges to robot design, specifically as far as
perception, knowledge representation and action are concerned. Examples where
advanced capabilities in robot cognition play a central role include robot
companions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and robot co-workers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], just to name a few.
      </p>
      <p>
        It is expected that a major role in robot cognitive capabilities for
human-robot interaction will be played by a tight connection between robot perception
processes and their semantic representation. The latter is expected to provide
robot percepts with explicit contextual knowledge, which is implicitly assumed
to be present when two humans interact and, after an adaptation process, reach
a so-called mutual understanding state [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>In the long term, we consider human-robot interaction processes where a
robot and a human share a workspace, and have to interact (physically, verbally
or by means of gestures) in order to perform a certain joint operation. Examples
of these processes include object handovers, joint manufacturing tasks, or typical
household activities. In this paper, we focus on robot perception capabilities: a
robot is able to detect and track the objects present in the shared workspace and,
if they belong to known categories, to provide them with semantic meaning. To
this aim, the proposed software architecture provides two main functionalities:
      </p>
      <p>[Figure 1: the proposed pipeline. The Kinect [A] acquires RGB-D raw data from the environment. Preprocessing [B] (1. downsampling; 2. depth filtering; 3. arm filtering) produces a filtered point cloud. Clusterisation [C] (4. supports segmentation; 5. objects segmentation) yields the objects' point clouds. Tracking [D] (6. centroids evaluation; memory management) maintains object positions. Shape detection [E] (7. plane detection; 8. sphere detection; 9. cone detection; 10. cylinder detection; 11. shape evaluator) outputs the geometric object description.]</p>
      <p>
        – Clustering, tracking and categorisation. Starting from RGB-D data, the scene
is processed to detect individual clusters in the point cloud. The position
of each cluster, independently of its shape (in terms of the configuration
of constituent points) is tracked over time. If a cluster can be mapped to a
known basic geometric class (i.e., plane, cone, cylinder or sphere, a cube being
given by six planes in a specific configuration), it is labelled accordingly
using the Random Sample Consensus (RANSAC) algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
– Semantic description. When object categories are determined, an ontology is
dynamically updated to associate objects in the workspace (i.e., instances)
with their semantic description (i.e., concepts). To this purpose, Description
Logics (DLs) are used [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Once objects are classified, it is possible to tune
robot behaviours accordingly, for example associating them with specific
grasping behaviours, grounding them with proper verbal tags, using them in
action planning processes.
      </p>
      <p>The paper is organised as follows: Section 2 describes the proposed
architecture; Section 3 discusses preliminary results; Conclusions follow.
</p>
    </sec>
    <sec id="sec-2">
      <title>System’s Architecture</title>
      <p>
        The proposed software architecture is based on the computational design
pattern [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], i.e., a sequence of basic computational steps carried out
according to a pipeline structure (Figure 1). First, a raw point cloud is acquired
by the Kinect component ([A] in the Figure) and, for each scan, depth data
are preprocessed ([B]). Then, the Clusterisation component ([C]) detects object
supports (e.g., tables or shelves), and segments the objects above them, by
generating a point cluster for each object. Such a mid-level representation is used
by the Tracking component ([D]), which compares each cluster in the current
point cloud with clusters already present in memory, updates their position and,
if a new cluster is detected, registers it in memory. Finally, the Shape detection
component ([E]) provides an estimate of the object basic primitive shape, as well
as its parameters.
      </p>
      <preformat>
Algorithm 1: The Tracking component

Input: a vector C of n clusters belonging to the same support; a vector T of m
    tracked clusters; a matrix D of n × m distances between current and tracked
    clusters; a vector U of m counters storing for how many scans each tracked
    cluster has not been updated; a vector F of m Boolean flags marking which
    tracked clusters have been updated.
Parameters: the radius ϵ ∈ ℝ and the threshold τ ∈ ℕ.

 1  for each f_k ∈ F, f_k ← false
 2  foreach c_i ∈ C do
 3      for each i = 1,…,n and j = 1,…,m, D_{i,j} ← ∞
 4      foreach t_j ∈ T do
 5          d_{i,j} ← dist(c_i, t_j)
 6          if d_{i,j} &lt; ϵ then
 7              D_{i,j} ← d_{i,j}
 8      if ∄ D_{i,j} such that D_{i,j} ≠ ∞ then
 9          create t_k ∈ T using c_i
10          add and initialise u_k ∈ U such that u_k ← 0, f_k ∈ F such that f_k ← false
11      else
12          d_o = d_{i,j} ← argmin_{i,j}(D)
13          if f_o = false then
14              update centroid and point cloud of t_o using a weighted average of c_i and t_j
15              f_o ← true
16              u_o ← 0
17  foreach f_k ∈ F such that f_k = false do
18      u_k ← u_k + 1
19      if u_k &gt; τ then
20          delete t_k ∈ T, u_k ∈ U, f_k ∈ F
      </preformat>
      <p>
        The Preprocessing component performs a sequence of steps.
First, a downsampling step is carried out to decrease the number of points in the
point cloud provided by the RGB-D sensor [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Then, vectors normal to all the
surfaces are computed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The data is limited to all the points belonging to a
semi-sphere around the Kinect-centred reference frame, which allows for focusing
on a particular area of the workspace thereby drastically improving the overall
system performance. Finally, the point cloud is also filtered to remove the points
related to the robot’s arms. This is done by enveloping the links related to robot
arms in bounding boxes, and checking for a point-in-parallelepiped inclusion.
      </p>
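      <p>As an illustration, the first two preprocessing steps can be sketched with a few PCL
calls. The following fragment is a minimal sketch rather than the actual implementation:
the leaf size, the hemisphere radius and the assumption of a Kinect-centred frame with
the z axis pointing forward are ours.</p>
      <preformat><![CDATA[
// Sketch of preprocessing steps 1) and 2): voxel-grid downsampling followed
// by a hemisphere depth filter. Parameter values are illustrative only.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>

using Cloud = pcl::PointCloud<pcl::PointXYZ>;

Cloud::Ptr preprocess(const Cloud::Ptr& raw)
{
  // 1) Downsampling: replace the points in each 1 cm^3 voxel with their centroid.
  Cloud::Ptr down(new Cloud);
  pcl::VoxelGrid<pcl::PointXYZ> grid;
  grid.setInputCloud(raw);
  grid.setLeafSize(0.01f, 0.01f, 0.01f);
  grid.filter(*down);

  // 2) Depth filtering: keep only the points inside a hemisphere of radius
  //    r_max in front of the sensor, to focus on the relevant workspace area.
  const float r_max = 1.5f;  // metres, illustrative
  Cloud::Ptr out(new Cloud);
  for (const auto& p : down->points)
    if (p.z > 0.0f && p.x * p.x + p.y * p.y + p.z * p.z < r_max * r_max)
      out->points.push_back(p);
  out->width = static_cast<uint32_t>(out->points.size());
  out->height = 1;
  return out;
}
]]></preformat>
      <p>Arm filtering (step 3) would follow the same pattern, testing each point against the
bounding boxes of the robot links obtained from the kinematic model.</p>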
      <p>
        The Clusterisation component recursively applies RANSAC to find all the
horizontal planes in the scene. This information is used to determine the points
belonging to the objects located on planes (i.e., acting as supports). Finally, a
Euclidean clustering algorithm is applied to segment the objects in the scene [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
As a result, the component generates for each support a set of clusters related
to objects located above it. Each cluster i is represented by its visual centroid
c_v^i = (x_v^i, y_v^i, z_v^i), computed as the mean of all the points in i.
      </p>
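      <p>A possible PCL-based sketch of this component is reported below (reusing the Cloud
alias from the preprocessing sketch). For brevity it extracts a single horizontal support,
whereas the component described above applies RANSAC recursively to find all supports;
every threshold value is an illustrative assumption.</p>
      <preformat><![CDATA[
// Sketch of the Clusterisation component: RANSAC support extraction,
// removal of the support points, Euclidean clustering, centroid computation.
#include <pcl/ModelCoefficients.h>
#include <pcl/common/centroid.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/search/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/segmentation/sac_segmentation.h>

std::vector<pcl::PointIndices> clusterObjects(const Cloud::Ptr& scene)
{
  // 4) Supports segmentation: find a horizontal plane (normal parallel to z).
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setModelType(pcl::SACMODEL_PERPENDICULAR_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setAxis(Eigen::Vector3f::UnitZ());  // assumes a z-up reference frame
  seg.setEpsAngle(0.1);                   // ~6 degrees of tolerance
  seg.setDistanceThreshold(0.01);
  pcl::PointIndices::Ptr support(new pcl::PointIndices);
  pcl::ModelCoefficients::Ptr coeffs(new pcl::ModelCoefficients);
  seg.setInputCloud(scene);
  seg.segment(*support, *coeffs);

  // Remove the support points, keeping the objects located above it.
  Cloud::Ptr objects(new Cloud);
  pcl::ExtractIndices<pcl::PointXYZ> extract;
  extract.setInputCloud(scene);
  extract.setIndices(support);
  extract.setNegative(true);
  extract.filter(*objects);

  // 5) Objects segmentation: one Euclidean cluster per object.
  pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
  tree->setInputCloud(objects);
  pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
  ec.setClusterTolerance(0.02);  // 2 cm
  ec.setMinClusterSize(100);
  ec.setSearchMethod(tree);
  ec.setInputCloud(objects);
  std::vector<pcl::PointIndices> clusters;
  ec.extract(clusters);

  // The visual centroid c_v^i of each cluster is the mean of its points.
  for (const auto& c : clusters) {
    Eigen::Vector4f centroid;
    pcl::compute3DCentroid(*objects, c, centroid);
  }
  return clusters;
}
]]></preformat>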
      <p>
        Although many approaches to obtain a robust tracking are available (see the
work in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the references therein), in this preliminary work we adopted
a simple geometrical approach. Our aim is to obtain and evaluate a hybrid
geometric/symbolic representation of objects. Previously detected objects are
stored in memory, specifically using their visual centroid and an associated list of
cloud points. Our current implementation of the Tracking component is depicted
in Algorithm 1. After an initialisation phase (lines 1-10), first an association
between current and tracked clusters is performed (lines 11-16), then old clusters
are removed (lines 17-20). Given two clusters i and j detected at time instants
t_1 and t_2, we refer to their visual centroids as c_v^i(t_1) and c_v^j(t_2). We assume that
j is an updated representation of i if c_v^j(t_2) is located within a sphere of radius ϵ
centred on c_v^i(t_1). A tracked cluster is removed from memory if it is not updated
for τ consecutive scans.
      </p>
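      <p>The following self-contained C++ sketch mirrors Algorithm 1; the Track structure
and the greedy per-cluster association are illustrative simplifications of the matrix-based
formulation above, in which the argmin is taken over the whole matrix D.</p>
      <preformat><![CDATA[
// Sketch of the Tracking component (cf. Algorithm 1). For each current
// centroid, the closest tracked cluster within radius eps is updated;
// unmatched centroids start new tracks; stale tracks are deleted.
#include <array>
#include <cmath>
#include <limits>
#include <vector>

struct Track {
  float x, y, z;         // tracked visual centroid
  int   notUpdated = 0;  // u_k: scans since the last update
  bool  updated = false; // f_k
};

void track(std::vector<Track>& tracks,
           const std::vector<std::array<float, 3>>& centroids,
           float eps, int tau)
{
  for (auto& t : tracks) t.updated = false;               // line 1
  for (const auto& c : centroids) {                       // lines 2-7
    int best = -1;
    float bestD = std::numeric_limits<float>::infinity();
    for (std::size_t j = 0; j < tracks.size(); ++j) {
      const float d = std::hypot(std::hypot(tracks[j].x - c[0],
                                            tracks[j].y - c[1]),
                                 tracks[j].z - c[2]);
      if (d < eps && d < bestD) { bestD = d; best = static_cast<int>(j); }
    }
    if (best < 0) {                                       // lines 8-10: new track
      tracks.push_back({c[0], c[1], c[2]});
    } else if (!tracks[best].updated) {                   // lines 11-16: update
      Track& t = tracks[best];
      t.x = 0.5f * (t.x + c[0]);  // weighted average (equal weights here)
      t.y = 0.5f * (t.y + c[1]);
      t.z = 0.5f * (t.z + c[2]);
      t.updated = true;
      t.notUpdated = 0;
    }
  }
  for (auto it = tracks.begin(); it != tracks.end();) {   // lines 17-20: cleanup
    if (!it->updated && ++it->notUpdated > tau) it = tracks.erase(it);
    else ++it;
  }
}
]]></preformat>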
      <p>Finally, the Shape detection component associates each cluster with a possible
primitive shape (i.e., plane, cone, cylinder or sphere, a cube being given by six
planes in a specific configuration), as well as its geometrical coefficients. To this
aim, we employ RANSAC to find the best fitting shape based on the relative
number of points belonging to those primitives. Once a cluster i is associated with
a primitive shape, its representation can be augmented with a shape tag (i.e., a
category), its coefficients (e.g., the axis for cones or the radius for cylinders), and
the geometrical centroid c_g^i, which is computed using the primitive shape rather
than the point cloud. It is noteworthy that, in principle, c_g is more accurate
than c_v, since it considers not only the visible part of the object but also its full
reconstruction provided by RANSAC.</p>
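      <p>A sketch of the shape evaluator is given below: each candidate primitive is fitted
with RANSAC and the model with the largest relative number of inliers wins. The
normal-based model variants are used because the cylinder and cone models require
surface normals; the thresholds are again illustrative assumptions.</p>
      <preformat><![CDATA[
// Sketch of the Shape detection component: fit plane, sphere, cylinder and
// cone models with RANSAC, return the SACMODEL_* id fitting the cluster best.
#include <pcl/features/normal_3d.h>
#include <pcl/segmentation/sac_segmentation.h>

int detectPrimitive(const Cloud::Ptr& cluster)
{
  // Surface normals are needed by the normal-based sample consensus models.
  pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
  pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
  ne.setInputCloud(cluster);
  ne.setKSearch(20);
  ne.compute(*normals);

  const int models[] = {pcl::SACMODEL_NORMAL_PLANE, pcl::SACMODEL_NORMAL_SPHERE,
                        pcl::SACMODEL_CYLINDER, pcl::SACMODEL_CONE};
  int best = -1;
  std::size_t bestInliers = 0;
  for (int model : models) {
    pcl::SACSegmentationFromNormals<pcl::PointXYZ, pcl::Normal> seg;
    seg.setModelType(model);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.01);   // illustrative tolerance
    seg.setNormalDistanceWeight(0.1);
    seg.setInputCloud(cluster);
    seg.setInputNormals(normals);
    pcl::PointIndices inliers;
    pcl::ModelCoefficients coeffs;    // axis, radius, ... of the fitted shape
    seg.segment(inliers, coeffs);
    if (inliers.indices.size() > bestInliers) {
      bestInliers = inliers.indices.size();
      best = model;
    }
  }
  // The winning model's coefficients would yield c_g and the shape parameters.
  return best;
}
]]></preformat>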
      <p>
        Currently, knowledge about primitive shapes is maintained within an
OWL-based ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where all the geometric object properties can be described.
Two classes are used to model objects, namely VisualObj and GeomObj. The
former models objects in the form of clusters, whereas the latter represents the
associated primitive shape. GeomObj has a number of disjoint subclasses related
to primitive shapes, including SphereObj, ConeObj, CylinderObj, and PlaneObj.
Two data properties are used to describe visual and geometric centroids, namely
hasVisualCen and hasGeomCen, as well as properties to describe shape-specific
coefficients, e.g., a SphereObj has a radius specified using hasGeomRadius. As a
consequence, a description corresponding to a cluster i is an instance of VisualObj
if its property hasVisualCen contains a valid visual centroid c_v^i, and it does not
contain any valid description related to hasGeomCen. In formulas: VisualObj ⊑
∃hasVisualCen.Centroid ⊓ ¬∃hasGeomCen.Centroid. A similar description holds for
GeomObj.
      </p>
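      <p>For illustration, companion axioms for the geometric classes, which the text implies
but does not state explicitly, could read as follows (the Radius range is our assumption):</p>
      <preformat>
GeomObj   ⊑ ∃hasGeomCen.Centroid
SphereObj ⊑ GeomObj ⊓ ∃hasGeomRadius.Radius
SphereObj ⊓ ConeObj ⊑ ⊥   (and similarly for the other disjoint shape subclasses)
      </preformat>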
    </sec>
    <sec id="sec-3">
      <title>Preliminary Results</title>
      <p>
        Two experiments have been set up: the first is aimed at evaluating the performance
of the architecture in static conditions, the second involves a set-up with a Baxter
robot. The system has been implemented in ROS and the Point Cloud Library
(PCL) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The first experiment is aimed at estimating the errors in shape detection in
a static environment with multiple supports (Figure 2). Acquisition is performed 500
times for every shape. Results are reported in the confusion matrix shown in
Table 1. It is possible to see that the system's performance is reliable, specifically for
planes and spheres. Slightly lower recognition scores are obtained for cones and
cylinders.</p>
      <p>[Figure 4: plots of the visual centroid and of the geometric centroid over time.]</p>
      <p>The second experiment focuses on the performance of the tracker as well as
the visual and geometrical centroids (Figure 3 on the left-hand side). A cone
has been fixed to the Baxter's end-effector through a wire, in order to mimic a
pendulum-like behaviour. When the robot moves the arm, the cone oscillates.
The wire's length and the cone's mass are unknown to the robot. Figure 3 on
the right-hand side shows the tracked point cloud. It is noteworthy that the
cluster does not represent the object completely, which affects the visual centroid,
whereas the geometrical representation of the object allows for computing a more
accurate centroid. Figure 4 shows the tracking of the two centroids. Intuitively,
it can be noticed that the variance associated with c_v is higher than that of c_g,
since the visible part of the object changes while the object oscillates. Moreover,
it can be observed that there is an offset between the two plots, due to the
geometric properties of the real object and its visible part.
</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>This paper describes an architecture to model and track a few (both geometrical
and semantic) properties of objects located on a table. The system is still
work in progress. On the one hand, we are interested in exploring the possibility
of using symbolic-level information to model high-level object features, such as
affordances. On the other hand, we believe that the interplay between the two
representation levels can be exploited to increase the overall system's capabilities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          :
          <article-title>The Description Logic handbook: theory, implementation, and applications</article-title>
          . Cambridge University Press, Cambridge, UK (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brugali</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Software Engineering for Experimental Robotics</article-title>
          . Springer, Heidelberg, Germany (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Endres</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sturm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgard</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>3-D mapping with an RGB-D camera</article-title>
          .
          <source>IEEE Transactions on Robotics</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>177</fpage>
          -
          <lpage>187</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Haddadin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suppa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuchs</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenmuller</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albu-Schäffer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirzinger</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Towards the robotic co-worker</article-title>
          .
          <source>In: Proceedings of the 2009 International Symposium on Robotics Research (ISRR</source>
          <year>2009</year>
          ). Lucerne, Switzerland (
          <year>September 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>OWL web ontology language overview</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Semantic 3D object maps for everyday robot manipulation</article-title>
          . Springer, Heidelberg, Germany (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cousins</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>3D is here: Point Cloud Library (PCL)</article-title>
          .
          <source>In: Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA</source>
          <year>2011</year>
          ). Shanghai, China (May
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Schnabel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wahl</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Efficient RANSAC for point-cloud shape detection</article-title>
          .
          <source>Computer Graphics Forum</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ),
          <fpage>214</fpage>
          -
          <lpage>226</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Walters</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Syrdal</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dautenhahn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>te Boekhorst</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koay</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          :
          <article-title>Avoiding the uncanny valley: robot appearance, personality and consistency of behaviours in an attention-seeking home scenario for a robot companion</article-title>
          .
          <source>Autonomous Robots</source>
          <volume>24</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          -
          <lpage>178</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Whelan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaess</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fallon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johannsson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leonard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Kintinuous: spatially extended KinectFusion</article-title>
          .
          <source>In: Proceedings of the 2012 RSS Workshop</source>
          on RGB-D:
          <article-title>Advanced Reasoning with Depth Cameras</article-title>
          . Sydney, Australia
          (
          <year>July 2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zlotowski</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sumioka</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nishio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glas</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartneck</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishiguro</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Persistence of the uncanny valley: the influence of repeated interactions and a robot's attitude on its perception</article-title>
          .
          <source>Frontiers in Psychology</source>
          <volume>6</volume>
          ,
          <fpage>883</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>