GMM-based Spatial Change Detection from Bimanual Tracking and Point Cloud Differences

Riccardo Monica, Andrea Zinelli, and Jacopo Aleotti
Robotics and Intelligent Machines Laboratory, Department of Information Engineering, University of Parma, Italy
rmonica@ce.unipr.it, andrea.zinelli1@studenti.unipr.it, aleotti@ce.unipr.it

Abstract. Robots that detect changes in the environment can attain better context awareness and increased autonomy. In this work, a spatial change detection approach is presented which uses a single fixed depth camera to identify environment changes caused by human activities. The proposed method combines hand tracking and the difference between organized point clouds. Bimanual movements are recorded in real-time and encoded in Gaussian Mixture Models (GMMs). We show that GMMs enable change detection in the presence of occlusions. We also show that the GMM analysis narrows down potential salient regions of space where manipulation actions are carried out. Experiments have been performed in an indoor environment for object placement, object removal and object repositioning tasks.

Keywords: Gaussian Mixture Models, range sensing, human motion tracking

1 Introduction

In this work, a method for spatial change detection is presented that identifies environment changes due to human activities. The experimental setup includes a single fixed depth camera (Kinect V2). Detection of the salient regions of the environment, where human actions have been carried out, is achieved by computing the difference between two organized point clouds acquired at the beginning and at the end of each experimental session. Moreover, bimanual movements are tracked in real-time and encoded using GMMs. GMM analysis enables change detection in the presence of occlusions and reduces the number of false positives.

Spatial change detection has mainly been investigated with object-based approaches by exploiting cameras or depth sensors [2, 9, 1, 6, 8, 13, 3, 7]. Petsch and Burschka [15] presented a framework for sensor-based detection of unexpected (surprising) events, where manipulation events are detected from human observation and by placing markers on objects. In [4] GMMs were investigated for 3D data segmentation and novelty detection in the context of mobile robotics. Several authors proposed advanced approaches for segmentation of human hand trajectories [11, 12, 5, 10, 16]. In particular, in [11] Gaussian Mixture Models have been applied for automatic segmentation of full-body motion trajectories.

2 GMM-based spatial change detection

2.1 Depth image processing

The proposed approach operates on two depth frames acquired automatically from a fixed Kinect V2 when no human motion is detected by the skeletal tracker: one (frame P) when the user has not yet entered the area, and the other (frame N) after the user has left the scene. A difference depth image D_{uv} is computed (Eq. 1), given the one-to-one pixel correspondence between the two frames at coordinates (u, v). An invariant image I_{uv} is also computed as in Eq. 2, containing the pixels that do not change significantly.

D_{uv} = \begin{cases} \text{NaN} & \text{if } |N_{uv} - P_{uv}| \le T_h \\ N_{uv} & \text{if } P_{uv} - N_{uv} > T_h \\ P_{uv} & \text{if } N_{uv} - P_{uv} > T_h \end{cases} \qquad (1)

I_{uv} = \begin{cases} \text{NaN} & \text{if } |N_{uv} - P_{uv}| > T_h \\ N_{uv} & \text{otherwise} \end{cases} \qquad (2)

T_h is a threshold set according to the noise model of the sensor. According to Eq. 1, in case of a significant change between the depth values the nearest point is selected. Indeed, if the user performs an object placement task in direct sight of the camera, depth image N contains relevant information about the newly placed object, while depth image P contains information about the background. The opposite occurs if an object is removed in direct sight of the camera. Both D_{uv} and I_{uv} are converted into point clouds, called difference point cloud and invariant point cloud respectively.
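As a concrete illustration (not part of the original implementation), the per-pixel rules of Eqs. 1 and 2 can be sketched in Python/NumPy as follows. The function name and defaults are illustrative; the depth frames are assumed to be floating-point arrays in meters, and the default threshold corresponds to the T_h = 10 cm used in the experiments.

    import numpy as np

    def depth_difference_images(P, N, Th=0.10):
        # P, N: HxW depth frames (meters) acquired before and after the activity.
        # Th:   change threshold, set according to the sensor noise model.
        D = np.full_like(N, np.nan)   # difference image, Eq. 1
        I = np.full_like(N, np.nan)   # invariant image, Eq. 2
        near = (P - N) > Th           # something appeared or moved closer: keep N
        far = (N - P) > Th            # something was removed or moved away: keep P
        D[near] = N[near]
        D[far] = P[far]
        unchanged = ~(near | far)     # no significant change (invalid NaN readings
        I[unchanged] = N[unchanged]   # compare as False and also fall here)
        return D, I

The two organized images D and I can then be back-projected to 3D with the camera intrinsics to obtain the difference and invariant point clouds used in the following sections.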
2.2 Modeling of bimanual movements

Motion tracking of both hands is performed in real-time during the execution of the experiment using the Kinect V2 skeletal tracker. The trajectory of each hand is represented as a set of 3D points with a time stamp, i.e. {x_k, y_k, z_k, t_k}. The proposed approach first generates two separate GMMs, M_L and M_R, one for each 4D hand trajectory, using Expectation Maximization (EM). The number of Gaussians of M_L and M_R was chosen to minimize the BIC index [11, 14], as shown in Alg. 1.

Algorithm 1 Iterative Expectation Maximization
Input: S: set of points, with timestamp
Output: BestGMM: the GMM model
 1: MinBIC ← +∞; GC ← 1;
 2: repeat
 3:    GMM ← EM(S, GC);
 4:    BIC ← ComputeBIC(GMM, S);
 5:    if BIC < MinBIC then
 6:       MinBIC ← BIC;
 7:       BestGMM ← GMM;
 8:    end if
 9:    GC ← GC + 1;
10: until BIC > MinBIC + BICTh;

Fig. 1. Flowchart of the proposed approach for Regions of Interest (ROIs) extraction.

The two GMMs are then fused into a single GMM M. In particular, the priors w_i of the Gaussian components in M are computed as the original priors w_{L,i} and w_{R,i} in M_L and M_R, weighted by the ratio between the number of Gaussian components in M_L or M_R and the total number of Gaussians in M, i.e. w_i = w_{L,i} · |M_L| / |M| for components coming from M_L and w_i = w_{R,i} · |M_R| / |M| for components coming from M_R. The weighting factor gives more importance to long trajectories.

It turns out that isotropic (spherically symmetric) Gaussian components are likely to correspond to regions of space where salient manipulation activities are carried out. This fact can be explained by observing that the user's actions are usually performed at slow speed and involve changes of hand direction, hence more points are sampled without a dominant direction. Thus, a Gaussian saliency value δ_i is computed as follows:

\delta_i = \frac{\min_{j \in \{1,2,3\}} \sigma_{i,j}}{\max_{j \in \{1,2,3\}} \sigma_{i,j}} \qquad (3)

where σ_{i,j} is the j-th eigenvalue of the covariance matrix. A Gaussian is considered salient if its prior w_i and its saliency δ_i are both greater than their average values in M, i.e. w_i > \bar{w} ∧ δ_i > \bar{δ}.
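As a rough sketch of Alg. 1 and of the fusion and saliency test described above, the following Python fragment uses scikit-learn's GaussianMixture. The function names, the BICTh margin value, and the assumption that Eq. 3 operates on the spatial 3×3 block of the 4D covariance are illustrative choices, not taken from the paper.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def iterative_em(S, bic_th=10.0, max_components=50):
        # Alg. 1: add Gaussians one by one until the BIC clearly worsens.
        # S: Nx4 array of hand samples (x, y, z, t).
        best_gmm, min_bic = None, np.inf
        for gc in range(1, max_components + 1):
            gmm = GaussianMixture(n_components=gc).fit(S)
            bic = gmm.bic(S)
            if bic < min_bic:
                min_bic, best_gmm = bic, gmm
            if bic > min_bic + bic_th:
                break
        return best_gmm

    def fuse_models(mL, mR):
        # Merge the left- and right-hand GMMs; priors are re-weighted by the
        # fraction of components each model contributes to the fused model M.
        nL, nR = mL.n_components, mR.n_components
        n = nL + nR
        w = np.concatenate([mL.weights_ * nL / n, mR.weights_ * nR / n])
        means = np.concatenate([mL.means_, mR.means_])
        covs = np.concatenate([mL.covariances_, mR.covariances_])
        return w, means, covs

    def salient_components(w, covs):
        # Eq. 3 on the spatial part of each covariance (an assumption here),
        # followed by the joint test w_i > mean(w) and delta_i > mean(delta).
        eig = np.array([np.linalg.eigvalsh(c[:3, :3]) for c in covs])  # ascending
        delta = eig[:, 0] / eig[:, 2]
        return (w > w.mean()) & (delta > delta.mean())

Note that the fused priors still sum to one, since |M| = |M_L| + |M_R|.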
2.3 Region of Interest extraction

Information extracted from depth image processing and GMMs is exploited to compute regions of interest (ROIs) corresponding to human activities, as illustrated in Fig. 1. First, a set of salient spheres of fixed radius r is generated, each centered at the mean position of a salient Gaussian component in M. Then, a salient point cloud is computed as the part of the difference point cloud inside any of the salient spheres. Clusters of connected components in the salient point cloud (defined as salient clusters) are extracted. Salient clusters are likely to represent ROIs for manipulated objects in direct sight of the camera, i.e. objects that have been placed in the environment, moved or removed, as both the human trajectory analysis and the difference point cloud agree. Each salient cluster generates a spherical ROI, centered at the cluster centroid, with radius equal to the distance between the centroid and the farthest point of the cluster. Salient spheres that contain at least T_I points of the invariant point cloud are also added to the list of regions of interest. In fact, such spheres are likely to represent regions of space where user activity was detected from motion trajectory analysis, although the Kinect depth frames could not reveal any change due to occlusions. Points of the invariant point cloud within a distance r′ of any salient cluster centroid (rejection region) are not counted, to prevent duplicate ROI detections.
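As an illustrative sketch only (not the authors' implementation), the ROI extraction step can be outlined in Python as follows. The default values of r, r′ and T_I are those used in the experiments of Sect. 3; the use of DBSCAN as a stand-in for connected-component (Euclidean) clustering and its parameters are assumptions.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def extract_rois(diff_cloud, inv_cloud, salient_centers,
                     r=0.80, r_rej=0.40, t_i=50, cluster_eps=0.05):
        # diff_cloud, inv_cloud: Mx3 / Kx3 arrays (difference and invariant clouds, meters).
        # salient_centers: Sx3 array with the means of the salient Gaussians of M.
        rois = []  # list of (center, radius) spheres

        # 1. Salient point cloud: difference points inside any salient sphere.
        dist = np.linalg.norm(diff_cloud[:, None, :] - salient_centers[None, :, :], axis=2)
        salient_pts = diff_cloud[(dist <= r).any(axis=1)]

        # 2. Salient clusters (DBSCAN used here as a simple Euclidean clustering).
        centroids = []
        if len(salient_pts) > 0:
            labels = DBSCAN(eps=cluster_eps, min_samples=5).fit_predict(salient_pts)
            for k in set(labels) - {-1}:
                cluster = salient_pts[labels == k]
                c = cluster.mean(axis=0)
                rois.append((c, np.linalg.norm(cluster - c, axis=1).max()))
                centroids.append(c)

        # 3. Occluded ROIs: salient spheres containing at least t_i invariant points,
        #    ignoring invariant points within r_rej of any salient cluster centroid.
        keep = np.ones(len(inv_cloud), dtype=bool)
        for c in centroids:
            keep &= np.linalg.norm(inv_cloud - c, axis=1) > r_rej
        inv_kept = inv_cloud[keep]
        for center in salient_centers:
            if (np.linalg.norm(inv_kept - center, axis=1) <= r).sum() >= t_i:
                rois.append((center, r))
        return rois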
Fig. 2. Images of an example experiment (from left to right). The user picks up a red cone from inside a box (first salient action) and moves it on top of a table (second salient action). Images were recorded by an external camera close to the Kinect V2 sensor.

Fig. 3. Two regions of interest (blue and red sphere) are detected from the experiment in Fig. 2.

Fig. 4. A difference point cloud (white points) where sensor noise at the image border is filtered out by the GMM.

Table 1. True positives (TP), false positives (FP), false negatives (FN), precision and recall for in-direct-sight and occluded actions.

                 In direct sight                    Occluded
Type     TP   FP   FN   Prec   Rec         TP   FP   FN   Prec   Rec
1        19    0    1   1.00   0.95         8    0    2   1.00   0.80
2         0    0    0      -      -        17    4    3   0.81   0.85
3        26    1    4   0.96   0.87         0    0    0      -      -

3 Experiments

The proposed approach was evaluated by different users in an environment of size 5 by 4 meters. In each experiment the user entered the workspace, performed multiple manipulation actions and then left the scene. Fig. 2 shows an example experiment. Results of the spatial change detection algorithm are shown in Fig. 3. Fig. 4 illustrates, in a simpler experiment for clarity, the benefit of the GMM trajectory analysis when an action is performed in direct sight: the sensor noise at the depth image borders is filtered out.

A quantitative evaluation was carried out on a dataset consisting of three types of experiment, with 10 trials for each type. Experiments of the first type consist of a sequence of three actions, two performed in direct sight of the sensor and one in an occluded region. The second type involves two relevant actions, both in occluded areas. Experiments of the third type consist again of three actions, all in direct sight of the sensor. Experiments have been performed with r = 80 cm, r′ = 40 cm, T_h = 10 cm and T_I = 50. Results are summarized in Table 1. Precision and recall are at least 0.87 for actions in direct sight of the sensor and at least 0.80 for actions in occluded regions.

References

1. E. E. Aksoy, A. Abramov, J. Dörr, K. Ning, B. Dellen, and F. Wörgötter. Learning the semantics of object-action relations by observation. Int. J. Rob. Res., 30(10):1229–1249, September 2011.
2. P. Alimi, D. Meger, and J. J. Little. Object persistence in 3D for home robots. In The Semantic Perception, Mapping, and Exploration (SPME) Workshop, 2012.
3. R. Ambrus, N. Bore, J. Folkesson, and P. Jensfelt. Meta-rooms: Building and maintaining long term spatial models in a dynamic world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1854–1861, Sept 2014.
4. P. Drews, P. Núñez, R. Rocha, M. Campos, and J. Dias. Novelty detection and segmentation based on Gaussian mixture models: A case study in 3D robotic laser mapping. Robotics and Autonomous Systems, 61(12):1696–1709, 2013.
5. D. R. Faria, R. Martins, J. Lobo, and J. Dias. Extracting data from human manipulation of objects towards improving autonomous robotic grasping. Robotics and Autonomous Systems, 60(3):396–410, 2012.
6. A. Fathi and J. M. Rehg. Modeling actions through state changes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
7. T. Fäulhammer, R. Ambrus, C. Burbridge, M. Zillich, J. Folkesson, N. Hawes, P. Jensfelt, and M. Vincze. Autonomous learning of object models on a mobile robot. IEEE Robotics and Automation Letters, 2(1):26–33, Jan 2017.
8. R. Finman, T. Whelan, M. Kaess, and J. J. Leonard. Toward lifelong object segmentation from change detection in dense RGB-D maps. In European Conference on Mobile Robots (ECMR), pages 178–185, Sept 2013.
9. E. Herbst, Xiaofeng Ren, and D. Fox. RGB-D object discovery via multi-scene analysis. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4850–4856, Sept 2011.
10. Sing Bing Kang and K. Ikeuchi. Toward automatic robot instruction from perception - temporal segmentation of tasks from human hand motion. IEEE Transactions on Robotics and Automation, 11(5):670–681, Oct 1995.
11. Sang Hyoung Lee, Il Hong Suh, S. Calinon, and R. Johansson. Learning basis skills by autonomous segmentation of humanoid motion trajectories. In IEEE-RAS International Conference on Humanoid Robots, 2012.
12. J. F.-S. Lin and D. Kulic. Online segmentation of human motion for automated rehabilitation exercise analysis. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22(1):168–180, 2014.
13. J. Mason, B. Marthi, and R. Parr. Object disappearance for object discovery. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2836–2843, Oct 2012.
14. R. Monica, J. Aleotti, and S. Caselli. A KinFu based approach for robot spatial attention and view planning. Robotics and Autonomous Systems, 75, Part B:627–640, 2016.
15. S. Petsch and D. Burschka. Representation of manipulation-relevant object properties and actions for surprise-driven exploration. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.
16. M. Yeasin and S. Chaudhuri. Toward automatic robot programming: learning human skill from visual data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 30(1):180–185, Feb 2000.