V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 34–39, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 R. Gargalík, Z. Tomori

Control of Depth-Sensing Camera via Plane of Interaction

Radoslav Gargalík1,2 and Zoltán Tomori2
1 Inst. of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, Slovakia
radoslav.gargalik@gmail.com
2 Inst. of Experimental Physics, Slovak Academy of Sciences, Košice, Slovakia
tomori@saske.sk

Abstract: Depth-sensing cameras (e.g. Kinect or Creative Gesture Camera) are exploited in many computer vision and augmented reality applications. They can also serve as a key component of natural user interaction via a virtual keyboard, body pose or hand gestures. We integrated both of these functions by proposing the "Plane of Interaction" (POI), a solid flat surface placed on the reference plane (table top, floor). A calibrated camera/projector system automatically identifies the position of the POI surface, projects virtual menu buttons onto it and recognizes which button was "clicked" by hand (fingertip). The proposed POI was tested with a camera/projector prototyping setup. POI allows natural and quite robust interaction in this specific environment.

1 Introduction

Human–computer interaction is one of the most progressive research areas of computer science because it simplifies the control of electronic devices and opens the door to new potential users. The Natural User Interface (NUI) represents the latest stage, exploiting gestures, voice commands, gaze tracking, brain–computer interfaces based on the analysis of EEG signals, etc. Probably the most popular NUI outcome are touch gestures, applied within the last few years to touch tablets and smart phones.

The low-cost depth-sensing camera Microsoft Kinect [11, 12], launched in November 2010, opened a new era of 3D computer vision applications. A color camera combined with the depth sensor dramatically simplifies some computer vision algorithms, e.g. segmentation of a 3D scene. Easy segmentation of the human body or hand allows calculation of the 3D coordinates of the corresponding skeletons. This representation simplified the recognition of hand gestures like "wave", "circle", "swipe", "pinch", etc. Some new 3D cameras with built-in gesture recognition capabilities, like the "Creative Senz3D Camera" [14] or "Leap Motion" [13], appeared recently. Open source libraries such as OpenNI and OpenCV [15] support a broader range of such cameras, so applications like a "virtual keyboard", an "air harp" and many others are available via the internet. On the other hand, the very quick progress resulted in missing official standards in this area: the interpretation of some gestures or other forms of interaction can be natural in one application but quite confusing in others.

Projection-based augmented reality [3, 4] projects images onto a real surface using one or several projectors. A depth-sensing camera can make such applications interactive – the user can change the shape or the position of real objects, which is followed by a change of the projected color, image, animation, etc.

In addition to the interaction with an object, interaction controlling the program itself (e.g. changing the mode of operation) is sometimes also required. The use of a mouse or a keyboard would be complicated and unnatural in this situation. Therefore we created a specific object (the plane of interaction) which is a natural part of the augmented reality environment but whose function is to control the program via projected virtual buttons.

2 Related Work

There are many application areas in which a Kinect/projector pair can be used for augmented reality. Many researchers have developed different augmented/virtual reality graphical user interfaces over time. They differ in hardware requirements (such as haptic pens, special virtual glasses, etc.) and in features.

Szalavari in his dissertation [6] proposed an augmented reality panel called the Personal Interaction Panel (PIP). The PIP consisted of a black board (as a panel), a haptic pen and a head mounted display. The panel and the pen were tracked with a Polhemus Fastrak (six degree-of-freedom) tracker, where the receiver was mounted to the head mounted display. Virtual I/O i-glasses! were used as the head mounted display. The basic principle of this approach is that the electromagnetic tracker tracks the panel and the haptic pen, and the head mounted display is then used to overlay graphics onto the real environment (mainly the panel). The electromagnetic emitter and receiver worked at 30 Hz. The disadvantage of this approach is the additional hardware requirements (haptic pen, head mounted display and electromagnetic emitter/receiver).

Poupyrev et al. [7] proposed the concept of a Generic Augmented-Reality Interface. They used tiles, which are printed paper cards (15 × 15 cm each) with simple square patterns consisting of a thick black border and a unique symbol in the middle. According to the authors, any symbol can be used for identification. There are two types of tiles: physical icons and phicons. Phicons propose a close coupling between physical and virtual properties, so that their shape and appearance mirror the corresponding virtual object or functionality. The user can freely manipulate the tiles, which is also the default way of interacting with the real objects represented by the tiles. The user must wear a lightweight Sony Glasstron PLMS700 head-set. The main steps of this approach include tracking rectangular markers of known size, calculating the relative camera position and orientation in real time, and finally rendering virtual objects on the physical paper cards. The system runs at 30 FPS and was implemented with the open-source ARToolKit software library. Although this concept is interesting and the developed generic augmented-reality interface can be used in many practical ways, it shares the same disadvantage as the previous approach – the need for a special head mounted display.

A similar approach was used by Geiger et al. [8] to construct the ARGUI augmented-reality system. The system was built with the ARToolKit, OpenGL and GLUT libraries. Depending on the way the 2D cursor is positioned on the augmented reality pattern, two modes are available in the system: cursor based and marker based interaction. Cursor based movement means that the 2D mouse cursor is moved using a suitable input device (such as a mouse or tablet). Marker based movement means that the video camera is moved to position the augmented reality pattern-object under the "static" mouse cursor. A mixture of both modes is also possible. In a real application (augmenting real paintings with additional information about the artist, painting techniques, historical information, etc.), a head mounted display (Eyetrek) was used with a mounted USB camera (Philips ToUcam). To control the cursor, a remote control with gyro technology was used. Again, as in all previous approaches, additional hardware (a head mounted display and GyroControl) must be used.

Benko et al. [9] present MirageTable, a projected augmented reality tabletop. One of the system's features is freehand interaction. The system consists of a Kinect depth-sensing camera, a stereo projector (Acer H5360), shutter glasses (Nvidia 3D Vision), a stereo sync emitter (Nvidia 3D Vision) and a table. The Kinect scans objects in front of the table so they can be projected as a mirror view. To provide a correct 3D perspective view of the virtual scene, the user's head location and gaze must be tracked. To track the user's head, the disturbing reflectivity of the shutter glasses is used: the reflectivity creates "holes" in the acquired depth map, so the aggregate location of those holes can be tracked. Freehand, physically realistic interactions are simulated using the commercial Nvidia PhysX game engine.
3 Plane of Interaction

Our main motivation was to find a simple and easy way to interact with the program during experiments with an augmented reality environment, using the depth sensor instead of a mouse or keyboard. The proposed "Plane of Interaction" (POI) is a part of the scene and exploits the same depth-sensing camera as the basic program. From the technical point of view, the POI is a planar surface which can be placed anywhere inside the field of view of the camera; it should be automatically identified by the camera and exploited as a virtual touch sensor.

In this section we describe how to acquire, calibrate and extract the important information using the POI.

3.1 Mechanical Parts

For prototyping purposes and for experiments with projector-based augmented reality we constructed the setup shown in Fig. 1. The projector P and the 3D camera C share the same mount attached to a massive stand. The translation and possible rotation between them is compensated by the calibration software (see subsection 3.4 – Calibration). The Plane of Interaction is a solid planar plate which can be placed anywhere inside the field of view of the camera.

Figure 1: Setup for Projection-Based Augmented Reality Experiments. P – Projector, C – 3D Camera, POI – Plane of Interaction.

3.2 Hardware and Software

We exploited a BenQ MX613ST projector (aspect ratio 4:3, throw ratio cca. 1.0) and a Microsoft Kinect [11] depth-sensing camera. We used the OpenNI library to control the camera acquisition process. It offers a set of functions to acquire RGB and depth images and basic functions to process them.
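The paper itself does not list acquisition code; the step can be illustrated through OpenCV's OpenNI capture backend (which requires OpenCV built with OpenNI support). The authors call the OpenNI API directly, so the following is only a rough substitute sketch:

```python
# Illustrative sketch: grab registered depth and color frames from a Kinect
# via OpenCV's OpenNI backend (assumption: OpenCV compiled with OpenNI support).
import cv2
import numpy as np

cap = cv2.VideoCapture(cv2.CAP_OPENNI)                 # first OpenNI device (Kinect)
if not cap.isOpened():
    raise RuntimeError("no OpenNI-compatible depth camera found")

while True:
    if not cap.grab():                                 # grab one synchronized frame pair
        break
    ok_d, depth_mm = cap.retrieve(None, cv2.CAP_OPENNI_DEPTH_MAP)   # 16-bit depth in mm
    ok_c, bgr = cap.retrieve(None, cv2.CAP_OPENNI_BGR_IMAGE)        # color image
    if ok_d and ok_c:
        # crude scaling of millimeters to 8-bit gray, only for display
        cv2.imshow("depth", np.clip(depth_mm / 20.0, 0, 255).astype(np.uint8))
        cv2.imshow("color", bgr)
    if cv2.waitKey(1) == 27:                           # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```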
3.3 Reference Plane Detection

We can imagine the reference plane as a sea level from which the height of all objects is measured. The POI is oriented parallel to the reference plane (floor, table top). Despite the precise adjustable mounting, we cannot guarantee that both the Kinect and the projector are perpendicular to the floor. Therefore we acquire a background image, which is the image of the flat surface (floor, desktop) without any objects placed on it. Then we fit the background image by the analytical equation of a plane using the RANSAC algorithm [1]. From the resulting analytical solution we generate the background image again, and this synthetic background is subtracted from every captured depth image.
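A minimal sketch of this step, assuming the depth image is already in millimeters; the iteration count and inlier threshold are illustrative values, not the authors' settings:

```python
# Fit the plane z = a*x + b*y + c to the background depth image with a simple
# RANSAC loop, then regenerate a synthetic background for later subtraction.
import numpy as np

def fit_plane_ransac(depth, iters=200, thresh_mm=10.0, rng=np.random.default_rng(0)):
    ys, xs = np.nonzero(depth > 0)                        # valid depth pixels
    zs = depth[ys, xs].astype(np.float64)
    pts = np.column_stack([xs, ys, np.ones_like(xs)]).astype(np.float64)
    best_model, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(zs), size=3, replace=False)  # minimal sample of 3 pixels
        try:
            model = np.linalg.solve(pts[idx], zs[idx])    # candidate [a, b, c]
        except np.linalg.LinAlgError:
            continue                                      # degenerate (collinear) sample
        inliers = int((np.abs(pts @ model - zs) < thresh_mm).sum())
        if inliers > best_inliers:
            best_model, best_inliers = model, inliers
    # least-squares refinement on all inliers of the best candidate
    mask = np.abs(pts @ best_model - zs) < thresh_mm
    best_model, *_ = np.linalg.lstsq(pts[mask], zs[mask], rcond=None)
    return best_model                                     # plane coefficients a, b, c

def synthetic_background(shape, model):
    h, w = shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    return model[0] * xs + model[1] * ys + model[2]       # plane depth at every pixel

# usage: elevation above the reference plane for a new depth frame
# height = synthetic_background(depth.shape, model) - depth
```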
3.4 Calibration

Figure 2: Part of the Calibration Process. Projected Grid on the Surface (Left) and a Small Disc Placed at a Grid Intersection (Right).

The camera and the projector have different poses (they are translated and rotated relative to each other). They have different resolutions, and we also have to take into account the conversion of pixels to length units (millimeters). All of these problems should be solved by the following geometrical transformation

\[ x_P = T_x(x_C, y_C, z_C), \qquad y_P = T_y(x_C, y_C, z_C) \tag{1} \]

where T_x and T_y are linear transformations. It means that each 3D point (x_C, y_C, z_C) measured by the 3D camera is transformed into the 2D point (x_P, y_P) displayed by the projector. This problem is similar to the "bundle adjustment" method in [1]. The way to determine the transformation functions T_x, T_y is to find corresponding pairs of camera points V_C and projector points V_P and to minimize the reprojection error (solve the least-squares problem). Let us denote V_C = (x_C, y_C, z_C, 1)^T the point measured by the 3D camera and V_P = (x_P, y_P)^T the same point as projected by the projector. Assuming that the difference between the position and orientation of the 3D camera and the projector can be expressed by a rotation and a translation, we can first transform the point V_C from the 3D camera coordinate system into the 3D orthonormal coordinate system of the projector, where the point [0, 0, 0] is inside the projector and the z-axis points in the direction of projection:

\[ V_{P'} = A\, V_C \tag{2} \]

\[
\begin{pmatrix} x_{P'} \\ y_{P'} \\ z_{P'} \\ 1 \end{pmatrix}
=
\begin{pmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x_C \\ y_C \\ z_C \\ 1 \end{pmatrix}
\tag{3}
\]

To transform a point from the 3D coordinate system of the projector, V_{P'} = (x_{P'}, y_{P'}, z_{P'})^T, to the 2D coordinate system of the projector, V_P = (x_P, y_P)^T, we can use the following equations

\[ x_P = \frac{c_1 x_{P'}}{z_{P'}}, \qquad y_P = \frac{c_2 y_{P'}}{z_{P'}} \tag{4} \]

where c_1 and c_2 are unknown coefficients related to the ratio of the projector and the scaling from millimeters to pixels. Further expansion of equations (4) and modification of the matrix A leads to a system of 11 linear algebraic equations (more details can be found in [2]). If we enter more pairs of corresponding points (V_P, V_C) than the number of equations, we obtain an overdetermined system which can be solved by QR matrix decomposition. The resulting matrix contains the coefficients describing the transformation between the Kinect 3D camera and the projector coordinates.

In practice, we need cca. 30 pairs of corresponding points to achieve reasonable precision. We have developed a special program which simplifies their acquisition. We project a grid onto the surface and place a small disc at all grid intersections (see Fig. 2). At each intersection the program automatically acquires the depth image of the disc, finds its contour and fits the contour by the model of an ellipse. The center of such a disc determines the coordinates (x_C, y_C), and the height of the disc above the reference plane is z_C. The coordinates of the projector points (x_P, y_P) are the known grid intersections.

As the positions of the camera and the projector are stable, the calibration needs to be performed only once. The calibration data are then saved as a part of the configuration file. The calibration coefficients are the same for the POI rectangle and for the rest of the depth image.

It should be noted that, to obtain optimal precision of the transformation, not all pairs of corresponding points should lie in one plane. If all pairs of corresponding points lie in one plane, then the error arising from an inaccurate transformation between the 3D Kinect camera and the projector grows with the absolute distance of the measured 3D points from the plane in which the corresponding points were acquired. In that case serious inaccuracy can be seen in the image projected onto the POI surface if the POI surface (plane) is not close to the plane in which the pairs of corresponding points lie.
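The composite mapping of equations (2)–(4) behaves like a 3×4 projection matrix with its last element fixed to 1, i.e. the 11 unknowns mentioned above. A sketch of how the overdetermined system could be assembled from the disc/grid correspondences and solved by QR decomposition follows; the variable names and helper functions are illustrative, not taken from the authors' implementation:

```python
# Estimate a 3x4 camera-to-projector projection matrix (11 unknowns, last
# element fixed to 1) from ~30 corresponding points, solved via QR.
import numpy as np

def calibrate(cam_pts, proj_pts):
    """cam_pts: Nx3 Kinect points (mm); proj_pts: Nx2 projector pixels."""
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(cam_pts, proj_pts):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z]); rhs.append(u)
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z]); rhs.append(v)
    A, b = np.asarray(rows, float), np.asarray(rhs, float)
    Q, R = np.linalg.qr(A)                      # QR solution of the least-squares problem
    p = np.linalg.solve(R, Q.T @ b)             # the 11 coefficients
    return np.append(p, 1.0).reshape(3, 4)      # full 3x4 projection matrix

def camera_to_projector(P, pt3d):
    """Map one Kinect 3D point to projector pixel coordinates (equation 4)."""
    x, y, w = P @ np.append(pt3d, 1.0)
    return x / w, y / w
```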
3.5 Detection of Finger Position

Figure 3: In-memory Image of Menu Buttons (Left) and Projected Menu Buttons on the POI Surface (Right).

Figure 4: Detection of the Fingertip.

Figure 5: In-memory Image of Menu Buttons (Left), Binary Image of the Detected Fingertip with Noise (Middle) and the Detected Fingertip After the OPEN Morphology Operation, with the Zero-based Index of the "Clicked" Virtual Button (Right).

The image of the menu buttons is projected onto the surface of the POI (see Fig. 3, right). The goal is to identify the button rectangle touched by a fingertip.

It should be noted that the menu buttons should be projected onto the surface regardless of the orientation of the surface (see Fig. 3, right). So if the surface is rotated, we must ensure the proper orientation of the menu buttons, too. The first step in our approach is to detect the POI surface. After that we find the smallest possible rectangle (minimal area) which contains all points of the POI surface, and we also calculate its rotation.

Once this is known, we calculate the affine transform between the in-memory image, which is not rotated, and the projected image, which is rotated according to the calculated angle. Let us denote this affine matrix M. To calculate M we need three corresponding 2D points between the in-memory image and the projected image. These points are depicted in Fig. 3 as points A, B and C, where superscript S denotes the source image and superscript D the destination image. Clearly, all six points (A^S, B^S, C^S, A^D, B^D, C^D) can be calculated automatically.

The depth sensor watches the hand above the POI surface from the top. The reference plane (see Fig. 4) is labeled R; the POI is h units above R. Two threshold levels T1 and T2 represent the range of sensitive distances from the POI. The 3D points whose distance above the POI lies in this interval create a binary image representing a rough approximation of the fingertip (in our experiments we used T1 = 5 and T2 = 20 millimeters). As stated above, the POI surface can be rotated, so to find out which button is "clicked", we first transform the binary image of the fingertip approximation into the in-memory menu image coordinates using the inverse of the matrix M. The result of this step is depicted in Fig. 5 (middle). To eliminate noise (the outline of the hand and other fingers), we apply the OPEN morphology operation to the binary image. In our experiments we used a 3 × 3 rectangle as the kernel, with the anchor point in the center of the rectangle, and the OPEN morphology operation was applied in two iterations. The difference can be seen in Fig. 5 (middle and right). To finally find out which virtual button is "clicked", we count the non-zero pixels in each virtual button rectangle of the binary image and select the one with the maximum number of white pixels in its rectangle. To avoid recognizing some small region as a fingertip, we use another threshold value T_f and "select" a virtual button only if the counted white pixels exceed T_f.

In our experiments we used T_f = 150 pixels. In case smaller fingers are expected (such as the fingertips of children), the threshold value T_f should be set to something between 70 and 100 pixels.

It should be noted that the orientation of the hand is not critical to this approach, which is clearly an advantage. It is possible to place the hand from any direction onto the POI surface and, as stated previously, the POI surface can be freely rotated on the reference plane.
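A sketch of the button-selection pipeline described above, using standard OpenCV operations (the affine matrix M itself can be obtained from the three point pairs with cv2.getAffineTransform, and the minimal-area rectangle with cv2.minAreaRect). The button layout and image sizes are assumptions made only to keep the sketch self-contained:

```python
# Depth-band threshold above the POI, inverse affine warp into menu
# coordinates, OPEN morphology, and per-button white-pixel count.
import cv2
import numpy as np

T1, T2, T_F = 5.0, 20.0, 150           # mm band above the POI and pixel threshold

def clicked_button(height_map, h_poi, M, menu_size, buttons):
    """height_map: elevation above the reference plane (mm); h_poi: POI height (mm);
    M: 2x3 affine map from the in-memory menu image to the projected image;
    menu_size: (width, height) of the in-memory menu image;
    buttons: list of (x, y, w, h) rectangles in the in-memory menu image."""
    # 1. binary image of points lying T1..T2 mm above the POI surface
    band = ((height_map > h_poi + T1) & (height_map < h_poi + T2)).astype(np.uint8) * 255
    # 2. bring the binary image back into in-memory menu coordinates
    M_inv = cv2.invertAffineTransform(M)
    band = cv2.warpAffine(band, M_inv, menu_size)
    # 3. remove the hand outline and other noise (3x3 kernel, two iterations)
    kernel = np.ones((3, 3), np.uint8)
    band = cv2.morphologyEx(band, cv2.MORPH_OPEN, kernel, iterations=2)
    # 4. pick the button rectangle with the most white pixels, if above T_F
    counts = [cv2.countNonZero(band[y:y + bh, x:x + bw]) for x, y, bw, bh in buttons]
    best = int(np.argmax(counts))
    return best if counts[best] > T_F else None
```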
4 Augmented Reality Sandbox

Figure 6: Augmented Reality Sandbox. The Color Projected onto the Sand Depends on the Height.

A science museum (also called a "theme park", "science center" or "discovery center") is a modern type of museum where most of the exhibits are interactive. One such museum was opened last year in our city, and our group participated in the construction of several exhibits. The augmented reality sandbox is one of them, and it exploits the depth-sensing camera.

The construction of our interactive sandbox was inspired by [4]. Both the Kinect camera and the projector are attached to the ceiling above the box with sand (see Fig. 6). The Kinect measures the elevation of the sand terrain and the projector illuminates the sand with the corresponding colors (hills in brown, lakes in blue, etc.). A change of the terrain is followed by a change of color with minimal delay. The table defining the colors for the individual height intervals can be defined by the user. The sandbox was calibrated by the same algorithm as described in subsection 3.4 – Calibration.
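A minimal sketch of such a user-defined height-to-color table, assuming the elevation map is already expressed in millimeters above the reference plane; the interval boundaries and colors are examples only, not the exhibit's actual table:

```python
# Turn a user-editable table of height intervals into a BGR image that the
# calibrated projector displays onto the sand.
import numpy as np

# (upper bound in mm, BGR color); heights below the first bound are "water"
HEIGHT_TABLE = [
    (20,  (200, 100,   0)),   # blue  - lakes
    (60,  ( 60, 180,  60)),   # green - lowland
    (120, ( 40, 100, 160)),   # brown - hills
    (1e9, (255, 255, 255)),   # white - peaks
]

def colorize(height_map):
    """height_map: elevation of the sand above the reference plane (mm)."""
    out = np.zeros((*height_map.shape, 3), np.uint8)
    lower = -np.inf
    for upper, color in HEIGHT_TABLE:
        mask = (height_map > lower) & (height_map <= upper)
        out[mask] = color
        lower = upper
    return out   # warped by the calibration of subsection 3.4 before projection
```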
4.1 Specifics of User Interaction in the Science Museum

Conditions in the science museum are very specific:

• Most of the visitors are groups of children requiring simple, robust and self-explanatory control of the exhibits.

• No extra hardware such as a keyboard, mouse, cables, etc. is acceptable.

• The virtual menu should exploit the same camera as the exhibit itself.

• The function of the menu must be very simple – usually just selecting the mode of operation.

• Periodic innovation and upgrading of the exhibits is a necessary condition for achieving repeat visits by the same people.

• The museum is open daily for more than 100 visitors per day, so the time for installation and testing of exhibits in real conditions is limited.

From the beginning, our sandbox operated only in the single mode described above. However, we plan to implement new features and modes in the near future. The POI concept described in this paper provides an easy way to switch between various modes of operation by "clicking" the virtual buttons by hand.

5 Results

We proposed a simple system of interaction with a program exploiting a depth-sensing camera, called the "Plane of Interaction". The POI is an integral part of the virtual reality environment with a specific function – to recognize the position of a finger (fingertip) and return the index of the "clicked" rectangle representing a virtual button.

The proposed system works in real time. In our experiments we achieved more than 30 frames per second, which is enough for real-time augmented reality applications. In fact, because the frame rate of the Kinect depth-sensing camera is 30 FPS, more than 30 FPS is not needed.

We tested the POI on the prototyping system with satisfactory results. Currently, we are testing the POI in the real conditions of the science museum with the augmented reality sandbox exhibit.

6 Future Work

The augmented reality sandbox installed in the science museum has been continuously tested by real visitors. Although the anonymous survey declared mostly high and very high satisfaction of the visitors, practical experience revealed some problems and showed possible new improvements. The calibrated Kinect/projector/sandbox system can be easily extended to create other augmented reality applications.

Anyway, technological aspects are only one side of the coin because they must be in balance with the other exhibits in the science museum. Therefore any significant changes in the exhibits must be consulted with the other authors and designers.

The functionality of the POI can also be extended so that it can be used in a similar way as the touch screens and displays widely used today. We plan to extend the functionality of the POI concept described in this paper with the following features:

• Multi-select. This feature enables users to use more than one finger to perform a multi-select. It can be utilized to select more objects at once or, for example, to "check" multiple virtual check-boxes.

• Select and move. This feature is very similar to the widely used drag & drop feature and enables users to select some object (or, in combination with the previous feature, to select multiple objects at once) and to move it somewhere else. The "select" part of this feature will be exactly the same concept as the one described in this paper, which is similar to the mouse-down event. To simulate the mouse-up event, we just check for the fingertip (or fingertips) moving away from the POI surface.

• Recognition of basic gestures. If it is possible to simulate "click and move", then we can use the points obtained during the "move" phase as a gesture and try to recognize it. Basic gestures (such as swiping the fingertip left/right/up/down) can be recognized quite easily. To recognize more complicated gestures one can use the Dynamic Time Warping algorithm [10] or utilize some machine learning approach (a small DTW sketch is given at the end of this section).

• Recognition of basic drawn shapes. With the aid of the "select and move" feature, we can enable users to draw some basic shapes (such as a line, circle, rectangle, etc.). To recognize such a drawn shape, we can use pattern matching or, in the case of simple shapes (such as a triangle, rectangle or circle), we can find the contour of the drawn shape and try to fit the contour with a polygon. After that we can count the number of vertices of the polygon to recognize the basic shape.

It should be noted that not all of the improvements listed above are directly connected to the augmented reality sandbox described in section 4. Some of the improvements are planned to be used in our other projects in the future.
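For the gesture recognition item above, a compact textbook dynamic-time-warping distance between two fingertip trajectories is sketched below. It is the quadratic dynamic-programming formulation, not the linear-time approximation of [10], and is meant only to illustrate how recorded "move" points could be compared against gesture templates:

```python
# Quadratic DTW distance between two sequences of 2D fingertip positions.
import numpy as np

def dtw_distance(a, b):
    """a, b: sequences of (x, y) fingertip positions."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# usage: classify a drawn gesture by the nearest template
# label = min(templates, key=lambda t: dtw_distance(trajectory, templates[t]))
```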
7 Conclusion

The Plane of Interaction is an alternative way to control the exhibits in science centers whose visitors are mostly children. It is simple, self-explanatory and robust. For interaction it does not require any extra hardware beyond that already included in the exhibit.

Acknowledgment

This work was supported by the Slovak research grant agency APVV (grant 0526-11). We thank US Steel Košice as the main contributor to the Steel Park science museum and the City of Košice as co-founder.

References

[1] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
[2] J. Hrdlicka, Kinect-projector calibration, human-mapping, 3dsense interactive technologies blog, 2013. http://blog.3dsense.org/programming/kinect-projector-calibration-human-mapping-2/
[3] M. Mine, D. Rose, B. Yang, J. van Baar and A. Grundhöfer, Projection-Based Augmented Reality in Disney Theme Parks, Computer 45(7), pp. 32–40, 2012.
[4] O. Kreylos, Augmented Reality Sandbox, UC Davis, 2013. http://idav.ucdavis.edu/~okreylos/ResDev/SARndbox/
[5] Z. Tomori, R. Gargalik and I. Hrmo, Active Segmentation in 3D using Kinect Sensor, In Proceedings of the 20th International Conference on Computer Graphics, Visualization and Computer Vision (WSCG 2012), Part 2, Pilsen, Czech Republic, June 26–29, 2012. V. Skala, Ed., Pilsen.
[6] Z. Szalavari and M. Gervautz, The Personal Interaction Panel – a Two-Handed Interface for Augmented Reality, Computer Graphics Forum, pp. 335–346, 1997.
[7] I. Poupyrev, D. S. Tan, M. Billinghurst, H. Kato, H. Regenbrecht and N. Tetsutani, Developing a Generic Augmented-Reality Interface, Computer 35(3), pp. 44–50, 2002.
[8] Ch. Geiger, L. Oppermann and Ch. Reimann, 3D-Registered Interaction-Surfaces in Augmented Reality Space, In Augmented Reality Toolkit Workshop, pp. 5–13, 2003.
[9] H. Benko, R. Jota and A. D. Wilson, MirageTable: Freehand Interaction on a Projected Augmented Reality Tabletop, In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 199–208, 2012.
[10] S. Salvador and P. Chan, Toward Accurate Dynamic Time Warping in Linear Time and Space, Intelligent Data Analysis, 11(5):561–580, 2007.
[11] Microsoft, Kinect for Xbox 360. http://www.xbox.com/en-US/kinect
[12] Microsoft, Kinect for Windows. http://www.microsoft.com/en-us/kinectforwindows/
[13] Leap Motion, Inc., Leap Motion. https://www.leapmotion.com/
[14] Creative Technology Ltd., Creative Senz3D. http://us.creative.com/p/web-cameras/creative-senz3d
[15] Willow Garage and Itseez, OpenCV library. http://opencv.org/