V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 34–39, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 R. Gargalík, Z. Tomori

Control of Depth-Sensing Camera via Plane of Interaction

Radoslav Gargalík1,2 and Zoltán Tomori2
1 Inst. of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, Slovakia
radoslav.gargalik@gmail.com
2 Inst. of Experimental Physics, Slovak Academy of Sciences, Košice, Slovakia
tomori@saske.sk

Abstract: Depth-sensing cameras (e.g. Kinect or Creative Gesture Camera) are exploited in many computer vision and augmented reality applications. They can also serve as a key component of natural user interaction via a virtual keyboard, body pose or hand gestures. We integrated both of these functions by proposing the "Plane of Interaction" (POI), a solid flat surface placed on the reference plane (table top, floor). A calibrated camera/projector system automatically identifies the position of the POI surface, projects virtual menu buttons onto it and recognizes which button was "clicked" by hand (fingertip). The proposed POI was tested with a camera/projector prototyping setup. POI allows natural and quite robust interaction in this specific environment.

1 Introduction

Human–computer interaction is one of the most progressive research areas of computer science because it simplifies the control of electronic devices and opens the door to new potential users. The Natural User Interface (NUI) represents the latest stage, exploiting gestures, voice commands, gaze tracking, brain–computer interfaces based on the analysis of EEG signals, etc. Probably the most popular NUI outcome are touch gestures, applied within the last few years to touch tablets and smart phones.

The low-cost depth-sensing camera Microsoft Kinect [11, 12], launched in November 2010, opened a new era of 3D computer vision applications. A color camera combined with the depth sensor dramatically simplifies some computer vision algorithms, e.g. segmentation of a 3D scene. Easy segmentation of the human body or hand allows calculation of the 3D coordinates of the corresponding skeletons. This representation simplified the recognition of hand gestures like "wave", "circle", "swipe", "pinch", etc. Some new 3D cameras with built-in gesture recognition capabilities, like the "Creative Senz3D Camera" [14] or "Leap Motion" [13], appeared recently. Open source libraries such as OpenNI and OpenCV [15] support a broader range of such cameras, so applications like a "virtual keyboard", an "air harp" and many others are available via the internet. On the other hand, the very quick progress resulted in missing official standards in this area: the interpretation of some gestures or other forms of interaction can be natural in one application but quite confusing in others.

Projection-based augmented reality [3, 4] projects images onto a real surface using one or several projectors. A depth-sensing camera can make such applications interactive – the user can change the shape or the position of real objects, which is followed by a change of the projected color, image, animation, etc.

In addition to the interaction with an object, interaction controlling the program itself (e.g. changing the mode of operation) is sometimes also required. The use of a mouse or a keyboard would be complicated and unnatural in this situation. Therefore we created a specific object (the plane of interaction) which is a natural part of the augmented reality environment but whose function is to control the program via projected virtual buttons.

2 Related Work

There are many application areas in which a Kinect/projector pair can be used for augmented reality. Many researchers have developed different augmented/virtual reality graphical user interfaces over time. They differ in hardware requirements (such as haptic pens, special virtual glasses, etc.) and in features.

Szalavari in his dissertation [6] proposed an augmented reality panel called the Personal Interaction Panel (PIP). The PIP consisted of a black board (as a panel), a haptic pen and a head mounted display. The panel and the pen were tracked with a Polhemus Fastrak (six degree-of-freedom) tracker, where the receiver was mounted to the head mounted display. Virtual I/O i-glasses! were used as the head mounted display. The basic principle of this approach is that the electromagnetic tracker tracks the panel and the haptic pen, and the head mounted display is then used to overlay graphics onto the real environment (mainly the panel). The electromagnetic emitter and receiver worked at 30 Hz. The disadvantage of this approach is the additional hardware requirements (haptic pen, head mounted display and electromagnetic emitter/receiver).

Poupyrev et al. [7] proposed the concept of a Generic Augmented-Reality Interface. They used tiles, which are printed paper cards (15 × 15 cm each) with simple square patterns consisting of a thick black border and a unique symbol in the middle. According to the authors, any symbol can be used for identification. There are two types of tiles: physical icons and phicons. Phicons propose a close coupling between physical and virtual properties, so that their shape and appearance mirror the corresponding virtual object or functionality. The user can freely manipulate the tiles, which is also the default way of interacting with the real objects represented by the tiles. The user must wear a lightweight Sony Glasstron PLMS700 head-set. The main steps of this approach include tracking rectangular markers of known size, calculating the relative camera position and orientation in real time, and finally rendering virtual objects on the physical paper cards. The system runs at 30 FPS and was implemented with the open-source ARToolKit software library. Although this concept is interesting and the developed generic augmented-reality interface can be used in many practical ways, it shares the same disadvantage as the previous approach – the need for a special head mounted display.

A similar approach was used by Geiger et al. [8] to construct the ARGUI augmented-reality system. The system was built with the ARToolKit, OpenGL and GLUT libraries. Depending on the way the 2D cursor is positioned on the augmented reality pattern, two modes are available in the system: cursor based and marker based interaction. Cursor based movement means that the 2D mouse cursor is moved using a suitable input device (such as a mouse or tablet). Marker based movement means that the video camera is moved to position the augmented reality pattern-object under the "static" mouse cursor. A mixture of both modes is also possible. In a real application (augmenting real paintings with additional information about the artist, painting techniques, historical information, etc.), a head mounted display (Eyetrek) was used with a mounted USB camera (Philips ToUcam). To control the cursor, a remote control with gyro technology was used. Again, as in all previous approaches, additional hardware (a head mounted display and GyroControl) must be used.

Benko et al. [9] present MirageTable, a projected augmented reality tabletop. One of the system's features is freehand interaction. The system consists of a Kinect depth-sensing camera, a stereo projector (Acer H5360), shutter glasses (Nvidia 3D Vision), a stereo sync emitter (Nvidia 3D Vision) and a table. The Kinect scans objects in front of the table so they can be projected as a mirror view. To provide a correct 3D perspective view of the virtual scene, the user's head location and gaze must be tracked. To track the user's head, the disturbing reflectivity of the shutter glasses is used: the reflectivity creates "holes" in the acquired depth map, so the aggregate location of those holes can be tracked. Freehand, physically realistic interactions are simulated using the commercial Nvidia PhysX game engine.
3 Plane of Interaction

Our main motivation was to find a simple and easy way to interact with the program during experiments with an augmented reality environment, using the depth sensor instead of a mouse or keyboard. The proposed "Plane of Interaction" (POI) is a part of the scene and exploits the same depth-sensing camera as the basic program. From the technical point of view, the POI is a planar surface which can be placed anywhere inside the field of view of the camera; it should be automatically identified by the camera and exploited as a virtual touch sensor.

In this section we describe how to acquire, calibrate and extract the important information using the POI.

3.1 Mechanical Parts

For prototyping purposes and for experiments with projector-based augmented reality we constructed the setup shown in Fig. 1. The projector P and the 3D camera C share the same mount attached to a massive stand. The translation and possible rotation between them is compensated by the calibration software (see subsection 3.4 – Calibration). The Plane of Interaction is a solid planar plate which can be placed anywhere inside the field of view of the camera.

Figure 1: Setup for Projection-Based Augmented Reality Experiments. P – Projector, C – 3D Camera, POI – Plane of Interaction.

3.2 Hardware and Software

We exploited a BenQ MX613ST projector (aspect ratio 4:3, throw ratio cca. 1.0) and a Microsoft Kinect [11] depth-sensing camera. We used the OpenNI library to control the camera acquisition process. It offers a set of functions to acquire RGB and depth images and basic functions to process them.
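The paper itself does not list acquisition code; the step can be illustrated through OpenCV's OpenNI capture backend (which requires OpenCV built with OpenNI support). The authors call the OpenNI API directly, so the following is only a rough substitute sketch:

```python
# Illustrative sketch: grab registered depth and color frames from a Kinect
# via OpenCV's OpenNI backend (assumption: OpenCV compiled with OpenNI support).
import cv2
import numpy as np

cap = cv2.VideoCapture(cv2.CAP_OPENNI)                 # first OpenNI device (Kinect)
if not cap.isOpened():
    raise RuntimeError("no OpenNI-compatible depth camera found")

while True:
    if not cap.grab():                                 # grab one synchronized frame pair
        break
    ok_d, depth_mm = cap.retrieve(None, cv2.CAP_OPENNI_DEPTH_MAP)   # 16-bit depth in mm
    ok_c, bgr = cap.retrieve(None, cv2.CAP_OPENNI_BGR_IMAGE)        # color image
    if ok_d and ok_c:
        # crude scaling of millimeters to 8-bit gray, only for display
        cv2.imshow("depth", np.clip(depth_mm / 20.0, 0, 255).astype(np.uint8))
        cv2.imshow("color", bgr)
    if cv2.waitKey(1) == 27:                           # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```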
3.3 Reference Plane Detection

We can imagine the reference plane as a sea level from which the height of all objects is measured. The POI is oriented parallel to the reference plane (floor, table top). Despite the precise adjustable mounting, we cannot guarantee that both the Kinect and the projector are perpendicular to the floor. Therefore we acquire a background image, which is the image of the flat surface (floor, desktop) without any objects placed on it. Then we fit the background image by the analytical equation of a plane using the RANSAC algorithm [1]. From the resulting analytical solution we generate the background image again, and this synthetic background is subtracted from every captured depth image.
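A minimal sketch of this step, assuming the depth image is already in millimeters; the iteration count and inlier threshold are illustrative values, not the authors' settings:

```python
# Fit the plane z = a*x + b*y + c to the background depth image with a simple
# RANSAC loop, then regenerate a synthetic background for later subtraction.
import numpy as np

def fit_plane_ransac(depth, iters=200, thresh_mm=10.0, rng=np.random.default_rng(0)):
    ys, xs = np.nonzero(depth > 0)                        # valid depth pixels
    zs = depth[ys, xs].astype(np.float64)
    pts = np.column_stack([xs, ys, np.ones_like(xs)]).astype(np.float64)
    best_model, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(zs), size=3, replace=False)  # minimal sample of 3 pixels
        try:
            model = np.linalg.solve(pts[idx], zs[idx])    # candidate [a, b, c]
        except np.linalg.LinAlgError:
            continue                                      # degenerate (collinear) sample
        inliers = int((np.abs(pts @ model - zs) < thresh_mm).sum())
        if inliers > best_inliers:
            best_model, best_inliers = model, inliers
    # least-squares refinement on all inliers of the best candidate
    mask = np.abs(pts @ best_model - zs) < thresh_mm
    best_model, *_ = np.linalg.lstsq(pts[mask], zs[mask], rcond=None)
    return best_model                                     # plane coefficients a, b, c

def synthetic_background(shape, model):
    h, w = shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    return model[0] * xs + model[1] * ys + model[2]       # plane depth at every pixel

# usage: elevation above the reference plane for a new depth frame
# height = synthetic_background(depth.shape, model) - depth
```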
3.4 Calibration

Figure 2: Part of the Calibration Process. Projected Grid on the Surface (Left) and a Small Disc Placed at a Grid Intersection (Right).

The camera and the projector have different poses (they are translated and rotated relative to each other). They have different resolutions, and we also have to take into account the conversion of pixels to length units (millimeters). All of these problems should be solved by the following geometrical transformation

\[ x_P = T_x(x_C, y_C, z_C), \qquad y_P = T_y(x_C, y_C, z_C) \tag{1} \]

where T_x and T_y are linear transformations. It means that each 3D point (x_C, y_C, z_C) measured by the 3D camera is transformed into the 2D point (x_P, y_P) displayed by the projector. This problem is similar to the "bundle adjustment" method in [1]. The way to determine the transformation functions T_x, T_y is to find corresponding pairs of camera points V_C and projector points V_P and to minimize the reprojection error (solve the least-squares problem). Let us denote V_C = (x_C, y_C, z_C, 1)^T the point measured by the 3D camera and V_P = (x_P, y_P)^T the same point as projected by the projector. Assuming that the difference between the position and orientation of the 3D camera and the projector can be expressed by a rotation and a translation, we can first transform the point V_C from the 3D camera coordinate system into the 3D orthonormal coordinate system of the projector, where the point [0, 0, 0] is inside the projector and the z-axis points in the direction of projection:

\[ V_{P'} = A\, V_C \tag{2} \]

\[
\begin{pmatrix} x_{P'} \\ y_{P'} \\ z_{P'} \\ 1 \end{pmatrix}
=
\begin{pmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x_C \\ y_C \\ z_C \\ 1 \end{pmatrix}
\tag{3}
\]

To transform a point from the 3D coordinate system of the projector, V_{P'} = (x_{P'}, y_{P'}, z_{P'})^T, to the 2D coordinate system of the projector, V_P = (x_P, y_P)^T, we can use the following equations

\[ x_P = \frac{c_1 x_{P'}}{z_{P'}}, \qquad y_P = \frac{c_2 y_{P'}}{z_{P'}} \tag{4} \]

where c_1 and c_2 are unknown coefficients related to the ratio of the projector and the scaling from millimeters to pixels. Further expansion of equations (4) and modification of the matrix A leads to a system of 11 linear algebraic equations (more details can be found in [2]). If we enter more pairs of corresponding points (V_P, V_C) than the number of equations, we obtain an overdetermined system which can be solved by QR matrix decomposition. The resulting matrix contains the coefficients describing the transformation between the Kinect 3D camera and the projector coordinates.

In practice, we need cca. 30 pairs of corresponding points to achieve reasonable precision. We have developed a special program which simplifies their acquisition. We project a grid onto the surface and place a small disc at all grid intersections (see Fig. 2). At each intersection the program automatically acquires the depth image of the disc, finds its contour and fits the contour by the model of an ellipse. The center of such a disc determines the coordinates (x_C, y_C), and the height of the disc above the reference plane is z_C. The coordinates of the projector points (x_P, y_P) are the known grid intersections.

As the positions of the camera and the projector are stable, the calibration needs to be performed only once. The calibration data are then saved as a part of the configuration file. The calibration coefficients are the same for the POI rectangle and for the rest of the depth image.

It should be noted that, to obtain optimal precision of the transformation, not all pairs of corresponding points should lie in one plane. If all pairs of corresponding points lie in one plane, then the error arising from an inaccurate transformation between the 3D Kinect camera and the projector grows with the absolute distance of the measured 3D points from the plane in which the corresponding points were acquired. In that case serious inaccuracy can be seen in the image projected onto the POI surface if the POI surface (plane) is not close to the plane in which the pairs of corresponding points lie.
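The composite mapping of equations (2)–(4) behaves like a 3×4 projection matrix with its last element fixed to 1, i.e. the 11 unknowns mentioned above. A sketch of how the overdetermined system could be assembled from the disc/grid correspondences and solved by QR decomposition follows; the variable names and helper functions are illustrative, not taken from the authors' implementation:

```python
# Estimate a 3x4 camera-to-projector projection matrix (11 unknowns, last
# element fixed to 1) from ~30 corresponding points, solved via QR.
import numpy as np

def calibrate(cam_pts, proj_pts):
    """cam_pts: Nx3 Kinect points (mm); proj_pts: Nx2 projector pixels."""
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(cam_pts, proj_pts):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z]); rhs.append(u)
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z]); rhs.append(v)
    A, b = np.asarray(rows, float), np.asarray(rhs, float)
    Q, R = np.linalg.qr(A)                      # QR solution of the least-squares problem
    p = np.linalg.solve(R, Q.T @ b)             # the 11 coefficients
    return np.append(p, 1.0).reshape(3, 4)      # full 3x4 projection matrix

def camera_to_projector(P, pt3d):
    """Map one Kinect 3D point to projector pixel coordinates (equation 4)."""
    x, y, w = P @ np.append(pt3d, 1.0)
    return x / w, y / w
```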
3.5 Detection of Finger Position

Figure 3: In-memory Image of Menu Buttons (Left) and Projected Menu Buttons on the POI Surface (Right).

Figure 4: Detection of the Fingertip.

Figure 5: In-memory Image of Menu Buttons (Left), Binary Image of the Detected Fingertip with Noise (Middle) and the Detected Fingertip After the OPEN Morphology Operation, with the Zero-based Index of the "Clicked" Virtual Button (Right).

The image of the menu buttons is projected onto the surface of the POI (see Fig. 3, right). The goal is to identify the button rectangle touched by a fingertip.

It should be noted that the menu buttons should be projected onto the surface regardless of the orientation of the surface (see Fig. 3, right). So if the surface is rotated, we must ensure the proper orientation of the menu buttons, too. The first step in our approach is to detect the POI surface. After that we find the smallest possible rectangle (minimal area) which contains all points of the POI surface, and we also calculate its rotation.

Once this is known, we calculate the affine transform between the in-memory image, which is not rotated, and the projected image, which is rotated according to the calculated angle. Let us denote this affine matrix M. To calculate M we need three corresponding 2D points between the in-memory image and the projected image. These points are depicted in Fig. 3 as points A, B and C, where superscript S denotes the source image and superscript D the destination image. Clearly, all six points (A^S, B^S, C^S, A^D, B^D, C^D) can be calculated automatically.

The depth sensor watches the hand above the POI surface from the top. The reference plane (see Fig. 4) is labeled R; the POI is h units above R. Two threshold levels T1 and T2 represent the range of sensitive distances from the POI. The 3D points whose distance above the POI lies in this interval create a binary image representing a rough approximation of the fingertip (in our experiments we used T1 = 5 and T2 = 20 millimeters). As stated above, the POI surface can be rotated, so to find out which button is "clicked", we first transform the binary image of the fingertip approximation into the in-memory menu image coordinates using the inverse of the matrix M. The result of this step is depicted in Fig. 5 (middle). To eliminate noise (the outline of the hand and other fingers), we apply the OPEN morphology operation to the binary image. In our experiments we used a 3 × 3 rectangle as the kernel, with the anchor point in the center of the rectangle, and the OPEN morphology operation was applied in two iterations. The difference can be seen in Fig. 5 (middle and right). To finally find out which virtual button is "clicked", we count the non-zero pixels in each virtual button rectangle of the binary image and select the one with the maximum number of white pixels in its rectangle. To avoid recognizing some small region as a fingertip, we use another threshold value T_f and "select" a virtual button only if the counted white pixels exceed T_f.

In our experiments we used T_f = 150 pixels. In case smaller fingers are expected (such as the fingertips of children), the threshold value T_f should be set to something between 70 and 100 pixels.

It should be noted that the orientation of the hand is not critical to this approach, which is clearly an advantage. It is possible to place the hand from any direction onto the POI surface and, as stated previously, the POI surface can be freely rotated on the reference plane.
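A sketch of the button-selection pipeline described above, using standard OpenCV operations (the affine matrix M itself can be obtained from the three point pairs with cv2.getAffineTransform, and the minimal-area rectangle with cv2.minAreaRect). The button layout and image sizes are assumptions made only to keep the sketch self-contained:

```python
# Depth-band threshold above the POI, inverse affine warp into menu
# coordinates, OPEN morphology, and per-button white-pixel count.
import cv2
import numpy as np

T1, T2, T_F = 5.0, 20.0, 150           # mm band above the POI and pixel threshold

def clicked_button(height_map, h_poi, M, menu_size, buttons):
    """height_map: elevation above the reference plane (mm); h_poi: POI height (mm);
    M: 2x3 affine map from the in-memory menu image to the projected image;
    menu_size: (width, height) of the in-memory menu image;
    buttons: list of (x, y, w, h) rectangles in the in-memory menu image."""
    # 1. binary image of points lying T1..T2 mm above the POI surface
    band = ((height_map > h_poi + T1) & (height_map < h_poi + T2)).astype(np.uint8) * 255
    # 2. bring the binary image back into in-memory menu coordinates
    M_inv = cv2.invertAffineTransform(M)
    band = cv2.warpAffine(band, M_inv, menu_size)
    # 3. remove the hand outline and other noise (3x3 kernel, two iterations)
    kernel = np.ones((3, 3), np.uint8)
    band = cv2.morphologyEx(band, cv2.MORPH_OPEN, kernel, iterations=2)
    # 4. pick the button rectangle with the most white pixels, if above T_F
    counts = [cv2.countNonZero(band[y:y + bh, x:x + bw]) for x, y, bw, bh in buttons]
    best = int(np.argmax(counts))
    return best if counts[best] > T_F else None
```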
4 Augmented Reality Sandbox

Figure 6: Augmented Reality Sandbox. The Color Projected onto the Sand Depends on the Height.

A science museum (also called a "theme park", "science center" or "discovery center") is a modern type of museum where most of the exhibits are interactive. One such museum was opened last year in our city, and our group participated in the construction of several exhibits. The augmented reality sandbox is one of them, and it exploits the depth-sensing camera.

The construction of our interactive sandbox was inspired by [4]. Both the Kinect camera and the projector are attached to the ceiling above the box with sand (see Fig. 6). The Kinect measures the elevation of the sand terrain and the projector illuminates the sand with the corresponding colors (hills in brown, lakes in blue, etc.). A change of the terrain is followed by a change of color with minimal delay. The table defining the colors for the individual height intervals can be defined by the user. The sandbox was calibrated by the same algorithm as described in subsection 3.4 – Calibration.
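A minimal sketch of such a user-defined height-to-color table, assuming the elevation map is already expressed in millimeters above the reference plane; the interval boundaries and colors are examples only, not the exhibit's actual table:

```python
# Turn a user-editable table of height intervals into a BGR image that the
# calibrated projector displays onto the sand.
import numpy as np

# (upper bound in mm, BGR color); heights below the first bound are "water"
HEIGHT_TABLE = [
    (20,  (200, 100,   0)),   # blue  - lakes
    (60,  ( 60, 180,  60)),   # green - lowland
    (120, ( 40, 100, 160)),   # brown - hills
    (1e9, (255, 255, 255)),   # white - peaks
]

def colorize(height_map):
    """height_map: elevation of the sand above the reference plane (mm)."""
    out = np.zeros((*height_map.shape, 3), np.uint8)
    lower = -np.inf
    for upper, color in HEIGHT_TABLE:
        mask = (height_map > lower) & (height_map <= upper)
        out[mask] = color
        lower = upper
    return out   # warped by the calibration of subsection 3.4 before projection
```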
4.1 Specifics of User Interaction in the Science Museum

Conditions in the science museum are very specific:

• Most of the visitors are groups of children requiring simple, robust and self-explanatory control of the exhibits.

• No extra hardware such as a keyboard, mouse, cables, etc. is acceptable.

• The virtual menu should exploit the same camera as the exhibit itself.

• The function of the menu must be very simple – usually just selecting the mode of operation.

• Periodic innovation and upgrading of the exhibits is a necessary condition for achieving repeat visits by the same people.

• The museum is open daily for more than 100 visitors per day, so the time for installation and testing of exhibits in real conditions is limited.

From the beginning, our sandbox operated only in the single mode described above. However, we plan to implement new features and modes in the near future. The POI concept described in this paper provides an easy way to switch between various modes of operation by "clicking" the virtual buttons by hand.

5 Results

We proposed a simple system of interaction with a program exploiting a depth-sensing camera, called the "Plane of Interaction". The POI is an integral part of the virtual reality environment with a specific function – to recognize the position of a finger (fingertip) and return the index of the "clicked" rectangle representing a virtual button.

The proposed system works in real time. In our experiments we achieved more than 30 frames per second, which is enough for real-time augmented reality applications. In fact, because the frame rate of the Kinect depth-sensing camera is 30 FPS, more than 30 FPS is not needed.

We tested the POI on the prototyping system with satisfactory results. Currently, we are testing the POI in the real conditions of the science museum with the augmented reality sandbox exhibit.

6 Future Work

The augmented reality sandbox installed in the science museum has been continuously tested by real visitors. Although the anonymous survey declared mostly high and very high satisfaction of the visitors, practical experience revealed some problems and showed possible new improvements. The calibrated Kinect/projector/sandbox system can be easily extended to create other augmented reality applications.

Anyway, technological aspects are only one side of the coin because they must be in balance with the other exhibits in the science museum. Therefore any significant changes in the exhibits must be consulted with the other authors and designers.

The functionality of the POI can also be extended so that it can be used in a similar way as the touch screens and displays widely used today. We plan to extend the functionality of the POI concept described in this paper with the following features:

• Multi-select. This feature enables users to use more than one finger to perform a multi-select. It can be utilized to select more objects at once or, for example, to "check" multiple virtual check-boxes.

• Select and move. This feature is very similar to the widely used drag & drop feature and enables users to select some object (or, in combination with the previous feature, to select multiple objects at once) and to move it somewhere else. The "select" part of this feature will be exactly the same concept as the one described in this paper, which is similar to the mouse-down event. To simulate the mouse-up event, we just check for the fingertip (or fingertips) moving away from the POI surface.

• Recognition of basic gestures. If it is possible to simulate "click and move", then we can use the points obtained during the "move" phase as a gesture and try to recognize it. Basic gestures (such as swiping the fingertip left/right/up/down) can be recognized quite easily. To recognize more complicated gestures one can use the Dynamic Time Warping algorithm [10] or utilize some machine learning approach (a small DTW sketch is given at the end of this section).

• Recognition of basic drawn shapes. With the aid of the "select and move" feature, we can enable users to draw some basic shapes (such as a line, circle, rectangle, etc.). To recognize such a drawn shape, we can use pattern matching or, in the case of simple shapes (such as a triangle, rectangle or circle), we can find the contour of the drawn shape and try to fit the contour with a polygon. After that we can count the number of vertices of the polygon to recognize the basic shape.

It should be noted that not all of the improvements listed above are directly connected to the augmented reality sandbox described in section 4. Some of the improvements are planned to be used in our other projects in the future.
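For the gesture recognition item above, a compact textbook dynamic-time-warping distance between two fingertip trajectories is sketched below. It is the quadratic dynamic-programming formulation, not the linear-time approximation of [10], and is meant only to illustrate how recorded "move" points could be compared against gesture templates:

```python
# Quadratic DTW distance between two sequences of 2D fingertip positions.
import numpy as np

def dtw_distance(a, b):
    """a, b: sequences of (x, y) fingertip positions."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# usage: classify a drawn gesture by the nearest template
# label = min(templates, key=lambda t: dtw_distance(trajectory, templates[t]))
```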
7 Conclusion

The Plane of Interaction is an alternative way to control the exhibits in science centers whose visitors are mostly children. It is simple, self-explanatory and robust. For interaction it does not require any extra hardware beyond that already included in the exhibit.

Acknowledgment

This work was supported by the Slovak research grant agency APVV (grant 0526-11). We thank US Steel Košice as the main contributor to the Steel Park science museum and the City of Košice as co-founder.

References

[1] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
[2] J. Hrdlicka, Kinect-projector calibration, human-mapping, 3dsense interactive technologies blog, 2013. http://blog.3dsense.org/programming/kinect-projector-calibration-human-mapping-2/
[3] M. Mine, D. Rose, B. Yang, J. van Baar and A. Grundhöfer, Projection-Based Augmented Reality in Disney Theme Parks, Computer 45(7), pp. 32–40, 2012.
[4] O. Kreylos, Augmented Reality Sandbox, UC Davis, 2013. http://idav.ucdavis.edu/~okreylos/ResDev/SARndbox/
[5] Z. Tomori, R. Gargalik and I. Hrmo, Active Segmentation in 3D using Kinect Sensor, In Proceedings of the 20th International Conference on Computer Graphics, Visualization and Computer Vision (WSCG 2012), Part 2, Pilsen, Czech Republic, June 26–29, 2012. V. Skala, Ed., Pilsen.
[6] Z. Szalavari and M. Gervautz, The Personal Interaction Panel – a Two-Handed Interface for Augmented Reality, Computer Graphics Forum, pp. 335–346, 1997.
[7] I. Poupyrev, D. S. Tan, M. Billinghurst, H. Kato, H. Regenbrecht and N. Tetsutani, Developing a Generic Augmented-Reality Interface, Computer 35(3), pp. 44–50, 2002.
[8] Ch. Geiger, L. Oppermann and Ch. Reimann, 3D-Registered Interaction-Surfaces in Augmented Reality Space, In Augmented Reality Toolkit Workshop, pp. 5–13, 2003.
[9] H. Benko, R. Jota and A. D. Wilson, MirageTable: Freehand Interaction on a Projected Augmented Reality Tabletop, In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 199–208, 2012.
[10] S. Salvador and P. Chan, Toward Accurate Dynamic Time Warping in Linear Time and Space, Intelligent Data Analysis, 11(5):561–580, 2007.
[11] Microsoft, Kinect for Xbox 360. http://www.xbox.com/en-US/kinect
[12] Microsoft, Kinect for Windows. http://www.microsoft.com/en-us/kinectforwindows/
[13] Leap Motion, Inc., Leap Motion. https://www.leapmotion.com/
[14] Creative Technology Ltd., Creative Senz3D. http://us.creative.com/p/web-cameras/creative-senz3d
[15] Willow Garage and Itseez, OpenCV library. http://opencv.org/