<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Supplement</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.apergo.2011.09.011</article-id>
      <title-group>
        <article-title>Towards New User Interfaces Based on Gesture and Sound Identification</article-title>
      </title-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>3</volume>
      <issue>0</issue>
      <fpage>1995</fpage>
      <lpage>2006</lpage>
      <abstract>
        <p>The relatively low price of devices that capture 3D data, such as the Microsoft Kinect, will certainly accelerate the development and popularization of a new generation of user interaction in the business application domain. Although the application interfaces and libraries that ease communication with these devices are still developing and maturing, they can already be used to build business solutions. In addition, gestures and sounds provide more natural and effective ways of human-computer interaction. In this paper we present an overview and a basic comparison of the available sensing devices, together with the experience gained during the development of the solution ADORA, whose main purpose is to assist surgeons through contactless interaction.</p>
      </abstract>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Natural user interfaces</kwd>
        <kwd>Human Computer Interaction</kwd>
      </kwd-group>
      <kwd-group>
        <title>Additional Key Words and Phrases</title>
        <kwd>depth vision</kwd>
        <kwd>natural user interfaces</kwd>
        <kwd>healthcare</kwd>
        <kwd>sensors</kwd>
        <kwd>Kinect</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.5.2 [Information Interfaces and Presentation]: User Interfaces—Evaluation/methodology</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>The average user communicates with a computer using a keyboard and a mouse. The keyboard has remained at the core of computer interaction since the first commercial personal computers, and in the half century since the invention of the first computer mouse many pointing devices have been introduced, of which the best known are the trackball and the light pen. None of these devices proved better than the keyboard-and-mouse combination, so the majority of human-computer interaction (HCI) is still performed with devices that remain essentially unchanged since their invention. With the rapid advance of technology, other ways of HCI were developed: in the last decade touch-screen devices and innovative gaming interfaces came into widespread and successful practical use. Along with this development new challenges emerged, e.g. "How can we communicate with computers using complex commands, without direct physical contact?" A solution would facilitate and optimize work in many specialized domains. Alexander Shpunt [Dibbell 2011] introduced three-dimensional (3D) computer vision, enabling simple communication with and control of a computer through the user's movements (gestures) and voice commands. The sensing device records the observed space, takes an image and converts it into a synchronized data stream consisting of depth data (3D vision) and color data (similar to human vision). Depth vision technology was invented in 2005 by Alexander Shpunt, Zeev Zalevsky, Aviad Maizels and Javier Garcia [Zalevsky et al. 2007]. The technology has been established in the world of consumer
technology (gaming consoles). The Massachusetts Institute of Technology (MIT) ranked gesture-based
interface technology among the ten most successful technologies of 2011 [Dibbell 2011]. The major console
manufacturers (Microsoft, Sony and Nintendo) upgraded their gaming experience with advanced
motion-sensing interfaces. Sony and Nintendo developed wireless controllers (PlayStation Move
and Wii MotionPlus), while Microsoft's Xbox console took a completely contact-free approach with the
new Kinect sensor. The Microsoft Kinect is currently the most visible and most easily accessible gaming
controller on the market. Microsoft's decision to publish a development library, the Kinect for Windows
SDK, enabled rapid growth and paved the way for a wide range of potential applications in different
domains (medicine, robotics, interactive whiteboards, etc.). The Microsoft Kinect is the first commercial
composite interface that combines a camera for detecting body movements with facial and voice recognition.
Gartner's hype cycle for HCI technologies [Prentice and Ghubril 2012] notes that gesture-based
interface control is steadily moving towards greater productivity, a mature market and higher
returns of value. Analysts at MarketsandMarkets [Marketsandmarkets.com 2013] estimated the 2010
market value of the hardware and software that allows controlling a computer with gestures and voice
commands, such as the Microsoft Kinect, at US $200 million. According to recent market research data,
the value of contactless interaction and gesture recognition will reach US $15 billion by 2018
[Marketsandmarkets.com 2013]. HCI defines the user experience. Gesture-based interface control
enables the recognition and understanding of human body movement for the purpose of interacting with
and controlling computer systems without direct physical contact [Prentice and Ghubril 2012]. The term
"natural user interface" describes systems that the user controls without any intermediate devices. In the
last decade a big leap was made from the traditional way of managing computer software with keyboard
and mouse to contact-free HCI, primarily in the gaming field.</p>
    </sec>
    <sec id="sec-3">
      <title>2. SENSING DEVICES FOR HCI</title>
      <p>Three types of sensing technology dominate computer vision research: stereo cameras,
time-of-flight (ToF) cameras and structured light [Chen et al. 2013]. Stereo machine vision is biomimetic:
3D structure is gathered from different viewpoints, similar to human vision. Time-of-flight
cameras estimate the distance to an object with the help of light pulses emitted from a single camera;
the distance to the measured point follows from the pulse's travel time and the speed of light. ToF
devices have high precision but are very expensive. Structured-light sensors began to develop with
PrimeSense technology, which Microsoft licensed and built into its Kinect sensor. The main advantage
of structured-light sensors is their price-performance balance: the Microsoft Kinect is priced
at consumer level and still attains a sufficient level of precision [Gonzalez-Jorge et al. 2013].</p>
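      <p>As a minimal illustration of the time-of-flight principle described above (the values are assumed, not tied to any particular device), the distance follows directly from the round-trip time of the light pulse:</p>

```python
# Sketch of the time-of-flight distance calculation: a light pulse
# travels to the object and back, so the one-way distance is half of
# the round-trip time multiplied by the speed of light.

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def tof_distance(round_trip_seconds: float) -> float:
    """Distance in metres from the measured round-trip pulse delay."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# A round trip of 20 nanoseconds corresponds to roughly 3 metres.
print(round(tof_distance(20e-9), 3))  # 2.998
```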
    </sec>
    <sec id="sec-4">
      <title>2.1 Depth sensing</title>
      <p>Vision sensing devices capture distance to the real world, which cannot be obtained directly from
an image. Depth images require pre- and post-processing. Depth view enables body-part detection,
pose estimation and action recognition. Body-part and pose detection are popular topics, while action
recognition is starting to receive more research attention [Chen et al. 2013]. A lot of research [Clark
et al. 2012; Dutta 2012; Gabel et al. 2012; Khoshelham and Elberink 2012; Stoyanov et al. 2012] has been
done in the field of body-data detection and its transformation into active skeletons. There are still some
precision issues, as noted by [Khoshelham and Elberink 2012], but when high-precision tracking is not
required, currently available devices handle them quite successfully.
Microsoft first announced the Kinect sensor in 2009 under the name "Project Natal". In 2012 the Kinect
for Windows sensor was announced, which enabled the use of advanced gesture-based functionality in the
business application domain. The sensor was accompanied by a software development library, the Kinect
for Windows SDK. By February 2013 Microsoft had sold 24 million units of the Kinect sensor. In its first
sixty days on the market eight million units were sold, which earned the Kinect the title of fastest-selling
consumer electronic device and an entry in the Guinness Book of Records1. The
sensor can capture audio, color video and depth data (figure 1). Depth data is detected
with the use of infrared light, from which a covered skeleton (figure 2) of a tracked person is formed.
The Kinect sensor captures the depth image as two separate data streams: the first carries the distance
from the Kinect sensor to the nearest object in millimeters, and the second carries the segmented data
of a tracked person. The depth stream has a default resolution of 640x480
pixels; it can also be set to 320x240 or 80x60 pixels. The sensor can recognize up to six different people.
According to [Dutta 2012], the Kinect sensor was able to capture relative 3D coordinates of markers to within
0.0065 m, 0.0109 m and 0.0057 m in the x, y and z directions, with this accuracy holding over the range
from 1.0 m to 3.0 m. This means the Kinect provides very accurate data within defined ranges.</p>
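      <p>In the Kinect for Windows SDK v1, the two streams mentioned above were historically exposed as a single packed 16-bit value per pixel, with the player-segmentation index in the three low bits and the distance in millimeters in the remaining thirteen. A minimal sketch of unpacking such a value (the bit layout follows the SDK v1 legacy depth format; the helper names are our own):</p>

```python
# Unpacking a packed 16-bit Kinect v1 depth pixel: bits 0-2 hold the
# player index (0 = no tracked person), bits 3-15 the distance in mm.

PLAYER_INDEX_BITS = 3

def unpack_depth_pixel(raw: int) -> tuple[int, int]:
    """Return (depth_mm, player_index) from a packed 16-bit value."""
    player_index = raw & 0b111           # low three bits
    depth_mm = raw >> PLAYER_INDEX_BITS  # remaining thirteen bits
    return depth_mm, player_index

# A surface 1500 mm away, belonging to tracked player 2:
packed = (1500 << PLAYER_INDEX_BITS) | 2
print(unpack_depth_pixel(packed))  # (1500, 2)
```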
    </sec>
    <sec id="sec-5">
      <title>2.3 Sensors from PrimeSense family</title>
      <p>The PrimeSense sensor family has been on the HCI market from the very beginning: the inventors
of depth vision [Zalevsky et al. 2007] founded their own company, PrimeSense, and their technology spread
with the mass production of the Kinect sensor. The PrimeSense sensor family consists of the Carmine and Capri sensors.
The Capri 3D sensor is an embedded sensor whose main advantage is its small size. It targets the
market of mobile devices (mobile phones, tablet computers, smart televisions), and despite its small size it
has great potential: at the Google I/O 2013 conference a prototype tablet integration with the Capri
3D sensor was presented [Crabb 2013]. All sensors are accompanied by software development kits in
the form of open-source libraries such as OpenNI, supported by the NITE middleware.</p>
      <sec id="sec-5-1">
        <title>Notes</title>
        <p><sup>1</sup>Guinness World Records, http://www.guinessworldrecords.com.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>2.4 Asus sensor Xtion</title>
      <p>The Xtion sensor is available in two versions, Xtion and Xtion Pro Live. It is based on the same depth
vision technology as the Kinect sensor family. The Xtion is promoted exclusively as a PC sensor and does
not require an additional power supply [Gonzalez-Jorge et al. 2013]. Asus sensors are used in the same
way as the Kinect, but the OpenNI framework serves as the supporting software library; the framework can
also be used in conjunction with the Microsoft Kinect or other sensors based on PrimeSense technology.</p>
    </sec>
    <sec id="sec-7">
      <title>2.5 Leap Motion sensor</title>
      <p>The Leap Motion sensor is a small device with enormous potential that aims to change the way we interact
with computers. The sensor is placed next to the keyboard on the office desk and provides very
accurate detection of individual fingers; it is much more accurate than the Kinect sensor (up to 200 times)
[Hodson 2013]. This makes user interaction very precise and domain-specific. The main purpose of the
Leap Motion sensor is not depth vision of 3D space but exact finger detection and the integration
of this functionality with existing applications.</p>
    </sec>
    <sec id="sec-8">
      <title>2.6 MYO sensor</title>
      <p>MYO is not a depth-vision sensing device but an intelligent armband. It detects motion in two ways:
through muscle activity and through motion sensing [Nuwer 2013]. The MYO uses Bluetooth 4.0 Low Energy to
communicate with paired devices. It features an on-board rechargeable lithium-ion battery and an ARM
processor. The sensor is outfitted with proprietary muscle-activity sensors and also features a 9-axis
inertial measurement unit. Muscle activity is determined by measuring the electrical activity in muscles, known as
electromyography (EMG). Table I shows a basic comparison between the above-mentioned and other
currently available devices.</p>
      <table-wrap id="tbl-1">
        <label>Table I</label>
        <caption>
          <p>Basic comparison of currently available sensing devices.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Device</th><th>Control</th></tr>
          </thead>
          <tbody>
            <tr><td>Kinect Xbox 360</td><td>contact-free</td></tr>
            <tr><td>Kinect for Windows</td><td>contact-free</td></tr>
            <tr><td>Kinect One</td><td>contact-free</td></tr>
            <tr><td>Asus Xtion</td><td>contact-free</td></tr>
            <tr><td>PrimeSense Capri</td><td>contact-free</td></tr>
            <tr><td>PrimeSense Carmine</td><td>contact-free</td></tr>
            <tr><td>Leap Motion</td><td>contact-free</td></tr>
            <tr><td>Sony Move</td><td>with controller</td></tr>
            <tr><td>Wii MotionPlus</td><td>with controller</td></tr>
            <tr><td>MYO</td><td>armband</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>A – audio, V – video, IR – infrared; fps – frames per second; GS – gesture support; VC – voice control; EMG – electromyography.</p>
        </table-wrap-foot>
      </table-wrap>
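      <p>The EMG-based sensing of the MYO armband can be illustrated with the mean-absolute-value (MAV) feature, a classic muscle-activation measure from the electromyography literature; the window values and the threshold below are illustrative assumptions, not Thalmic's proprietary algorithm:</p>

```python
# Sketch of EMG-based activation detection using the mean absolute
# value (MAV) of a signal window -- a standard electromyography
# feature. The 0.3 threshold is an illustrative assumption.

def muscle_active(emg_window: list[float], threshold: float = 0.3) -> bool:
    """True when the mean absolute value of the EMG window exceeds
    the activation threshold."""
    mav = sum(abs(sample) for sample in emg_window) / len(emg_window)
    return mav > threshold

print(muscle_active([0.5, -0.6, 0.4, -0.5]))    # True  (strong contraction)
print(muscle_active([0.01, -0.02, 0.03, 0.0]))  # False (resting muscle)
```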
    </sec>
    <sec id="sec-9">
      <title>3. APPLICATION COMMUNICATION INTERFACES</title>
      <p>The quick adoption of gesture-based devices was enabled by the development of supporting software
libraries (SDKs). These libraries enable the development of new, innovative solutions and provide a path for
upgrading existing applications. Several libraries are publicly available. OpenKinect is an open community that
uses the Kinect sensor for research. It has a very active community, which contributes a
large set of libraries and plug-ins, thereby extending the OpenNI framework. The OpenNI core consists
of several components that take care of: (i) analysis of the entire body, (ii) analysis of individual items
(finger detection), (iii) recognition of user gestures and (iv) analysis of the environment (objects and
people) [OpenNI 2013].</p>
      <p>The Kinect SDK [Microsoft 2013] enables software developers to build interactive applications with
support for voice commands and gestures. Figure 3 shows the architecture components of the Kinect for
Windows SDK. These components include the following:</p>
      <list list-type="order">
        <list-item><p>Hardware components, including the Kinect sensor and the USB hub.</p></list-item>
        <list-item><p>Windows drivers for the Kinect, which are installed as part of the SDK. The Kinect drivers support: (a) the Kinect microphone array as a kernel-mode audio device accessible through the standard audio APIs in Windows; (b) streaming controls for audio and video (color, depth, and skeleton); (c) device enumeration functions that enable an application to use more than one Kinect.</p></list-item>
        <list-item><p>Audio and video components.</p></list-item>
        <list-item><p>A DirectX Media Object (DMO) for microphone-array beamforming and audio source localization.</p></list-item>
        <list-item><p>Windows standard APIs – the audio, speech and media APIs in Windows, as described in the SDK and the Microsoft Speech SDK [Microsoft 2013].</p></list-item>
      </list>
      <p>The Kinect library includes advanced components known as the Microsoft.Kinect.Toolkit.Controls
plug-in. The toolkit controls enable the realization of advanced functionality in desktop applications.</p>
      <p>The Point Cloud Library (PCL) [PointCloud 2013] is a standalone, large-scale, open framework for
2D/3D image and point-cloud processing that enables advanced 3D data analysis. It contains numerous
state-of-the-art algorithms that can be used to filter outliers from noisy data, stitch 3D point clouds
together, segment relevant parts of a scene, extract keypoints and compute descriptors to recognize
objects. The PCL library can also create surfaces from point clouds and visualize them [Rusu and Cousins].
Table II below summarizes current software libraries for HCI.</p>
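      <p>The outlier-filtering step mentioned above can be sketched in a few lines. PCL itself is a C++ library; the NumPy fragment below only illustrates the idea behind its statistical outlier-removal filter (drop points whose mean distance to their nearest neighbours is unusually large) and is not PCL's API:</p>

```python
import numpy as np

def remove_statistical_outliers(points: np.ndarray, k: int = 8,
                                std_ratio: float = 1.0) -> np.ndarray:
    """Keep points whose mean distance to their k nearest neighbours is
    at most std_ratio standard deviations above the global mean."""
    diffs = points[:, None, :] - points[None, :, :]  # pairwise vectors
    dists = np.linalg.norm(diffs, axis=2)            # pairwise distances
    dists.sort(axis=1)                               # column 0 is self (0.0)
    mean_knn = dists[:, 1:k + 1].mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    return points[mean_knn <= threshold]

# A tight cluster of 50 points plus one far-away outlier:
rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(0.0, 0.01, size=(50, 3)), [[5.0, 5.0, 5.0]]])
filtered = remove_statistical_outliers(cloud)
print(len(cloud), len(filtered))  # 51 50 -- the outlier is removed
```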
    </sec>
    <sec id="sec-10">
      <title>4. LESSONS LEARNED DURING DEVELOPMENT OF A GESTURE BASED SOLUTION ADORA</title>
      <p>The objective of any health care institution is to optimize the length of a surgical procedure and
increase its quality. ADORA2 is an interactive physician's assistant enabling a unique presentation of
information about a patient before and during surgical procedures. ADORA offers physicians a comprehensive
and integrated natural-user-interface experience. It is a product of domain knowledge, modern
information and communication technologies and advanced hardware. With its simple
contact-free interaction it shortens the duration of surgeries and indirectly affects the environmental
and economic aspects of healthcare. Using ADORA, physicians are able to actively participate in a
surgery even from outside the operating theatre. Modern methods of HCI (gestures and voice support) were
integrated into the ADORA solution during development, giving physicians control over patient data
through contact-free interaction. The lessons learned and best practices summarized below consist of
the following challenges:</p>
      <list list-type="order">
        <list-item><p>The design of graphical natural user interfaces adapted to support gestures and sound.</p></list-item>
        <list-item><p>Calibration of the sensor and the correct choice of the active detected person.</p></list-item>
        <list-item><p>Proper detection and identification of the sound source.</p></list-item>
        <list-item><p>The correct interpretation of voice commands.</p></list-item>
        <list-item><p>Implementation of advanced gestures and functionality not supported in the basic Kinect development library (point-based rotation, dynamic zoom with feedback, traceable and flexible display of DICOM images [König 2005]).</p></list-item>
      </list>
      <p>The first challenge was to design an intuitive and adaptive graphical interface that supported
communication with the Kinect sensor. It is very important to give the user clear feedback (either
graphical or voice), especially when designing interactive applications. It is also recommended to
include interactive help in which the user can practice the application's gestures and voice
commands; this enables a user to learn which gestures and sound commands are required
to control the desired functionalities.</p>
      <sec id="sec-10-1">
        <title>Notes</title>
        <p><sup>2</sup>Advanced Doctor's Operational Research Assistant, http://www.adora-med.com.</p>
        <p>During the graphical user interface design, we had to move away
from the "classic" design of user interfaces and address the challenges associated with the new ways of
interaction typical for gesture-based applications. Analysis of existing gesture-based solutions
and the Kinect user-interface design guidelines helped us in the process of natural user interface design.
The user interface had to be adapted to the new user controls. Another challenge was Kinect sensor
calibration. Problems arose when detecting users at a variety of distances from the sensor and when
detecting the primary user in the presence of other surgeons. In the operating room there are at least
three people standing close to each other, and the Kinect sensor detects all of them as active bodies
(skeletons). Kinect's correction factors helped us calibrate the sensor according to the scope and area
of usage. Detection and tracking of the primary user was solved with voice commands: the Kinect
sensor detects the direction of the sound source, which allowed us to locate the primary person in the room.
The primary user is selected by executing the voice command "Follow me"; the user who executed the command
becomes the active person and the tracked skeleton. Detection and understanding of voice commands was
a special challenge that required knowledge of foreign languages and their correct pronunciation.
Kinect language support is limited to thirteen languages. Recognition of voice commands takes place
in stages. First, the sensor compares the sounds with the selected grammar. A grammar can be
determined dynamically or defined statically in the form of an XML document. Synonyms for each voice
command can also be defined; for example, the voice command "next" can have synonyms such as
"forward" and "continue". Detection of a voice command returns a detection level called the "confidence
level ratio". The ratio has a value from zero to one, where zero means a very weak detection and
one a very strong detection. For a voice command to be detected successfully, the proper accent,
pronunciation and intonation must be used. Sound commands allow navigation through the
application, item selection and object manipulation. The special digital medical image format DICOM (Digital
Imaging and Communications in Medicine) [Blazona and Koncar 2007] is used in healthcare. One of
the challenges was the magnification functionality. In addition to the vertical and horizontal axes, depth
information from the Z-axis was needed; the Z-axis represents the distance between the person and
the sensor. A correct zoom factor had to be calculated. The problem was solved by adapting to the speed
of hand movement. This kind of image magnification proved very efficient, since the user is able
to control both the speed and the direction of zooming. It was also necessary to determine the center point of
the zoom; this was solved with a coordinate transformation from the current hand position on
the widget to a pixel point on the image itself.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSION</title>
      <p>The revolution that began in the living rooms of innovative researchers continues. The technological
challenges of 3D vision have been successfully addressed. Rapid growth is expected in sensors that enable
skeleton identification, finger gestures and voice commands, and especially in applications that
incorporate their functionality. Applications that enable contactless control of computers and
other devices will slowly but surely begin to replace existing ones. These devices are already changing the
way we communicate with computers, and the internet-of-things effect is becoming more and more visible.
New technologies are replacing devices such as the keyboard and mouse: touchscreens have
already transformed the way we interact with mobile devices, and the next step is the everyday use of
gestures and voice commands for computer interaction. There is still some work to be
done. Accuracy and precision decrease with range, which limits the usage of such devices to specific
domains. Accuracy issues can be addressed with alternative types of sensing that gather motion data
in ways not based on depth vision (e.g. the MYO armband). Sensing devices will be
built into monitors, laptops, televisions and mobile devices. Given the incredible development of
increasingly advanced and intuitive electronic devices, we can predict that in the near future systems
with similar (if not better) functionality to those seen in science fiction movies will be in use.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>