=Paper=
{{Paper
|id=Vol-1190/paper2
|storemode=property
|title=User Interface Paradigms for Visually Authoring Mid-Air Gestures: A Survey and a Provocation
|pdfUrl=https://ceur-ws.org/Vol-1190/paper2.pdf
|volume=Vol-1190
|dblpUrl=https://dblp.org/rec/conf/eics/BaytasYO14
}}
==User Interface Paradigms for Visually Authoring Mid-Air Gestures: A Survey and a Provocation==
Mehmet Aydın Baytaş (Design Lab, Koç University, 34450 İstanbul), Yücel Yemez (Department of Computer Engineering, Koç University, 34450 İstanbul), Oğuzhan Özcan (Design Lab, Koç University, 34450 İstanbul) – {mbaytas, yyemez, oozcan}@ku.edu.tr

===ABSTRACT===
Gesture authoring tools enable the rapid and experiential prototyping of gesture-based interfaces. We survey visual authoring tools for mid-air gestures and identify three paradigms used for representing and manipulating gesture information: graphs, visual markup languages and timelines. We examine the strengths and limitations of these approaches and propose a novel paradigm for authoring location-based mid-air gestures, based on space discretization.

Author Keywords: Gestural interaction; gesture authoring; visual programming; interface prototyping.

ACM Classification Keywords: H.5.2 Information Interfaces & Presentation (e.g. HCI): User Interfaces.

===INTRODUCTION===
The recent proliferation of commercial input devices that can sense mid-air gestures, led by the introduction of the Nintendo Wii and the Microsoft Kinect, has enabled both professional developers and end-users to harness the power of full-body gestural interaction. However, despite the availability of the hardware, applications that leverage gestural interaction have not been thriving. A striking fact is that while the Kinect has broken records as the fastest-selling consumer electronics device in history, sales of games that utilize the Kinect have been poor [5]. This has been associated with design and user experience issues stemming from difficulties in designing and developing software [7]. Specifically, for both adept programmers and comparatively non-technical but creative users such as students, designers, artists and hobbyists, the amount of time, effort and domain-specific knowledge required to implement custom gestural interactions is prohibitive.

Ongoing research aims to support gestural interaction design and development with gesture authoring tools. These tools aim at enabling rapid and experiential prototyping, which are essential practices for creating compelling designs [2]. However, few projects have gained widespread adoption. One issue that contributes to the low rate of adoption is the difficulty of balancing the trade-offs between the complexity and the expressive power of the paradigm used to represent and manipulate gesture information: interfaces employed for gesture authoring may become convoluted and difficult to use in order to fully tap into the expressive power of human gesture, or they may omit useful features as they aim for usability and rapidity.

In this paper, we survey existing paradigms for visually authoring mid-air gestures and present a provocation, a novel gesture authoring paradigm, which we have implemented in the form of an end-to-end application for introducing gesture control to existing software and novel prototypes.

The rest of this paper is organized as follows: We first present three user interface paradigms – graphs, visual markup languages and timelines – used in current visual gesture authoring tools. Existing implementations of each paradigm are examined and discussed in terms of their capabilities and limitations. Results from evaluations with real users, if published, are emphasized. We then present a provocation in the form of a novel user interface paradigm for authoring mid-air gestures, based on space discretization and influenced by existing paradigms. We discuss future work and conclude by presenting a summary of our results.

===PARADIGMS FOR AUTHORING MID-AIR GESTURES===
Authoring tools for mid-air gestural interfaces are still in their infancy. Development tools provided by vendors of gesture-sensing input devices are focused on textual programming. Ongoing research suggests a set of diverse approaches to the problem of how to represent and manipulate three-dimensional gesture data.
Existing works approach the issue in three ways that constitute distinct paradigms. These are:
1. using 2-dimensional graphs of the data from the sensors that detect movement;
2. using a visual markup language; and
3. representing movement information using a timeline of frames.

These paradigms often interact with two programming approaches: demonstration and declaration. Programming by demonstration enables developers to describe behavior by example. In the case of gestures, many examples of the same behavior are often provided in order to account for the differences in gesturing between users and over time. Declarative programming of gestures involves describing behavior using a high-level specification language, which may be textual or graphical.

The paradigms listed above do not have to be used exclusively, and neither do demonstration and declarative programming. Aspects of different paradigms may find their place within the same authoring tool. A popular approach is to introduce gestures by demonstration, convert the gesture data into a visual representation, and then declaratively modify it.

In this section, we describe the above approaches in detail, with examples from the literature. We comment on their strengths and weaknesses based on evaluations conducted with software tools that implement them.

====Using Graphs of Movement Data====
Visualizing and manipulating movement data using 2-dimensional graphs that represent low-level kinematic information is a popular approach for authoring mid-air gestures. This approach is often preferred when gesture detection is performed using inertial sensors such as accelerometers and gyroscopes. It also accommodates other sensors that read continuously variable data such as bending, light and pressure. Commonly, the horizontal axis of the graph represents time while the vertical axis corresponds to the reading from the sensor. Often a "multi-waveform" occupies the graph, in order to represent data coming in from multiple axes of the sensor. Below, we study three software tools that implement graphs for representing gesture data: Exemplar, MAGIC and GIDE.

====Exemplar====
Exemplar [3] relies on demonstration to acquire gesture data from a variety of sensors: accelerometers, switches, light sensors, bend sensors, pressure sensors and joysticks. Once a signal is acquired via demonstration, the developer marks, on the resulting graph, the area of interest that corresponds to the desired gesture. The developer may interactively apply filters on the signal for offset, scaling, smoothing and first-order differentiation. (Figure 1)

Figure 1: The Exemplar gesture authoring environment [3]. From left to right, the interface reflects the developer's workflow: data from various sensors connected to the system is displayed as thumbnails and the sensor of interest is selected; filters are applied to the incoming signal; areas of interest are marked for pattern recognition or thresholds are set; and the resulting gesture is mapped to output events.

Exemplar offers two methods for recognition. One is pattern matching, where the developer introduces many examples of a gesture using the aforementioned method and new input is compared to the examples. The other is thresholding, where the developer manually introduces thresholds on the raw or filtered graph and gestures are recognized when motion data falls between the thresholds. This type of thresholding also supports hysteresis, where the developer introduces multiple thresholds that must be crossed for a gesture to be registered.

Exemplar's user studies suggest that this implementation of the paradigm is successful in increasing developer engagement with the workings and limitations of the sensors used. Possible areas of improvement include a technique for visualizing multiple sensors and events at once, and finer control over timing for pattern matching.
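The thresholding behavior described for Exemplar is straightforward to express in code. The sketch below is a minimal illustration of one way hysteresis thresholding can work, not Exemplar's actual implementation: a detector over a single filtered sensor channel arms when the signal crosses an upper threshold and registers an event only once the signal falls back below a lower threshold, so that noise hovering around a single threshold does not retrigger. The function name and the threshold values are illustrative.

<syntaxhighlight lang="python">
def detect_with_hysteresis(samples, arm_threshold, fire_threshold):
    """Register one event per excursion of the signal.

    The detector 'arms' when the signal rises above arm_threshold and
    'fires' only when it subsequently drops below fire_threshold, so noise
    oscillating around a single threshold does not cause repeated triggers.
    Returns the sample indices at which events are registered.
    """
    events = []
    armed = False
    for i, value in enumerate(samples):
        if not armed and value > arm_threshold:
            armed = True              # upper threshold crossed: arm
        elif armed and value < fire_threshold:
            events.append(i)          # lower threshold crossed: register event
            armed = False
    return events

# A smoothed accelerometer magnitude with one clear peak triggers one event:
signal = [0.1, 0.2, 0.9, 1.4, 1.2, 0.8, 0.3, 0.1]
print(detect_with_hysteresis(signal, arm_threshold=1.0, fire_threshold=0.4))  # [6]
</syntaxhighlight>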
====System for Multiple Action Gesture Interface Creation (MAGIC)====
Ashbrook and Starner's MAGIC [1] is another tool that implements the 2-dimensional graphing paradigm. The focus of MAGIC is programming by demonstration. It supports the creation of training sets with multiple examples of the same gesture. It allows the developer to keep track of the internal consistency of the provided training set, and to check against conflicts with other gestures in the vocabulary and with an "Everyday Gesture Library" of unintentional, automatic gestures that users perform during daily activities. MAGIC uses the graph paradigm only to visualize gesture data and does not support manipulation on the graph. (Figure 2)

Figure 2: MAGIC's gesture creation interface [1].

One important feature in MAGIC is that the motion data graph may be augmented by a video of the gesture example being performed. Results from user studies indicate that this feature has been highly favored by users, during both gesture recording and retrospection. Interestingly, it is reported that the "least-used visualization" in MAGIC "was the recorded accelerometer graph", with most users being "unable to connect the shape of the three lines" that correspond to the three axes of the accelerometer reading "to the arm and wrist movements that produced them." Features preferred by developers turned out to be the videos, the "goodness" scores assigned to each gesture according to how well they match gestures in and not in their own class, and a sorted list depicting the "distance" of a selected example to every other example.
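MAGIC's consistency and conflict checks can be approximated with a simple distance-based sketch. The snippet below is our illustration and is not based on MAGIC's published implementation; it assumes that gesture examples are 1-dimensional sample sequences and uses dynamic time warping, a common choice for comparing gesture signals, to report the nearest distance from a new example to each gesture class and to flag classes that come at least as close as the example's own class.

<syntaxhighlight lang="python">
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sample sequences."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def consistency_report(training_set, example, own_class):
    """For one example, find its nearest neighbor in every gesture class and
    flag the classes that are at least as close as the example's own class.

    training_set maps class labels to lists of previously recorded examples.
    """
    nearest = {label: min(dtw_distance(example, other) for other in others)
               for label, others in training_set.items()}
    conflicts = [label for label, dist in nearest.items()
                 if label != own_class and dist <= nearest[own_class]]
    return nearest, conflicts
</syntaxhighlight>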
====Gesture Interaction Designer (GIDE)====
More recently, GIDE [8] features an implementation of the graph paradigm for authoring accelerometer-based mid-air gestures. GIDE leverages a "modified" hidden Markov model approach to learn from a single example for each gesture in the vocabulary. The user interface implements two distinct features: (1) each gesture in the vocabulary is housed in a "gesture editor" component which contains the sensor waveform, a video of the gesture being performed, an audio waveform recorded during the performance, and other information related to the gesture; (2) a "follow" mode allows the developer to perform gestures and get real-time feedback on the system's estimate of which gesture is being performed (via transparency and color) and where they are within that gesture. (Figure 3) This feedback on the temporal position within a gesture is multimodal: the sensor multi-waveform, the video and the audio waveform from the video are aligned and follow the gestural input. GIDE also supports "batch testing" by recording a continuous performance of multiple gestures and running it against the whole vocabulary to check whether the correct gestures are recognized at the correct times.

Figure 3: The "follow" mode in the GIDE interface [8].

User studies on GIDE reveal that the combination of multi-waveform, video and audio was useful in making sense of gesture data. Video was favored particularly since it allows developers to still remember the gestures they recorded after an extended period of not working on the gesture vocabulary. Another finding was the suggestion that the "batch testing" feature, where the developer records a continuous flow of many gestures to test against, could be leveraged as a design strategy: gestures could be extracted from a recorded performance of continuous movement.

====Discussion====
Graphs that display acceleration data seem to be the standard paradigm for representing mid-air gestures tracked using acceleration sensors. This paradigm supports direct manipulation for segmenting and filtering gesture data, but manipulating acceleration data directly to modify gestures is unwieldy. User studies show that graphs depicting accelerometer (multi-)waveforms are not effective as the sole representation of a gesture, but work well as a component within a multimodal representation along with video.

====Visual Markup Languages====
Using a visual markup language for authoring gestures can allow for rich expression and may accommodate a wide variety of gesture-tracking devices, e.g. accelerometers and skeletal tracking, at the same time. The syntax of these visual markup languages can be of varying degrees of complexity, but depending on the sensor(s) used for gesture detection, making use of the capabilities of the hardware may not require a very detailed syntax. In this section we examine a software tool, EventHurdle, that implements a visual markup language for gesture authoring, and we discuss a gesture spotting approach based on control points which has not been implemented as a gesture authoring tool but provides valuable insight.

====EventHurdle====
Kim and Nam describe a declarative, hurdle-driven visual gesture markup language implemented in the EventHurdle authoring tool [6]. The EventHurdle syntax supports gesture input from single-camera-based, physical sensor-based and touch-based devices. In lieu of a timeline or graph, EventHurdle projects the gesture trajectory onto a 2-dimensional workspace. The developer may perform gestures, see the resulting trajectory on the workspace, and declaratively author gestures on the workspace by placing "hurdles" that intersect the gesture trajectory. Hurdles may be placed in ways that result in serial, parallel and/or recursive compositions. (Figure 4) "False hurdles" are available for specifying unwanted trajectories. While this is an intuitive way to visualize movement data from pointing devices, touch gestures and blob detection, the approach does not support the full range of expression inherent in 3-dimensional mid-air gesturing.

Figure 4: EventHurdle's visual markup language allows for a variety of compositions: (from top left) a simple gesture with one hurdle; serial and parallel compositions; combinations of serial and parallel compositions; recursive gesturing. [6]

Gestures defined in EventHurdle are configurable to be location-sensitive or location-invariant. By design, orientation- and scale-invariance are not implemented, in order to avoid unnecessary technical options that may distract from "design thinking."

User studies on EventHurdle report that the concept of hurdles and paths is "easily understood" and that it "supports advanced programming of gesture recognition." Beyond this, the published user studies focus on supporting features rather than on the strengths and weaknesses of the paradigm or comparisons with other paradigms.
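To make the hurdle concept concrete before moving on, the sketch below checks a serial composition only: a 2-dimensional trajectory, as projected onto EventHurdle's workspace, must cross a list of hurdles, modeled here as line segments, in order, and must not cross any false hurdle. This is our illustration of the idea under those simplifying assumptions, not code from the EventHurdle tool.

<syntaxhighlight lang="python">
def _ccw(a, b, c):
    """Signed area test used for the segment intersection check."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    return (_ccw(p1, p2, q1) * _ccw(p1, p2, q2) < 0 and
            _ccw(q1, q2, p1) * _ccw(q1, q2, p2) < 0)

def matches_serial_hurdles(trajectory, hurdles, false_hurdles=()):
    """Check that a 2-D point trajectory crosses each hurdle in order and
    never crosses a 'false' hurdle marking an unwanted trajectory."""
    next_hurdle = 0
    for p, q in zip(trajectory, trajectory[1:]):
        if any(segments_intersect(p, q, a, b) for a, b in false_hurdles):
            return False
        if next_hurdle < len(hurdles):
            a, b = hurdles[next_hurdle]
            if segments_intersect(p, q, a, b):
                next_hurdle += 1
    return next_hurdle == len(hurdles)

# A rightward swipe crossing a single vertical hurdle placed at x = 0.5:
swipe = [(0.0, 0.5), (0.3, 0.5), (0.7, 0.5), (1.0, 0.5)]
print(matches_serial_hurdles(swipe, hurdles=[((0.5, 0.0), (0.5, 1.0))]))  # True
</syntaxhighlight>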
====Control Points====
Hoste, De Rooms and Signer describe a versatile and promising approach that uses spatiotemporal constraints around control points to describe gesture trajectories [4]. While the focus of the approach is on gesture spotting (i.e. the segmentation of a continuous trajectory into discrete gestures) and not gesture authoring, they do propose a human-readable and manipulable external representation. (Figure 5) This external representation has significant expressive power and supports programming constructs such as negation (for declaring unwanted trajectories) and user-defined temporal constraints.

Figure 5: Using control points to represent gestures [4]. (Left) A "noisy" gesture still gets picked up due to relaxed boundaries around control points. (Right) Negation is introduced via vertical boundaries so that large movements in the vertical axis are distinguished from the desired gesture.

While the authors' approach is to infer control points for a desired gesture from an example, the representation they propose also enables the manual placement of control points. The authors do not describe an implementation that has been subjected to user studies. However, they discuss a number of concepts that add to the expressive power of using control points as a visual markup language to represent and manipulate gesture information. The first is that it is possible to add temporal constraints to the markup; i.e. a floor or ceiling value can be specified for the time taken by the tracked limb or device to travel between control points. This is demonstrated not on the graphical markup (where it could be done easily), but on textual code generated to describe a gesture – another valuable feature. The second concept is that the control points are surrounded by boundaries whose size can be adjusted to introduce spatial flexibility and accommodate "noisy" gestures. Third, boundaries can be set for negation when the variation in the gesture trajectory is too large. The authors discuss linear or planar negation boundaries only, but introducing negative control points into the syntax could also be explored. Finally, a "coupled recognition process" is introduced, where a trained classifier can be called to distinguish between potentially conflicting gestures, e.g. a circle and a rectangle that share the same control points.

One limitation of this approach is the lack of support for scale invariance. One way of introducing scale invariance may be to automatically scale boundary sizes and temporal constraints with the distance between control points. However, it is likely that the relationship between the optimal values for these variables is nonlinear, which could make automatic scaling infeasible.
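As a rough illustration of how control points, adjustable boundaries and temporal constraints combine, the sketch below spots a gesture in a stream of timestamped 3-dimensional positions. It simplifies the representation described in [4]: boundaries are fixed-radius spheres, only a ceiling value on the travel time between consecutive control points is enforced, and the radius and timing values are arbitrary placeholders.

<syntaxhighlight lang="python">
import math

def spot_gesture(samples, control_points, radius=0.15, max_step_time=0.5):
    """Scan a timestamped 3-D trajectory for a gesture given as an ordered
    list of control points.

    A control point is 'hit' when the tracked position enters a sphere of
    the given radius around it, and consecutive hits must occur within
    max_step_time seconds of each other; otherwise matching restarts.
    samples is a list of (timestamp, (x, y, z)) tuples. Returns the time at
    which the last control point was hit, or None if the gesture never
    completes.
    """
    index, last_hit = 0, None
    for t, position in samples:
        if index > 0 and t - last_hit > max_step_time:
            index, last_hit = 0, None                     # too slow: restart
        if index < len(control_points) and \
                math.dist(position, control_points[index]) <= radius:
            index, last_hit = index + 1, t
            if index == len(control_points):
                return t                                  # gesture spotted
    return None
</syntaxhighlight>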
====Discussion====
The expressive power and usability of a visual markup language may vary drastically depending on the specifics of the language and the implementation. The general advantage of this paradigm is that it is suitable for describing and manipulating location-based gesture information (rather than the acceleration-based information commonly depicted using graphs). This makes a visual markup language suitable for mid-air gestures detected by depth-sensing cameras, where the interaction space is fixed and the limbs of the users move in relation to each other. Either the motion sensing device or certain parts of the skeletal model could be used to define a reference frame, and gesture trajectories could then be authored in a location-based manner using a visual markup language.

====Timelines====
Timelines of frames are commonly used in video editing applications. They often consist of a series of ordered thumbnails and/or markers that represent the content of the moving picture and any editing done on it, such as adding transitions.

====Gesture Studio====
One application that implements a timeline to visualize gesture information is the commercial Gesture Studio (http://gesturestudio.ca/). The application works only with sensors that detect gestures through skeletal tracking using an infrared depth camera. Developers introduce gestures in Gesture Studio by demonstration, through performing and recording examples. The timeline is used to display thumbnails for each frame of the skeleton information coming from the depth sensor. The timeline is updated after the developer finishes recording a gesture, while during recording a rendering of the skeletal model tracked by the depth sensor provides feedback. After recording, the developer may remove unwanted frames from the timeline to trim gesture data for segmentation. Reordering frames is not supported, since gestures are captured at a high frame rate (depending on the sensor, usually around 30 frames per second), which would make manual frame-by-frame editing inconvenient. The process through which these features have been selected is opaque, since there are no published studies that present the design process or evaluate Gesture Studio in use.
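A timeline-based tool of this kind essentially treats a demonstrated gesture as an ordered list of skeleton frames that can be trimmed but not reordered. The sketch below shows that minimal data model; it is our illustration rather than Gesture Studio code. At roughly 30 frames per second, a two-second demonstration yields about 60 frames, which is why trimming, rather than frame-by-frame editing, is the supported manipulation.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Frame:
    timestamp: float                               # seconds since recording started
    joints: Dict[str, Tuple[float, float, float]]  # joint name -> (x, y, z)

def trim(recording: List[Frame], first: int, last: int) -> List[Frame]:
    """Keep frames first..last and discard the idle frames before and after
    the actual gesture; trimming for segmentation is the only manipulation."""
    return recording[first:last + 1]
</syntaxhighlight>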
====Discussion====
In gesture authoring interfaces, timelines make sense when gesture tracking encompasses many limbs and dynamic movements that span more than a few seconds. Spatial and temporal concerns for gestures in 2 dimensions, such as those performed on surfaces, can be represented on the same workspace. The representation of mid-air gestures requires an additional component, such as a timeline, to show the change over time.

{| class="wikitable"
|+ Table 1: Summary of studies on systems that exemplify three user interface paradigms for visually authoring mid-air gestures.
! System !! UI Paradigm !! Programming Approach !! Insights from user studies
|-
| Exemplar [3] || Graphs || Demonstration || Increases engagement with sensor workings and limitations.
|-
| MAGIC [1] || Graphs (multi-waveform) || Demonstration || Users unable to connect waveform to physical movements. Optional video is favored.
|-
| GIDE [8] || Graphs (multi-waveform with video) || Demonstration || Multimodal representation helps make sense of gesture data.
|-
| EventHurdle [6] || Visual markup language || Declaration || Easily understood. Supports "advanced" programming.
|-
| Control Points [4] || Visual markup language || Declaration / Demonstration || Not implemented.
|-
| Gesture Studio || Timeline || Demonstration || Not published.
|}

====Discussion====
We have presented a number of systems that exemplify three user interface paradigms for visually authoring mid-air gestures for computing applications (see Table 1 for a summary). For sensor-based gesturing, the standard paradigm used to represent gesture information appears to be projecting the sensor waveforms onto a graph. Graphs appear to work well as components that represent sensor-based gestures, allow experimentation with filters and gesture recognition methods, and support direct manipulation to some extent. User studies show that while graphs alone may not allow developers to fully grasp the connection between movements and the waveform [1], they have been deemed useful as part of a multimodal gesture representation [8]. Using hurdles as a visual markup language offers an intuitive and expressive medium for gesture authoring, but it is not able to depict fully 3-dimensional gestures. Using spherical control points may be more conducive to direct manipulation while still affording an expressive syntax, but no implementation of this paradigm exists for authoring mid-air gestures. Finally, timelines of frames may come in handy for visualizing dynamic gestures with many moving elements, such as in skeletal tracking; but used in this fashion they allow only visualization and not manipulation.

There are paradigms that allow for the authoring of sensor-based gestures both declaratively and through demonstration. For skeletal tracking interfaces, tools based on demonstration exist, but we have not come across visual declarative programming tools. In the next section, we propose a user interface paradigm for declaratively authoring mid-air gestures for skeletal tracking interfaces.

===PROVOCATION: SPACE DISCRETIZATION AS A NOVEL PARADIGM FOR AUTHORING MID-AIR GESTURES===
The paradigms that we surveyed above each have their strengths and weaknesses. We wish to propose a novel paradigm for declaratively authoring mid-air gestures, which we will call space discretization. This paradigm conceptually supports both declaration and demonstration as ways to introduce gestures, and direct manipulation to edit them. The paradigm is adaptable for sensor-based interactions and touch gestures. We will present a rendition aimed at authoring gestures for skeletal tracking interfaces.
====Overview and Implementation====
We have implemented this paradigm as part of an application called Hotspotizer. The application has been developed as an end-to-end suite to facilitate rapid prototyping of gesture-based interactions and adapting arbitrary interfaces for gesture control. Collections of gestures can be created, saved, loaded, modified and mapped to a keyboard emulator within the application. The current version is configured to work with the Microsoft Kinect sensor and is available online as a free download at http://designlab.ku.edu.tr/design-thinking-research-group/hotspotizer/.

The paradigm we implemented works by partitioning the space around the tracked skeletal model into discrete spatial compartments. In a manner that is similar to the use of control points in Hoste, De Rooms and Signer's approach, these discrete compartments can be marked and activated to become "hotspots" that register movement when a tracked limb enters them. (Figure 6) Our approach may be likened to modifying the control points paradigm to use cubic instead of spherical boundaries and to allow the placement of control points only at discrete locations in space. This is due to the difficulty of manipulating continuously moveable control points in 3 dimensions. Furthermore, using discrete hotspots instead of control points allows the boundaries of the control points to take custom shapes rather than spheres only. Considering the precision of current skeletal tracking devices, the difficulty of manipulating free-form regions rather than discrete compartments does not pay off.

Figure 6: A 2-dimensional "Z" gesture defined using ordered hotspots in discretized space.

In Hotspotizer, the compartments are cubes that measure 15 cm on each side and the workspace is a cube, 300 cm on each side, whose centroid is fixed to the tracked skeleton's "hip center" joint returned by the Kinect sensor. (Figure 7) The workspace has been sized to accommodate larger users, and the compartments have been sized, through empirical observations, to reflect the sensor's precision. The alignment of the workspace to the user's body results in gestures being location-invariant with respect to the user's position relative to the depth camera. However, gestures in Hotspotizer are always location-dependent with respect to the gesturing limb's position relative to the rest of the body. Scale- and orientation-invariance are not automatically supported, but it is possible to arrange hotspots in creative ways that allow the same gesture to be executed on different scales.

Figure 7: A 3-dimensional "swipe" gesture to be performed with the right hand, implemented in Hotspotizer. The front view (A) and the side view (B) depict the third frame, selected from the timeline (C). The 3D viewport (D) depicts all three frames, using transparency to imply the order.

Splitting gesture data into frames, which are navigated using a timeline, supports authoring dynamic movements. The side view and front view grids only display hotspots that belong to one frame at a time, since placing all of the hotspots that belong to different frames of a gesture on the same grids results in a convoluted interface. During gesture tracking, if the tracked limb enters any one of the hotspots that belong to a frame, the entire frame registers a "hit." For a gesture to be registered, its frames must be hit in the correct order and the time that elapses between subsequent frames registering a hit must not exceed a pre-defined timeout. Conceptually the timeout could be adjustable; in the current implementation, for the sake of a simple user interface, it is hard-coded to 500 ms.

In essence, we propose a design for an expressive user interface paradigm for authoring mid-air gestures detected through skeletal tracking. Aspects of this design are based on the control points paradigm described in [4]. We modified the paradigm to confine the locations of the control points to discrete pre-defined locations and to use cubic control point boundaries of fixed size, which can be added together to create custom shapes. We also introduce a timeline component so that spatial and temporal constraints can be manipulated unambiguously.
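The recognition logic described above can be summarized in a short sketch: limb positions are quantized into 15 cm cells of a 300 cm workspace centered on the hip-center joint, and a gesture is an ordered list of frames, each a set of hotspot cells, that must be hit in sequence within a 500 ms timeout. The sketch mirrors the parameters reported here but is our own illustrative reconstruction, not Hotspotizer's source code; coordinates are assumed to be in centimeters.

<syntaxhighlight lang="python">
import time

CELL_SIZE_CM = 15.0       # edge length of one cubic compartment
WORKSPACE_CM = 300.0      # edge length of the body-centered workspace cube
FRAME_TIMEOUT_S = 0.5     # maximum time between consecutive frame hits (500 ms)

def to_cell(limb_cm, hip_center_cm):
    """Quantize a limb position into integer grid coordinates relative to the
    hip center, so gestures are invariant to where the user stands."""
    return tuple(int((l - h + WORKSPACE_CM / 2) // CELL_SIZE_CM)
                 for l, h in zip(limb_cm, hip_center_cm))

class FrameSequenceRecognizer:
    """A gesture is an ordered list of frames; each frame is a set of hotspot
    cells, and entering any cell of the current frame counts as a hit."""

    def __init__(self, frames):
        self.frames = [set(f) for f in frames]
        self.index = 0          # index of the frame we are waiting for
        self.last_hit = None    # time at which the previous frame was hit

    def update(self, limb_cm, hip_center_cm, now=None):
        now = time.monotonic() if now is None else now
        if self.index > 0 and now - self.last_hit > FRAME_TIMEOUT_S:
            self.index, self.last_hit = 0, None       # timed out: start over
        if to_cell(limb_cm, hip_center_cm) in self.frames[self.index]:
            self.index, self.last_hit = self.index + 1, now
            if self.index == len(self.frames):
                self.index, self.last_hit = 0, None
                return True                           # whole gesture registered
        return False

# A two-frame rightward swipe of the right hand at hip height: start near the
# body's midline, end about half a meter to the right.
swipe = FrameSequenceRecognizer([{(10, 10, 10)}, {(13, 10, 10)}])
</syntaxhighlight>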
The interface may be 2 http://designlab.ku.edu.tr/design-thinking-research- extended to allow the introduction of gestures through group/hotspotizer/ 13 demonstration, by inferring hotspots automatically from supporting features onto this paradigm and evaluate its recorded gestures. performance in use by developers. “Negative hotspots” to mark compartments that should not ACKNOWLEDGEMENT be crossed when gesturing are a possibility for future The work presented in this paper is part of research iterations on Hotspotizer. So is supporting gestures supported by the Scientific and Technological Research performed by multiple limbs; possibly by using a multi- Council of Turkey (TÜBİTAK), project number 112E056. track timeline and coupling keyframes where movements of the limbs should be synchronized. REFERENCES In order to describe more complex gestures, it may make 1. Ashbrook, D. and Starner, T. MAGIC: A Motion Gesture sense to introduce classifier-coupled gesture recognition. Design Tool. Proceedings of the 28th international conference on Human factors in computing systems - CHI ’10, ACM One shortage of the paradigm is that it does not Press (2010), 2159. accommodate the repeated usage of hotspots within different frames of a gesture well. If a gesture requires that 2. Buxton, B. Sketching User Experiences: Getting the Design a certain hotspot be hit twice, for example, the current Right and the Right Design. Morgan Kaufmann, Boston, 2007. implementation does not afford a way of detecting whether 3. Hartmann, B., Abdulla, L., Mittal, M., and Klemmer, S.R. the first or the second hit is registered as a user performs the Authoring sensor-based interactions by demonstration with gesture. direct manipulation and pattern recognition. Proceedings of the SIGCHI conference on Human factors in computing Finally, as the precision of skeletal tracking devices systems - CHI ’07, ACM Press (2007), 145. increases and in order to accommodate devices that track smaller body parts such as the hands, adjustable workspace 4. Hoste, L., De Rooms, B., and Signer, B. Declarative Gesture and compartment sizing may be introduced. Spotting Using Inferred and Refined Control Points. Proceedings of the 2nd International Conference on Pattern Formative evaluations have been conducted throughout the Recognition Applications and Methods (ICPRAM 2013), development Hotspotizer, focusing on prioritizing features (2013). and the visual design of the interface. Results of these, 5. Hughes, D. Microsoft Kinect shifts 10 million units, game along with summative evaluations that compare the sales remain poor. HULIQ, 2012. application to existing solutions and uncover user strategies http://www.huliq.com/10177/microsoft-kinect-shifts-10- for using the tool will be published in the future. million-units-game-sales-remain-poor. 6. Kim, J.-W. and Nam, T.-J. EventHurdle. Proceedings of the CONCLUSION SIGCHI Conference on Human Factors in Computing Systems We reviewed existing paradigms for authoring mid-air - CHI ’13, ACM Press (2013), 267. gestures and discussed how graphs of sensor waveforms are suitable components that represent acceleration-based 7. Stein, S. Kinect, 2011: Where art thou, motion? CNET, 2011. http://www.cnet.com/news/kinect-2011-where-art-thou- gesture data; how visual markup languages are better suited motion/. for location-based gesture data; and how timelines are used to communicate dynamic gesturing. We presented a novel 8. Zamborlin, B., Bevilacqua, F., Gillies, M., and D’inverno, M. 
===REFERENCES===
1. Ashbrook, D. and Starner, T. MAGIC: A Motion Gesture Design Tool. Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI '10), ACM Press (2010), 2159.
2. Buxton, B. Sketching User Experiences: Getting the Design Right and the Right Design. Morgan Kaufmann, Boston, 2007.
3. Hartmann, B., Abdulla, L., Mittal, M., and Klemmer, S.R. Authoring sensor-based interactions by demonstration with direct manipulation and pattern recognition. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '07), ACM Press (2007), 145.
4. Hoste, L., De Rooms, B., and Signer, B. Declarative Gesture Spotting Using Inferred and Refined Control Points. Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2013), (2013).
5. Hughes, D. Microsoft Kinect shifts 10 million units, game sales remain poor. HULIQ, 2012. http://www.huliq.com/10177/microsoft-kinect-shifts-10-million-units-game-sales-remain-poor.
6. Kim, J.-W. and Nam, T.-J. EventHurdle. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13), ACM Press (2013), 267.
7. Stein, S. Kinect, 2011: Where art thou, motion? CNET, 2011. http://www.cnet.com/news/kinect-2011-where-art-thou-motion/.
8. Zamborlin, B., Bevilacqua, F., Gillies, M., and D'Inverno, M. Fluid gesture interaction design. ACM Transactions on Interactive Intelligent Systems 3, 4 (2014), 1–30.