Web-Powered Virtual Site Exploration Based on Augmented 360 Degree Video via Gesture-Based Interaction

Maarten Wijnants, Gustavo Rovelo Ruiz, Donald Degraen, Peter Quax, Kris Luyten, Wim Lamotte
Hasselt University – tUL – iMinds, Expertise Centre for Digital Media
Wetenschapspark 2, 3590 Diepenbeek, Belgium
firstname.lastname@uhasselt.be

3rd International Workshop on Interactive Content Consumption at TVX'15, June 3rd, 2015, Brussels, Belgium. Copyright is held by the author(s)/owner(s).

ABSTRACT
Physically attending an event or visiting a venue might not always be practically feasible (e.g., due to travel overhead). This article presents a system that enables users to remotely navigate in and interact with a real-world site using 360° video as the primary content format. To showcase the system, a demonstrator has been built that affords virtual exploration of a Belgian museum. The system blends contributions from multiple research disciplines into a holistic solution. Its constituting technological building blocks include 360° video, the Augmented Video Viewing (AVV) methodology that allows for Web-driven annotation of video content, a walk-up-and-use mid-air gesture tracking system that enables natural user interaction, Non-Linear Video (NLV) constructs to unlock semi-free visitation of the physical site, and the MPEG-DASH (Dynamic Adaptive Streaming over HTTP) standard for adaptive media delivery. The system's feature list is enumerated and a high-level discussion of its technological foundations is provided. The resulting solution is completely HTML5-compliant and therefore portable to a gamut of devices.

Author Keywords
Virtual exploration; HTML5; 360° video; Non-Linear Video; gestural interaction; MPEG-DASH; Augmented Video Viewing.

ACM Classification Keywords
C.2.5 Computer-Communication Networks: Local and Wide-Area Networks—Internet; H.5.1 Information Interfaces and Presentation: Multimedia Information Systems—Artificial, augmented, and virtual realities; H.5.2 Information Interfaces and Presentation: User Interfaces—Input devices and strategies, Interaction styles; H.5.4 Information Interfaces and Presentation: Hypertext/Hypermedia

INTRODUCTION AND MOTIVATION
There is an increasing tendency to disclose real-world events and spaces in the virtual realm, so that people who are hindered from attending in person can still participate from a remote location. Systems that allow for cyber presence at physically distant sites hold value for heterogeneous application domains, including tourism, entertainment and education.

Over the last few years, technological advances in divergent research disciplines have emerged that hold great promise to increase the persuasiveness of interactive video-driven virtual explorations. A first important example is situated in the video capturing field, in the form of 360° video authoring. 360° video cameras produce video footage with an omni-directional (i.e., cylindrical or spherical) Field of View. Compared to classical video, 360° video content unlocks options for increased user immersion and engagement. Secondly, the traditional keyboard/mouse interaction technique is witnessing increasing competition from more natural alternatives such as touch- or gesture-based schemes. In this context, an exploratory study performed by Bleumers et al. found mid-air gesture-based interaction to be the preferred input method for the consumption of the 360° video content format [1]. A final notable evolution is the increasing maturity of the Web as a platform for media dissemination and consumption. The HTML5 specification, for instance, covers all necessary tools to develop a 360° video player that affords typical Pan-Tilt-Zoom (PTZ) adaptation of the viewing angle inside the omni-directionally captured video scene.

SYSTEM OVERVIEW AND USE CASE DESCRIPTION
The work described in this article can best be summarized as offering an interactive multimedia content consumption experience akin to Google Street View, yet relying on 360° video instead of static imagery, and at the same time offering advanced interaction opportunities that go beyond basic "on rails" virtual navigation. The specific use case focused on in this manuscript is the virtual visitation of a particular Belgian museum. Users can move along predefined paths that have been video captured in 360 degrees. When reaching the end of such a path, users are offered the choice between a number of alternative contiguous traveling directions, very much like choosing a direction at a crossroad. As such, an NLV scenario arises in which the user is granted a considerable amount of freedom to tour the museum at their personal discretion. While navigating along the predetermined paths, users can play/pause the video sequence, dynamically change their viewing direction, and perform zoom operations.

The use case furthermore includes a gamification component in the form of simple "treasure hunting" gameplay. In particular, users are encouraged to look for salient museum items that have been transparently annotated to increase user engagement with the content. Finally, both mouse-based and gestural interaction are supported by the use case demonstrator. The two control interfaces are expressively identical, in the sense that they grant access to the exact same set of functionality (i.e., direct manipulation of the user's viewport into the 360° video content, making navigational decisions and performing gamification interaction through pointing and selection, and video playback control).

IMPLEMENTATION
Except for its gesture-related functionality, the demonstrator has been realized exclusively using platform-independent Web standards. The typical execution environment of the player component of the demonstrator is therefore a Web browser.
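The crossroad-style navigation choices described above map naturally onto a directed graph whose nodes are the video-captured paths. A minimal JavaScript sketch of this idea follows; all museum path names are invented for illustration and do not come from the paper.

```javascript
// Hypothetical NLV path graph: each key is a video-captured path, and
// its value lists the follow-up paths offered to the user as traveling
// directions when playback of that path is about to end.
const nlvGraph = {
  "entrance-hall": ["egyptian-wing", "modern-art-wing"],
  "egyptian-wing": ["sculpture-garden", "entrance-hall"],
  "modern-art-wing": ["sculpture-garden"],
  "sculpture-garden": ["entrance-hall"],
};

// Returns the alternative contiguous traveling directions to present
// as navigation choices at the end of `currentPath`.
function followUpPaths(currentPath) {
  return nlvGraph[currentPath] || [];
}
```

A player built along these lines would call `followUpPaths` shortly before the current clip ends, both to render the choice of directions and to decide which media to pre-fetch.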
The involved media content was recorded using an omni-directional sensor setup consisting of 7 GoPro Hero3+ Black cameras mounted in a 360Heros rig (http://www.360heros.com/). The resulting video material was temporally segmented according to the physical layout of the museum in order to yield individual clips for each of the traversable paths. The collection of paths (and their mutual relationships) is encoded as a directed graph. This graph dictates the branching options in the NLV playback.

Media streaming is implemented by means of MPEG-DASH. The separate video clips from the content authoring phase were each transcoded into multiple qualities, temporally split into consecutive media segments of identical duration (e.g., 2 seconds), and described by means of an MPD (Media Presentation Description). The resulting content was published by hosting it on an off-the-shelf HTTP server. The W3C Media Source Extensions specification is exploited to allow for the HTML5-powered decoding and rendering of media segments that are downloaded in an adaptive fashion using JavaScript code. While the playback of a path is active, the initial media segments that pertain to each of the potential follow-up routes (as derived from the NLV graph representation) are pre-fetched from the HTTP server. The total number of media segments to pre-fetch is dictated by the corresponding path's minBufferTime MPD attribute. By making initial media data locally available ahead of time, the startup delay of the selected follow-up path is minimized.

The gesture set that was defined for the demonstrator consists of composite gestures, in the sense that they involve performing a sequence of discrete, gradually refining postures. As such, it becomes feasible to organize the available gestures in a tree-like topology, where intermediate layers represent necessary steps towards reaching a leaf node, at which point the gesture (and its corresponding action) is actually actuated. It also allows gesture clustering and organization on the basis of their respective sequence of encompassed postures. Two gestures whose posture series are identical up to some intermediate level share a branch in the tree up to that level and only then diverge topologically.

The gestural interface is implemented by means of a mid-air gesture recognizer (which currently relies on a Kinect 2.0 for skeleton tracking purposes). It adheres to a walk-up-and-use design, which implies that it provides supportive measures that empower users to leverage the system without requiring training. These supportive measures take the form of a hierarchical gesture guidance system that exploits the tree-like organization of the gesture set to visually walk the user through the subsequent steps needed to perform a particular gesture (see Figure 1).

Figure 1. Gesture-based interaction with the demonstrator.

To encode and present the navigation options at NLV decision points and to add interactivity to the treasure hunt objects that appear in the video footage, the demonstrator resorts to the AVV methodology [2]. This methodology (and its Web-compliant implementation) is intended to transform video consumption from a passive into a more lean-forward type of experience by providing the ability to superimpose (potentially interactive) overlays on top of the media content. The navigation options are represented as arrows indicating the direction of potential follow-up paths; their visualization is toggled when the playback of the current path is about to end. The treasure hunt objects, on the other hand, are (invisibly) annotated by means of a transparent overlay. When the user points to such an object, the visual style of the associated AVV annotation is transformed on the fly (through CSS operations) into a semi-transparent one. If the item is subsequently selected, an informative AVV-managed call-out widget is visualized. Finally, the AVV methodology is also exploited to present visual feedback of the user's current pointing location. This is realized by dynamically instantiating an overlay (carrying a green hand icon) as soon as the user enters pointing mode and by continuously updating its coordinates as the pointing operation is being performed. When pointing offscreen (only possible with the gestural interface), the hand icon is clamped to the nearest on-screen position and turns red. Figure 2 illustrates the three applications of the AVV approach in the demonstrator.

Figure 2. Three applications of the AVV methodology in the demonstrator.

ACKNOWLEDGMENTS
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, ICoSOLE ("Immersive Coverage of Spatially Outspread Live Events", http://www.icosole.eu).

REFERENCES
1. Bleumers, L., Van den Broeck, W., Lievens, B., and Pierson, J. Seeing the Bigger Picture: A User Perspective on 360° TV. In Proc. EuroITV 2012, ACM (2012), 115–124.
2. Wijnants, M., Leën, J., Quax, P., and Lamotte, W. Augmented Video Viewing: Transforming Video Consumption into an Active Experience. In Proc. MMSys 2014, ACM (2014), 164–167.
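The tree-like organization of composite gestures described above, in which gestures sharing an initial posture sequence share a branch, can be sketched as a prefix tree over posture sequences. The following minimal JavaScript illustration uses invented posture and action names; it is not the demonstrator's actual recognizer.

```javascript
// Builds a prefix tree from composite gestures. Each gesture is a
// sequence of discrete postures plus the action actuated at its leaf.
function buildGestureTree(gestures) {
  const root = { children: {} };
  for (const { postures, action } of gestures) {
    let node = root;
    for (const posture of postures) {
      node.children[posture] = node.children[posture] || { children: {} };
      node = node.children[posture];
    }
    node.action = action; // leaf node: the gesture is actuated here
  }
  return root;
}

// Follows an observed posture sequence through the tree; returns the
// actuated action, or null while the gesture is still being refined
// (or when the sequence matches no known gesture).
function recognize(tree, observedPostures) {
  let node = tree;
  for (const posture of observedPostures) {
    node = node.children[posture];
    if (!node) return null;
  }
  return node.action || null;
}

// Two gestures sharing the "raise-hand" prefix diverge only at their
// second posture, mirroring the shared-branch property in the text.
const tree = buildGestureTree([
  { postures: ["raise-hand", "swipe-left"], action: "previous-path" },
  { postures: ["raise-hand", "swipe-right"], action: "next-path" },
  { postures: ["point"], action: "enter-pointing-mode" },
]);
```

A hierarchical guidance overlay, like the one the demonstrator describes, could traverse this same tree to show the user which postures remain before a leaf (and hence an action) is reached.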