Designing Multimodal Interactive Systems Using EyesWeb XMI

Gualtiero Volpe, Paolo Alborno, Antonio Camurri, Paolo Coletta, Simone Ghisio, Maurizio Mancini, Alberto Massari, Radoslaw Niewiadomski, Stefano Piana, Roberto Sagoleo
University of Genova, DIBRIS, Genova, Italy
gualtiero.volpe@unige.it, paoloalborno@gmail.com, antonio.camurri@unige.it, paolo.coletta@unige.it, simoneghisio@gmail.com, maurizio.mancini@unige.it, alby@infomus.org, radoslaw.niewiadomski@dibris.unige.it, stefano.piana@dist.unige.it, sax@infomus.org

Abstract
This paper introduces the EyesWeb XMI platform (for eXtended Multimodal Interaction) as a tool for fast prototyping of multimodal systems, including the interconnection of multiple smart devices, e.g., smartphones. EyesWeb is endowed with a visual programming language enabling users to compose modules into applications. Modules are collected in several libraries and include support for many input devices (e.g., video, audio, motion capture, accelerometers, and physiological sensors), output devices (e.g., video, audio, 2D and 3D graphics), and synchronized multimodal data processing. Specific libraries are devoted to real-time analysis of nonverbal expressive motor and social behavior. The EyesWeb platform encompasses further tools, such as EyesWeb Mobile, supporting the development of customized Graphical User Interfaces for specific classes of users. The paper reviews the EyesWeb platform and its components, starting from its historical origins, and with a particular focus on the Human-Computer Interaction aspects.

Author Keywords
Multimodal interactive systems; visual programming languages; EyesWeb XMI

ACM Classification Keywords
H.5.2 [Information interfaces and presentation (HCI)]: User interfaces

Copyright is held by the author/owner(s). AVI, June 07–10, 2016, Bari, Italy.

Introduction
Summer 1999, opening of the Salzburg Festival, Austria: in the music theatre opera Cronaca del Luogo by Italian composer Luciano Berio, a major singer (David Moss) plays a schizophrenic character, at times appearing wise and calm, at other times appearing crazy, with nervous and jerky movements. Some of his movement qualities are automatically extracted by using sensors embedded in his clothes and a flashing infrared light on his helmet, synchronized with video cameras positioned above the stage. This information is used to morph the singer's voice from a profound (wise) to a harsh, sharp (crazy) timbre. The impact with such concrete real-world applications of multimodal analysis and mapping was of paramount importance in shaping the requirements for the first publicly available version of EyesWeb [1].

Since then, EyesWeb was reworked, improved, and extended over the years and went through five major versions, always remaining available for free¹. Nowadays, it is employed in various application domains, going beyond the original area of computer music and performing arts, and including, for example, active experience of cultural heritage, exergaming, education and technology-enhanced learning, and therapy and rehabilitation.

¹ http://www.casapaganini.org. The current release is EyesWeb 5.6.0.0.
This paper is organized as follows: the next section presents some related work, i.e., other modular platforms endowed with a visual programming language, with particular reference to multimedia and multimodal systems; then, the major components of the EyesWeb platform are introduced; finally, the different classes of users for EyesWeb and the reasons that make it suitable for fast prototyping of applications including the interconnection of smart objects are discussed under an HCI perspective.

In particular, the need for fast prototyping tools made us leave the concept of EyesWeb as a monolithic application, to be recompiled and rebuilt after any possible minor change, and move to a more flexible approach, which had already been adopted by other software platforms, both in the tradition of computer music programming languages and tools and in other domains such as simulation tools for system engineering. EyesWeb was thus conceived as a modular software platform, where a user can assemble single modules into an application by means of a visual programming language. As such, EyesWeb supports its users in designing and developing interactive multimodal systems in several ways, for example (i) by providing built-in input/output capabilities for a broad range of sensor and capture systems, (ii) by making it easy to define and customize how data is processed and feedback is generated, and (iii) by offering tools for creating a wide palette of interfaces for different classes of users.

Related work
Whereas general-purpose tools such as, for example, Mathworks' Simulink have existed for a long time, platforms especially devoted to (possibly real-time) analysis of multimodal signals are far less common. Max [9] is a platform and a visual programming language for music and multimedia, originally conceived by Miller Puckette at IRCAM, Paris, and nowadays developed and maintained by Cycling '74. Born for sound and music processing in interactive computer music, it is also endowed with packages for real-time video, 3D graphics, and matrix processing. Pd (Pure Data) [10] is similar in scope and design to Max. It also includes a visual programming language and is intended to support the development of interactive sound and music computing applications. The addition of GEM (Graphics Environment for Multimedia) enables real-time generation and processing of video, OpenGL graphics, images, and so on. Moreover, Pd is natively designed to enable live collaboration across networks or the Internet. vvvv [13] is a hybrid graphical/textual programming environment for easy prototyping and development. It has a special focus on real-time video synthesis and is designed to facilitate the handling of large media environments with physical interfaces, real-time motion graphics, audio, and video that can interact with many users simultaneously. Isadora [6] is an interactive media presentation tool created by composer and media artist Mark Coniglio. It mainly includes video generation, processing, and effects, and is intended to support artists in developing interactive performances. In the same field of performing arts, Eyecon [4] aims at facilitating interactive performances and installations in which the motion of human bodies is used to trigger or control various other media, e.g., music, sounds, photos, films, lighting changes, and so on. The Social Signal Interpretation framework (SSI) [14] offers tools to record, analyze, and recognize human behavior in real-time, such as gestures, mimics, head nods, and emotional speech. Following a patch-based design, pipelines are set up from autonomic components and allow the parallel and synchronized processing of sensor data from multiple input devices.

Whereas Max and Pd are especially devoted to audio processing, and vvvv to video processing, EyesWeb has a special focus on higher-level nonverbal communication, i.e., EyesWeb provides modules to automatically compute features describing the expressive, emotional, and affective content that multimodal signals convey, with particular reference to full-body movement and gesture. EyesWeb also has a somewhat wider scope with respect to Isadora, which especially addresses interactive artistic performances, and to SSI, which is particularly suited for the analysis of social interaction.

Finally, with respect to its previous versions, the current version of EyesWeb XMI encompasses enhanced synchronization mechanisms, improved management and analysis of time-series (e.g., with novel modules for the analysis of synchronization and coupling), extended scripting capabilities (e.g., a module whose behavior can be controlled through Python scripts), and a reorganization of the EyesWeb libraries, including newly supported I/O devices (e.g., Kinect V2) and modules for expressive gesture processing.
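As a purely illustrative sketch (this is not EyesWeb's actual scripting API), a script hosted by such a Python-controlled module might encapsulate per-sample logic along the following lines, assuming the hosting block invokes a process() callback for each incoming value:

```python
# Hypothetical script for a Python-scripting block -- the callback name
# and calling convention are assumptions, not the real EyesWeb API.
state = {"peak": 0.0}

def process(sample):
    """Called by the hosting block for each incoming scalar value
    (e.g., a motion-energy stream); returns the running peak."""
    state["peak"] = max(state["peak"], abs(sample))
    return state["peak"]
```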
EyesWeb kernel, tools, libraries, and devices
Figure 1 shows the overall architecture of EyesWeb (the current version is named XMI, for eXtended Multimodal Interaction). The EyesWeb Graphic Development Environment tool (GDE), shown in Figure 2, manages the interaction with the user and supports the design of applications (patches). An EyesWeb patch consists of interconnected modules (blocks). A patch can be defined as a structured network of blocks that channels and manipulates a digital input dataflow, resulting in a desired output dataflow. Data manipulation can be done either automatically or through real-time interaction with a user.

Figure 1: The overall architecture of the EyesWeb XMI platform.

For example, to create a simple video processing patch, an EyesWeb developer would drag and drop an input video block (e.g., to capture the video stream from a webcam), a processing block (e.g., a video effect), and an output block (e.g., a video display). The developer would then connect the blocks, and she may also include some interaction with the user, e.g., by adding a block that computes the energy of the user's movement and connecting it to the video effect block to control the amount of effect to be applied.
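The textual equivalent of that three-block patch can help fix ideas. The sketch below reproduces the webcam-effect-display chain, including an energy block modulating the effect, using OpenCV in Python rather than EyesWeb blocks (a loose analogy, not generated EyesWeb code):

```python
import cv2

capture = cv2.VideoCapture(0)      # input block: webcam
prev = None
while True:
    ok, frame = capture.read()     # one frame of the input dataflow
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # "energy" block: mean absolute frame difference as a crude motion measure
    energy = cv2.absdiff(gray, prev).mean() / 255.0 if prev is not None else 0.0
    prev = gray
    # "effect" block: blur strength driven by the motion energy
    k = 1 + 2 * int(20 * energy)   # kernel size must be odd; grows with motion
    cv2.imshow("output", cv2.GaussianBlur(frame, (k, k), 0))  # output block
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
        break
capture.release()
cv2.destroyAllWindows()
```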
For example, to create a simple video processing patch, an EyesWeb developer The EyesWeb libraries would drag and drop an input video block (e.g., to capture EyesWeb libraries include modules for image and video the video stream from a webcam), a processing block (e.g., processing, for audio processing, for mathematical opera- a video effect), and an output block (e.g., a video display). tions on scalar and matrices, for string processing, for time- The developer would then connect the blocks and she may series analysis, and for machine learning (e.g., SVMs, clus- also include some interaction with the user, e.g., by adding tering, neural networks, Kohonen maps, and so on). They a block that computes the energy of the user’s movement also implement basic data structures (e.g., lists, queues, and connecting it to the video effect block to control the and labeled sets) and enable to connect with the operat- amount of effect to be applied. ing systems (e.g., for launching processes or operating on the filesystem). Particularly relevant in EyesWeb are the The build-up of an EyesWeb patch in many ways resem- libraries for real-time analysis of nonverbal full-body move- bles that of an object oriented program (a network of clus- ment and expressive gesture of single and multiple users ters, classes, and objects). A patch could therefore be in- [2], and the libraries for the analysis of nonverbal social in- terpreted as “a small program or application”. The sample teraction within groups [12]. The former include modules for patch displayed in Figure 2 was built for performing syn- 53 computing features describing human full-body movement • EyesWeb Register Module: a tool allowing to add both at a local temporal granularity (e.g., kinematics, en- new blocks (provided in a dll) to an existing Eye- ergy, postural contraction and symmetry, smoothness, and sWeb installation, and to use them to make and run so on) and at the level of entire movement units (e.g., di- patches. The new blocks extend the platform and rectness, lightness, suddenness, impulsivity, equilibrium, may be developed and distributed by third parties. fluidity, coordination, and so on). The latter implements techniques that have been employed for analyzing features Supported devices of the social behavior of a group such as physical and affec- EyesWeb supports a broad range of input and output de- tive entrainment and leadership. Techniques include, e.g., vices, which are managed by a dedicated layer (devices) Recurrence Quantification Analysis [8], Event Synchroniza- located in between the kernel and the operating system. In tion [11], SPIKE Synchronization [7], and nonlinear asym- addition to usual computer peripherals (mouse, keyboard, metric measures of interdependence between time-series. joystick and game controllers, and so on), input devices include audio (from low-cost mother-board-integrated to Available Tools professional audio cards), video (from low-cost webcams Referring to Figure 1, in addition to EyesWeb GDE that was to professional video cameras), motion capture systems (to already described above, further available tools are: get with high accuracy 3D coordinates of markers in envi- ronments endowed with a fairly controlled setup), RGB-D • EyesWeb Mobile: an external tool supporting the de- sensors (e.g., Kinect for X-Box One, also known as Kinect sign of Graphic User Interfaces linked to an EyesWeb V2, extracting 2D and 3D coordinates of relevant body patch. 
The EyesWeb libraries
The EyesWeb libraries include modules for image and video processing, audio processing, mathematical operations on scalars and matrices, string processing, time-series analysis, and machine learning (e.g., SVMs, clustering, neural networks, Kohonen maps, and so on). They also implement basic data structures (e.g., lists, queues, and labeled sets) and provide access to the operating system (e.g., for launching processes or operating on the filesystem). Particularly relevant in EyesWeb are the libraries for real-time analysis of nonverbal full-body movement and expressive gesture of single and multiple users [2], and the libraries for the analysis of nonverbal social interaction within groups [12]. The former include modules for computing features describing human full-body movement both at a local temporal granularity (e.g., kinematics, energy, postural contraction and symmetry, smoothness, and so on) and at the level of entire movement units (e.g., directness, lightness, suddenness, impulsivity, equilibrium, fluidity, coordination, and so on). The latter implement techniques that have been employed for analyzing features of the social behavior of a group, such as physical and affective entrainment and leadership. Techniques include, e.g., Recurrence Quantification Analysis [8], Event Synchronization [11], SPIKE Synchronization [7], and nonlinear asymmetric measures of interdependence between time-series.
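To make one of these techniques concrete, the following is a minimal Python sketch of Event Synchronization in the spirit of [11], using a fixed time window tau for simplicity (the original method also defines an adaptive, local window); the event times are made-up data:

```python
"""Event Synchronization (cf. Quiroga et al. [11]) -- a minimal sketch."""
from math import sqrt

def count_coincidences(x, y, tau):
    """Count events in x occurring within tau after an event in y.
    Simultaneous events contribute 1/2 (shared between both directions)."""
    c = 0.0
    for tx in x:
        for ty in y:
            if tx == ty:
                c += 0.5
            elif 0 < tx - ty <= tau:
                c += 1.0
    return c

def event_synchronization(x, y, tau):
    """Return Q in [0, 1]: 1 means fully synchronized event trains."""
    if not x or not y:
        return 0.0
    return (count_coincidences(x, y, tau) +
            count_coincidences(y, x, tau)) / sqrt(len(x) * len(y))

# Example: two dancers' gesture-onset times in seconds (made-up data)
dancer_a = [0.5, 1.2, 2.0, 3.1, 4.0]
dancer_b = [0.6, 1.25, 2.4, 3.15, 4.6]
print(event_synchronization(dancer_a, dancer_b, tau=0.15))  # 0.6
```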
Available Tools
Referring to Figure 1, in addition to the EyesWeb GDE, which was already described above, the available tools are:

• EyesWeb Mobile: an external tool supporting the design of Graphical User Interfaces linked to an EyesWeb patch. EyesWeb Mobile is composed of a designer and a runtime component. The former is used to design the visual layout of the interfaces; the latter runs together with an EyesWeb patch and communicates over the network with the EyesWeb kernel to receive results and control the patch remotely.

• EyesWeb Console: a tool for running patches from the command line, to reduce the GDE overhead. On Windows, it runs either as a standard application or as a Windows service; additionally, it can be used to run patches on Linux and OS X.

• EyesWeb Query: a tool to automatically generate documentation from a specific EyesWeb installation, including the icon, description, and input and output datatypes of each block. Documentation can be generated in LaTeX (PDF), text, and MySQL.

• EyesWeb Register Module: a tool for adding new blocks (provided in a DLL) to an existing EyesWeb installation and for using them to make and run patches. The new blocks extend the platform and may be developed and distributed by third parties.

Supported devices
EyesWeb supports a broad range of input and output devices, which are managed by a dedicated layer (devices) located between the kernel and the operating system. In addition to usual computer peripherals (mouse, keyboard, joystick and game controllers, and so on), input devices include audio (from low-cost motherboard-integrated to professional audio cards), video (from low-cost webcams to professional video cameras), motion capture systems (providing highly accurate 3D coordinates of markers in environments with a fairly controlled setup), RGB-D sensors (e.g., Kinect for Xbox One, also known as Kinect V2, extracting 2D and 3D coordinates of relevant body joints and capturing the RGB image, grayscale depth image, and infrared image of the scene), Leap Motion (a sensor device capturing hand and finger movement), accelerometers (e.g., those onboard Android smartphones, connected via network by means of the Mobiles to EyesWeb app, see Figure 3, or X-OSC sensors), Arduino, Nintendo Wiimote, RFID (Radio-Frequency Identification) devices (e.g., used to detect the presence of a user in a specific area), and biometric sensors (e.g., respiration, heart rate, skin conductivity, and so on). Output includes audio, video, 2D and 3D graphics, and possible control signals to actuator devices (e.g., haptic devices, robots). Moreover, EyesWeb implements standard networking protocols such as, for example, TCP, UDP, OSC (Open Sound Control), and ActiveMQ. In this way, it can receive data from and send data to any device endowed with network communication capabilities, including smart objects.

Figure 3: A view of the Mobiles to EyesWeb Android application on the Google Play Store.
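For instance, a smart device can stream sensor data to a running patch over OSC. Below is a minimal sender sketch using the third-party python-osc package; the host, port, and the /accel address pattern are placeholders, since the receiving patch defines what it actually listens for:

```python
from pythonosc.udp_client import SimpleUDPClient

# Placeholders: use the address/port of the machine running the EyesWeb patch
client = SimpleUDPClient("192.168.1.10", 8000)

# Send one accelerometer sample; "/accel" is a made-up OSC address pattern --
# the receiving patch defines the pattern it listens for.
client.send_message("/accel", [0.02, -0.98, 0.11])
```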
Classes of users and interfaces
Users of EyesWeb can be ideally grouped into several classes. EyesWeb end users usually do not directly deal with the platform, but rather experience the multimodal interactive systems that are implemented using the platform. Their interface is therefore a natural interface they can operate by means, e.g., of their expressive movement and gesture to act upon multimedia content. In such a way, end users can also interact with smart devices they may wear or hold (e.g., smartphones). Smart devices can either be used to directly operate on content, or the data they collect can be presented back to end users, e.g., using data sonification [5] and visualization technologies.

EyesWeb developers make applications (patches) using the EyesWeb GDE and its visual programming language. This implements all the basic constructs of programming languages, such as sequences (a chain of interconnected blocks), conditional instructions (e.g., by means of switch blocks that direct the flow of data to a specific part of a patch when a given condition is matched), iterative instructions (by means of a specific mechanism that allows executing a given sub-patch repetitively), and subprograms (implemented as sub-patches). Because of the visual programming paradigm, EyesWeb developers do not need to be computer scientists or expert programmers. In our experience, EyesWeb patches were developed by artists, technological staff of artists (e.g., sound technicians), designers, content creators, students in performing arts and digital humanities, and so on. Still, some skill in the usage of computer tools and, especially, in algorithmic thinking is required. EyesWeb developers can exploit EyesWeb as a tool for fast prototyping of applications for smart objects: EyesWeb can receive data from such objects by means of its input devices, and the task of the developer is to design and implement in a patch the control flow of the application.

Supervisors of EyesWeb patches are users who supervise and control the execution of an EyesWeb application. They can both set parameters to customize the execution of the patch before running it and act upon the patch (e.g., by changing the value of some parameters) while the patch is running. Consider, for example, a teacher who customizes an educational serious game for her pupils and operates on the game as a child is playing with it, or a therapist who sets, e.g., the target and difficulty level of an exergame for a patient. Supervisors of EyesWeb patches do not need any particular skill in computer programming; indeed, they are usually a special kind of end user, with a specific expertise in the given application area. The EyesWeb Mobile tool endows them with traditional Graphical User Interfaces they can use for their task of customizing and controlling the execution of a patch. In such a way, they do not need to go into the details of the EyesWeb GDE, and they work with an interaction paradigm that is more familiar to them. Moreover, the patch is protected from possible unwanted modifications. The EyesWeb Mobile interfaces can also work on mobile devices (e.g., tablets) to facilitate operations when the operator cannot stay at the computer (e.g., a therapist who needs to participate in the therapy session). This feature also makes EyesWeb Mobile suitable for the remote configuration of smart object applications. Figure 4 shows an EyesWeb Mobile interface developed for enabling a teacher to customize a serious game for children.

Figure 4: An EyesWeb Mobile interface developed for enabling a teacher to customize a serious game for children.

EyesWeb programmers develop new software modules for the EyesWeb platform. They need to be skilled C++ programmers and are endowed with the EyesWeb SDK, which enables them to extend the platform with third-party modules. In particular, the EyesWeb SDK enables including in the platform new modules for interfacing smart objects that, e.g., do not communicate through standard networking protocols or are not yet supported by the platform.

Conclusion
EyesWeb is nowadays employed by thousands of users spanning several application domains. It was adopted in both research and industrial projects by research centers, universities, and companies. A one-week tutorial, the EyesWeb Week, is organized every two years at our research center. In our experience at Casa Paganini - InfoMus, EyesWeb was used in research projects that combine scientific research in information and communications technology (ICT) with artistic and humanistic research [3]. In this context, the platform provided the technological ground for artistic performances (e.g., in Allegoria dell'opinione verbale by R. Doati, Medea by A. Guarnieri, and Invisible Line by A. Cera, to name just some of them), for multimodal interactive systems supporting new ways of experiencing art and cultural heritage (e.g., the Museum of Bali in Fano, Italy; the Museum of La Roche d'Oëtre in Normandy, France; the Enrico Caruso Museum near Florence, Italy), for serious games in educational settings (e.g., The Potter and BeSound), and for exergames for therapy and rehabilitation (e.g., in an ongoing collaboration with the children's hospital Giannina Gaslini in Genova, Italy). EyesWeb is currently being improved and extended in the framework of the EU-H2020-ICT DANCE Project, investigating how sound and music can express, represent, and analyze the affective and relational qualities of body movement.

The need to support such a broad range of application domains required a trade-off between implementing general-purpose mechanisms and exploiting domain-specific knowledge. On the one hand, support for generality sometimes entails somewhat reduced learnability and usability of the platform for less-skilled EyesWeb developers, due to the increased complexity of the implemented mechanisms. On the other hand, some EyesWeb modules developed on purpose for specific application scenarios have a limited scope and are difficult to reuse. Future directions and open challenges include increased cross-platform interoperability and a tighter integration with cloud services and storage technologies.

Acknowledgements
This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 645553 (H2020-ICT Project DANCE, http://dance.dibris.unige.it). DANCE investigates how affective and relational qualities of body movement can be expressed, represented, and analyzed by the auditory channel.
REFERENCES
1. Antonio Camurri, Shuji Hashimoto, Matteo Ricchetti, Andrea Ricci, Kenji Suzuki, Riccardo Trocca, and Gualtiero Volpe. 2000. EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems. Computer Music Journal 24, 1 (April 2000), 57–69.
2. Antonio Camurri, Barbara Mazzarino, and Gualtiero Volpe. 2004. Analysis of expressive gesture: The EyesWeb expressive gesture processing library. In Gesture-based communication in human-computer interaction. Springer, 460–467.
3. Antonio Camurri and Gualtiero Volpe. 2016. The Intersection of Art and Technology. IEEE MultiMedia 23, 1 (Jan 2016), 10–17.
4. Eyecon. 2008. http://eyecon.palindrome.de/. Accessed: 2016-03-25.
5. Thomas Hermann. 2008. Taxonomy and Definitions for Sonification and Auditory Display. In Proceedings of the 14th International Conference on Auditory Display (ICAD 2008), Patrick Susini and Olivier Warusfel (Eds.). IRCAM.
6. Isadora. 2002. http://troikatronix.com/. Accessed: 2016-03-25.
7. Thomas Kreuz, Daniel Chicharro, Conor Houghton, Ralph G. Andrzejak, and Florian Mormann. 2012. Monitoring spike train synchrony. Journal of Neurophysiology (2012).
8. Norbert Marwan, M. Carmen Romano, Marco Thiel, and Jürgen Kurths. 2007. Recurrence plots for the analysis of complex systems. Physics Reports 438, 5–6 (2007), 237–329.
9. Max. 1988. http://cycling74.com/products/max/. Accessed: 2016-03-25.
10. Miller Puckette. 1996. Pure Data. In Proceedings of the International Computer Music Conference. 224–227.
11. Rodrigo Quian Quiroga, Thomas Kreuz, and Peter Grassberger. 2002. Event synchronization: A simple and fast method to measure synchronicity and time delay patterns. Physical Review E 66, 4 (Oct 2002), 041904.
12. Giovanna Varni, Gualtiero Volpe, and Antonio Camurri. 2010. A system for real-time multimodal analysis of nonverbal affective social interaction in user-centric media. IEEE Transactions on Multimedia 12, 6 (2010), 576–590.
13. vvvv. 1998. http://vvvv.org/. Accessed: 2016-03-25.
14. Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André. 2013. The Social Signal Interpretation (SSI) Framework: Multimodal Signal Processing and Recognition in Real-time. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 831–834.