Designing Multimodal Interactive Systems Using EyesWeb XMI

Gualtiero Volpe, Paolo Alborno, Antonio Camurri, Paolo Coletta, Simone Ghisio, Maurizio Mancini, Alberto Massari, Radoslaw Niewiadomski, Stefano Piana, Roberto Sagoleo
University of Genova, DIBRIS, Genova, Italy
gualtiero.volpe@unige.it, paoloalborno@gmail.com, antonio.camurri@unige.it, paolo.coletta@unige.it, simoneghisio@gmail.com, maurizio.mancini@unige.it, alby@infomus.org, radoslaw.niewiadomski@dibris.unige.it, stefano.piana@dist.unige.it, sax@infomus.org

Abstract
This paper introduces the EyesWeb XMI platform (for eXtended Multimodal Interaction) as a tool for fast prototyping of multimodal systems, including the interconnection of multiple smart devices, e.g., smartphones. EyesWeb is endowed with a visual programming language enabling users to compose modules into applications. Modules are collected in several libraries and include support for many input devices (e.g., video, audio, motion capture, accelerometers, and physiological sensors), output devices (e.g., video, audio, 2D and 3D graphics), and synchronized multimodal data processing. Specific libraries are devoted to real-time analysis of nonverbal expressive motor and social behavior. The EyesWeb platform encompasses further tools, such as EyesWeb Mobile, supporting the development of customized Graphical User Interfaces for specific classes of users. The paper reviews the EyesWeb platform and its components, starting from its historical origins, and with a particular focus on the Human-Computer Interaction aspects.

Author Keywords
Multimodal interactive systems; visual programming languages; EyesWeb XMI

ACM Classification Keywords
H.5.2 [Information interfaces and presentation (HCI)]: User interfaces

Copyright is held by the author/owner(s). AVI, June 07–10, 2016, Bari, Italy.

Introduction
Summer 1999, opening of the Salzburg Festival, Austria: in the music theatre opera Cronaca del Luogo by Italian composer Luciano Berio, a major singer (David Moss) plays a schizophrenic character, at times appearing wise and calm, at other times appearing crazy, with nervous and jerky movements. Some of his movement qualities are automatically extracted by using sensors embedded in his clothes and a flashing infrared light on his helmet, synchronized with video cameras positioned above the stage. This information is used to morph the singer's voice from a profound (wise) to a harsh, sharp (crazy) timbre. The impact with such concrete real-world applications of multimodal analysis and mapping was of paramount importance in shaping the requirements for the first publicly available version of EyesWeb [1].

Since then, EyesWeb was reworked, improved, and extended over the years and went through five major versions, always remaining available for free¹. Nowadays, it is employed in various application domains, going beyond the original area of computer music and performing arts, and including, for example, active experience of cultural heritage, exergaming, education and technology-enhanced learning, and therapy and rehabilitation.

¹ http://www.casapaganini.org. The current release is EyesWeb 5.6.0.0.
This paper is organized as follows: the next section presents some related work, i.e., other modular platforms endowed with a visual programming language, with particular reference to multimedia and multimodal systems; then, the major components of the EyesWeb platform are introduced; finally, the different classes of users for EyesWeb and the reasons that make it suitable for fast prototyping of applications including the interconnection of smart objects are discussed under an HCI perspective.

In particular, the need for fast prototyping tools made us leave the concept of EyesWeb as a monolithic application, to be recompiled and rebuilt after any possible minor change, and move to a more flexible approach, which had already been adopted by other software platforms, both in the tradition of computer music programming languages and tools and in other domains such as simulation tools for system engineering. EyesWeb was thus conceived as a modular software platform, where a user can assemble single modules into an application by means of a visual programming language. As such, EyesWeb supports its users in designing and developing interactive multimodal systems in several ways, for example (i) by providing built-in input/output capabilities for a broad range of sensor and capture systems, (ii) by making it easy to define and customize how data is processed and feedback is generated, and (iii) by offering tools for creating a wide palette of interfaces for different classes of users.

Related work
Whereas general-purpose tools such as, for example, Mathworks' Simulink have existed for a long time, platforms especially devoted to (possibly real-time) analysis of multimodal signals are far less common. Max [9] is a platform and a visual programming language for music and multimedia, originally conceived by Miller Puckette at IRCAM, Paris, and nowadays developed and maintained by Cycling '74. Born for sound and music processing in interactive computer music, it is also endowed with packages for real-time video, 3D graphics, and matrix processing. Pd (Pure Data) [10] is similar in scope and design to Max. It also includes a visual programming language and is intended to support the development of interactive sound and music computing applications. The addition of GEM (Graphics Environment for Multimedia) enables real-time generation and processing of video, OpenGL graphics, images, and so on. Moreover, Pd is natively designed to enable live collaboration across networks or the Internet. vvvv [13] is a hybrid graphical/textual programming environment for easy prototyping and development. It has a special focus on real-time video synthesis and is designed to facilitate the handling of large media environments with physical interfaces, real-time motion graphics, audio, and video that can interact with many users simultaneously. Isadora [6] is an interactive media presentation tool created by composer and media artist Mark Coniglio. It mainly includes video generation, processing, and effects, and is intended to support artists in developing interactive performances. In the same field of performing arts, Eyecon [4] aims at facilitating interactive performances and installations in which the motion of human bodies is used to trigger or control various other media, e.g., music, sounds, photos, films, lighting changes, and so on. The Social Signal Interpretation framework (SSI) [14] offers tools to record, analyze, and recognize human behavior in real-time, such as gestures, mimics, head nods, and emotional speech. Following a patch-based design, pipelines are set up from autonomic components and allow the parallel and synchronized processing of sensor data from multiple input devices.

Whereas Max and Pd are especially devoted to audio processing, and vvvv to video processing, EyesWeb has a special focus on higher-level nonverbal communication, i.e., EyesWeb provides modules to automatically compute features describing the expressive, emotional, and affective content that multimodal signals convey, with particular reference to full-body movement and gesture. EyesWeb also has a somewhat wider scope with respect to Isadora, which especially addresses interactive artistic performances, and to SSI, which is particularly suited for the analysis of social interaction.

Finally, with respect to its previous versions, the current version of EyesWeb XMI encompasses enhanced synchronization mechanisms, improved management and analysis of time-series (e.g., with novel modules for the analysis of synchronization and coupling), extended scripting capabilities (e.g., a module whose behavior can be controlled through Python scripts), and a reorganization of the EyesWeb libraries, including newly supported I/O devices (e.g., Kinect V2) and modules for expressive gesture processing.
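As a purely illustrative sketch (this is not EyesWeb's actual scripting API), a script hosted by such a Python-controlled module might encapsulate per-sample logic along the following lines, assuming the hosting block invokes a process() callback for each incoming value:

```python
# Hypothetical script for a Python-scripting block -- the callback name
# and calling convention are assumptions, not the real EyesWeb API.
state = {"peak": 0.0}

def process(sample):
    """Called by the hosting block for each incoming scalar value
    (e.g., a motion-energy stream); returns the running peak."""
    state["peak"] = max(state["peak"], abs(sample))
    return state["peak"]
```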
EyesWeb kernel, tools, libraries, and devices
Figure 1 shows the overall architecture of EyesWeb (the current version is named XMI, for eXtended Multimodal Interaction). The EyesWeb Graphic Development Environment tool (GDE), shown in Figure 2, manages the interaction with the user and supports the design of applications (patches). An EyesWeb patch consists of interconnected modules (blocks). A patch can be defined as a structured network of blocks that channels and manipulates a digital input dataflow, resulting in a desired output dataflow. Data manipulation can be done either automatically or through real-time interaction with a user.

Figure 1: The overall architecture of the EyesWeb XMI platform.

For example, to create a simple video processing patch, an EyesWeb developer would drag and drop an input video block (e.g., to capture the video stream from a webcam), a processing block (e.g., a video effect), and an output block (e.g., a video display). The developer would then connect the blocks, and she may also include some interaction with the user, e.g., by adding a block that computes the energy of the user's movement and connecting it to the video effect block to control the amount of effect to be applied.
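The textual equivalent of that three-block patch can help fix ideas. The sketch below reproduces the webcam-effect-display chain, including an energy block modulating the effect, using OpenCV in Python rather than EyesWeb blocks (a loose analogy, not generated EyesWeb code):

```python
import cv2

capture = cv2.VideoCapture(0)      # input block: webcam
prev = None
while True:
    ok, frame = capture.read()     # one frame of the input dataflow
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # "energy" block: mean absolute frame difference as a crude motion measure
    energy = cv2.absdiff(gray, prev).mean() / 255.0 if prev is not None else 0.0
    prev = gray
    # "effect" block: blur strength driven by the motion energy
    k = 1 + 2 * int(20 * energy)   # kernel size must be odd; grows with motion
    cv2.imshow("output", cv2.GaussianBlur(frame, (k, k), 0))  # output block
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
        break
capture.release()
cv2.destroyAllWindows()
```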
For example, to create a simple video processing patch, an EyesWeb developer The EyesWeb libraries would drag and drop an input video block (e.g., to capture EyesWeb libraries include modules for image and video the video stream from a webcam), a processing block (e.g., processing, for audio processing, for mathematical opera- a video effect), and an output block (e.g., a video display). tions on scalar and matrices, for string processing, for time- The developer would then connect the blocks and she may series analysis, and for machine learning (e.g., SVMs, clus- also include some interaction with the user, e.g., by adding tering, neural networks, Kohonen maps, and so on). They a block that computes the energy of the user’s movement also implement basic data structures (e.g., lists, queues, and connecting it to the video effect block to control the and labeled sets) and enable to connect with the operat- amount of effect to be applied. ing systems (e.g., for launching processes or operating on the filesystem). Particularly relevant in EyesWeb are the The build-up of an EyesWeb patch in many ways resem- libraries for real-time analysis of nonverbal full-body move- bles that of an object oriented program (a network of clus- ment and expressive gesture of single and multiple users ters, classes, and objects). A patch could therefore be in- [2], and the libraries for the analysis of nonverbal social in- terpreted as “a small program or application”. The sample teraction within groups [12]. The former include modules for patch displayed in Figure 2 was built for performing syn- 53 computing features describing human full-body movement • EyesWeb Register Module: a tool allowing to add both at a local temporal granularity (e.g., kinematics, en- new blocks (provided in a dll) to an existing Eye- ergy, postural contraction and symmetry, smoothness, and sWeb installation, and to use them to make and run so on) and at the level of entire movement units (e.g., di- patches. The new blocks extend the platform and rectness, lightness, suddenness, impulsivity, equilibrium, may be developed and distributed by third parties. fluidity, coordination, and so on). The latter implements techniques that have been employed for analyzing features Supported devices of the social behavior of a group such as physical and affec- EyesWeb supports a broad range of input and output de- tive entrainment and leadership. Techniques include, e.g., vices, which are managed by a dedicated layer (devices) Recurrence Quantification Analysis [8], Event Synchroniza- located in between the kernel and the operating system. In tion [11], SPIKE Synchronization [7], and nonlinear asym- addition to usual computer peripherals (mouse, keyboard, metric measures of interdependence between time-series. joystick and game controllers, and so on), input devices include audio (from low-cost mother-board-integrated to Available Tools professional audio cards), video (from low-cost webcams Referring to Figure 1, in addition to EyesWeb GDE that was to professional video cameras), motion capture systems (to already described above, further available tools are: get with high accuracy 3D coordinates of markers in envi- ronments endowed with a fairly controlled setup), RGB-D • EyesWeb Mobile: an external tool supporting the de- sensors (e.g., Kinect for X-Box One, also known as Kinect sign of Graphic User Interfaces linked to an EyesWeb V2, extracting 2D and 3D coordinates of relevant body patch. 
The EyesWeb libraries
The EyesWeb libraries include modules for image and video processing, audio processing, mathematical operations on scalars and matrices, string processing, time-series analysis, and machine learning (e.g., SVMs, clustering, neural networks, Kohonen maps, and so on). They also implement basic data structures (e.g., lists, queues, and labeled sets) and provide access to the operating system (e.g., for launching processes or operating on the filesystem). Particularly relevant in EyesWeb are the libraries for real-time analysis of nonverbal full-body movement and expressive gesture of single and multiple users [2], and the libraries for the analysis of nonverbal social interaction within groups [12]. The former include modules for computing features describing human full-body movement both at a local temporal granularity (e.g., kinematics, energy, postural contraction and symmetry, smoothness, and so on) and at the level of entire movement units (e.g., directness, lightness, suddenness, impulsivity, equilibrium, fluidity, coordination, and so on). The latter implement techniques that have been employed for analyzing features of the social behavior of a group, such as physical and affective entrainment and leadership. Techniques include, e.g., Recurrence Quantification Analysis [8], Event Synchronization [11], SPIKE Synchronization [7], and nonlinear asymmetric measures of interdependence between time-series.
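To make one of these techniques concrete, the following is a minimal Python sketch of Event Synchronization in the spirit of [11], using a fixed time window tau for simplicity (the original method also defines an adaptive, local window); the event times are made-up data:

```python
"""Event Synchronization (cf. Quiroga et al. [11]) -- a minimal sketch."""
from math import sqrt

def count_coincidences(x, y, tau):
    """Count events in x occurring within tau after an event in y.
    Simultaneous events contribute 1/2 (shared between both directions)."""
    c = 0.0
    for tx in x:
        for ty in y:
            if tx == ty:
                c += 0.5
            elif 0 < tx - ty <= tau:
                c += 1.0
    return c

def event_synchronization(x, y, tau):
    """Return Q in [0, 1]: 1 means fully synchronized event trains."""
    if not x or not y:
        return 0.0
    return (count_coincidences(x, y, tau) +
            count_coincidences(y, x, tau)) / sqrt(len(x) * len(y))

# Example: two dancers' gesture-onset times in seconds (made-up data)
dancer_a = [0.5, 1.2, 2.0, 3.1, 4.0]
dancer_b = [0.6, 1.25, 2.4, 3.15, 4.6]
print(event_synchronization(dancer_a, dancer_b, tau=0.15))  # 0.6
```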
Available Tools
Referring to Figure 1, in addition to the EyesWeb GDE, which was already described above, the available tools are:

• EyesWeb Mobile: an external tool supporting the design of Graphical User Interfaces linked to an EyesWeb patch. EyesWeb Mobile is composed of a designer and a runtime component. The former is used to design the visual layout of the interfaces; the latter runs together with an EyesWeb patch and communicates over the network with the EyesWeb kernel to receive results and control the patch remotely.

• EyesWeb Console: a tool for running patches from the command line, to reduce the GDE overhead. On Windows, it runs either as a standard application or as a Windows service; additionally, it can be used to run patches on Linux and OS X.

• EyesWeb Query: a tool to automatically generate documentation from a specific EyesWeb installation, including the icon, description, and input and output datatypes of each block. Documentation can be generated in LaTeX (PDF), text, and MySQL.

• EyesWeb Register Module: a tool for adding new blocks (provided in a DLL) to an existing EyesWeb installation and for using them to make and run patches. The new blocks extend the platform and may be developed and distributed by third parties.

Supported devices
EyesWeb supports a broad range of input and output devices, which are managed by a dedicated layer (devices) located between the kernel and the operating system. In addition to usual computer peripherals (mouse, keyboard, joystick and game controllers, and so on), input devices include audio (from low-cost motherboard-integrated to professional audio cards), video (from low-cost webcams to professional video cameras), motion capture systems (providing highly accurate 3D coordinates of markers in environments with a fairly controlled setup), RGB-D sensors (e.g., Kinect for Xbox One, also known as Kinect V2, extracting 2D and 3D coordinates of relevant body joints and capturing the RGB image, grayscale depth image, and infrared image of the scene), Leap Motion (a sensor device capturing hand and finger movement), accelerometers (e.g., those onboard Android smartphones, connected via network by means of the Mobiles to EyesWeb app, see Figure 3, or X-OSC sensors), Arduino, Nintendo Wiimote, RFID (Radio-Frequency Identification) devices (e.g., used to detect the presence of a user in a specific area), and biometric sensors (e.g., respiration, heart rate, skin conductivity, and so on). Output includes audio, video, 2D and 3D graphics, and possible control signals to actuator devices (e.g., haptic devices, robots). Moreover, EyesWeb implements standard networking protocols such as, for example, TCP, UDP, OSC (Open Sound Control), and ActiveMQ. In this way, it can receive data from and send data to any device endowed with network communication capabilities, including smart objects.

Figure 3: A view of the Mobiles to EyesWeb Android application on the Google Play Store.
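For instance, a smart device can stream sensor data to a running patch over OSC. Below is a minimal sender sketch using the third-party python-osc package; the host, port, and the /accel address pattern are placeholders, since the receiving patch defines what it actually listens for:

```python
from pythonosc.udp_client import SimpleUDPClient

# Placeholders: use the address/port of the machine running the EyesWeb patch
client = SimpleUDPClient("192.168.1.10", 8000)

# Send one accelerometer sample; "/accel" is a made-up OSC address pattern --
# the receiving patch defines the pattern it listens for.
client.send_message("/accel", [0.02, -0.98, 0.11])
```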
Classes of users and interfaces
Users of EyesWeb can be ideally grouped into several classes. EyesWeb end users usually do not directly deal with the platform, but rather experience the multimodal interactive systems that are implemented using the platform. Their interface is therefore a natural interface they can operate by means, e.g., of their expressive movement and gesture to act upon multimedia content. In such a way, end users can also interact with smart devices they may wear or hold (e.g., smartphones). Smart devices can either be used to directly operate on content, or the data they collect can be presented back to end users, e.g., using data sonification [5] and visualization technologies.

EyesWeb developers make applications (patches) using the EyesWeb GDE and its visual programming language. This implements all the basic constructs of programming languages, such as sequences (a chain of interconnected blocks), conditional instructions (e.g., by means of switch blocks that direct the flow of data to a specific part of a patch when a given condition is matched), iterative instructions (by means of a specific mechanism that allows executing a given sub-patch repetitively), and subprograms (implemented as sub-patches). Because of the visual programming paradigm, EyesWeb developers do not need to be computer scientists or expert programmers. In our experience, EyesWeb patches were developed by artists, technological staff of artists (e.g., sound technicians), designers, content creators, students in performing arts and digital humanities, and so on. Still, some skill in the usage of computer tools and, especially, in algorithmic thinking is required. EyesWeb developers can exploit EyesWeb as a tool for fast prototyping of applications for smart objects: EyesWeb can receive data from such objects by means of its input devices, and the task of the developer is to design and implement in a patch the control flow of the application.

Supervisors of EyesWeb patches are users who supervise and control the execution of an EyesWeb application. They can both set parameters to customize the execution of the patch before running it and act upon the patch (e.g., by changing the value of some parameters) while the patch is running. Consider, for example, a teacher who customizes an educational serious game for her pupils and operates on the game as a child is playing with it, or a therapist who sets, e.g., the target and difficulty level of an exergame for a patient. Supervisors of EyesWeb patches do not need any particular skill in computer programming; indeed, they are usually a special kind of end user, with a specific expertise in the given application area. The EyesWeb Mobile tool endows them with traditional Graphical User Interfaces they can use for their task of customizing and controlling the execution of a patch. In such a way, they do not need to go into the details of the EyesWeb GDE, and they work with an interaction paradigm that is more familiar to them. Moreover, the patch is protected from possible unwanted modifications. The EyesWeb Mobile interfaces can also work on mobile devices (e.g., tablets) to facilitate operations when the operator cannot stay at the computer (e.g., a therapist who needs to participate in the therapy session). This feature also makes EyesWeb Mobile suitable for the remote configuration of smart object applications. Figure 4 shows an EyesWeb Mobile interface developed for enabling a teacher to customize a serious game for children.

Figure 4: An EyesWeb Mobile interface developed for enabling a teacher to customize a serious game for children.

EyesWeb programmers develop new software modules for the EyesWeb platform. They need to be skilled C++ programmers and are endowed with the EyesWeb SDK, which enables them to extend the platform with third-party modules. In particular, the EyesWeb SDK enables including in the platform new modules for interfacing smart objects that, e.g., do not communicate through standard networking protocols or are not yet supported by the platform.

Conclusion
EyesWeb is nowadays employed by thousands of users spanning several application domains. It was adopted in both research and industrial projects by research centers, universities, and companies. A one-week tutorial, the EyesWeb Week, is organized every two years at our research center. In our experience at Casa Paganini - InfoMus, EyesWeb was used in research projects that combine scientific research in information and communications technology (ICT) with artistic and humanistic research [3]. In this context, the platform provided the technological ground for artistic performances (e.g., in Allegoria dell'opinione verbale by R. Doati, Medea by A. Guarnieri, and Invisible Line by A. Cera, to name just some of them), for multimodal interactive systems supporting new ways of experiencing art and cultural heritage (e.g., the Museum of Bali in Fano, Italy; the Museum of La Roche d'Oëtre in Normandy, France; the Enrico Caruso Museum near Florence, Italy), for serious games in educational settings (e.g., The Potter and BeSound), and for exergames for therapy and rehabilitation (e.g., in an ongoing collaboration with the children's hospital Giannina Gaslini in Genova, Italy). EyesWeb is currently being improved and extended in the framework of the EU-H2020-ICT DANCE Project, investigating how sound and music can express, represent, and analyze the affective and relational qualities of body movement.

The need to support such a broad range of application domains required a trade-off between implementing general-purpose mechanisms and exploiting domain-specific knowledge. On the one hand, support for generality sometimes entails somewhat reduced learnability and usability of the platform for less-skilled EyesWeb developers, due to the increased complexity of the implemented mechanisms. On the other hand, some EyesWeb modules developed on purpose for specific application scenarios have a limited scope and are difficult to reuse. Future directions and open challenges include increased cross-platform interoperability and a tighter integration with cloud services and storage technologies.

Acknowledgements
This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 645553 (H2020-ICT Project DANCE, http://dance.dibris.unige.it). DANCE investigates how affective and relational qualities of body movement can be expressed, represented, and analyzed by the auditory channel.
REFERENCES
1. Antonio Camurri, Shuji Hashimoto, Matteo Ricchetti, Andrea Ricci, Kenji Suzuki, Riccardo Trocca, and Gualtiero Volpe. 2000. EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems. Computer Music Journal 24, 1 (April 2000), 57–69.
2. Antonio Camurri, Barbara Mazzarino, and Gualtiero Volpe. 2004. Analysis of expressive gesture: The EyesWeb expressive gesture processing library. In Gesture-based communication in human-computer interaction. Springer, 460–467.
3. Antonio Camurri and Gualtiero Volpe. 2016. The Intersection of Art and Technology. IEEE MultiMedia 23, 1 (Jan 2016), 10–17.
4. Eyecon. 2008. http://eyecon.palindrome.de/. Accessed: 2016-03-25.
5. Thomas Hermann. 2008. Taxonomy and Definitions for Sonification and Auditory Display. In Proceedings of the 14th International Conference on Auditory Display (ICAD 2008), Patrick Susini and Olivier Warusfel (Eds.). IRCAM.
6. Isadora. 2002. http://troikatronix.com/. Accessed: 2016-03-25.
7. Thomas Kreuz, Daniel Chicharro, Conor Houghton, Ralph G. Andrzejak, and Florian Mormann. 2012. Monitoring spike train synchrony. Journal of Neurophysiology (2012).
8. Norbert Marwan, M. Carmen Romano, Marco Thiel, and Jürgen Kurths. 2007. Recurrence plots for the analysis of complex systems. Physics Reports 438, 5–6 (2007), 237–329.
9. Max. 1988. http://cycling74.com/products/max/. Accessed: 2016-03-25.
10. Miller Puckette. 1996. Pure Data. In Proceedings of the International Computer Music Conference. 224–227.
11. Rodrigo Quian Quiroga, Thomas Kreuz, and Peter Grassberger. 2002. Event synchronization: A simple and fast method to measure synchronicity and time delay patterns. Physical Review E 66, 4 (Oct 2002), 041904.
12. Giovanna Varni, Gualtiero Volpe, and Antonio Camurri. 2010. A system for real-time multimodal analysis of nonverbal affective social interaction in user-centric media. IEEE Transactions on Multimedia 12, 6 (2010), 576–590.
13. vvvv. 1998. http://vvvv.org/. Accessed: 2016-03-25.
14. Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André. 2013. The Social Signal Interpretation (SSI) Framework: Multimodal Signal Processing and Recognition in Real-time. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 831–834.