Navigating Wall-sized Displays with the Gaze: a Proposal for Cultural Heritage

Davide Maria Calandra, Dario Di Mauro, Francesco Cutugno, Sergio Di Martino
Department of Electrical Engineering and Information Technology
University of Naples "Federico II"
80127, Naples, Italy
{davidemaria.calandra, dario.dimauro, cutugno, sergio.dimartino}@unina.it

ABSTRACT
New technologies for innovative interactive experiences represent a powerful medium to deliver cultural heritage content to a wider range of users. Among them, Natural User Interfaces (NUIs), i.e. non-intrusive technologies that require the user neither to wear devices nor to use external hardware (e.g. keys or trackballs), are considered a promising way to broaden the audience of specific cultural heritage domains, like the navigation of and interaction with digital artworks presented on wall-sized displays.

Starting from a collaboration with a world-famous Italian designer, we defined a NUI to explore 360° panoramic artworks presented on wall-sized displays, such as virtual reconstructions of ancient cultural sites or renderings of imaginary places. Specifically, we let the user "move the head" as a natural way to explore and navigate these large digital artworks. To this aim, we developed a system including a remote head pose estimator that captures the movements of users standing in front of the wall-sized display: starting from a central comfort zone, as users move their head in any direction, the virtual camera rotates accordingly. With NUIs, it is difficult to get feedback from users about their interest in the point of the artwork they are looking at. To solve this issue, we complemented the gaze estimator with a preliminary emotional analysis solution, able to implicitly infer the interest of the user in the shown content from his/her pupil size.
A sample of 150 subjects was invited to experience the proposed interface at an International Design Week. Preliminary results show that most of the subjects were able to properly interact with the system from the very first use, and that the emotional module is an interesting solution, even if further work must be devoted to addressing specific situations.

Categories and Subject Descriptors
H.5.2 [User Interfaces]: Interaction styles

1. INTRODUCTION
Wall-sized displays represent a viable and common way to present digital content on large projection surfaces. They are applied in many contexts, like advertisement, medical diagnosis, Business Intelligence, etc. In the Cultural Heritage field, too, this type of display is highly appreciated, since it is particularly suited to showing visitors artworks that are difficult or impossible to move, being a way to explore the digital counterpart of real/virtual environments. On the other hand, the problem with these displays is how to mediate the interaction with the user. Many solutions have been proposed, with different trade-offs among intrusiveness, calibration and achievable precision. Recently, some proposals have been developed that exploit the direction of the gaze of the visitor in front of the display as a medium to interact with the system. The simple assumption is that if the user looks towards an edge of the screen, he/she is interested in discovering more content in that direction, and the digital scenario should be updated accordingly. In this way, there is no need to wear a device, which makes it easier for a heterogeneous public to enjoy the digital content.

Detecting the gaze is nevertheless a challenging task, still with some open issues. To estimate the Point of Gaze (PoG), it is possible to exploit the eye movements, the head pose or both [23], and either to require special hardware to be worn (e.g. [12]) or to develop remote trackers (e.g. [6]). The latter are not able to provide high accuracy, but this is an acceptable compromise in many scenarios, like Cultural Heritage, where requiring special hardware of the visitors is usually not feasible.

For the Tianjin International Design Week 2015¹, we were asked to develop a set of technological solutions to improve the fruition of a 360° digital reconstruction, projected on a wall-sized display, of the "Camparitivo in Triennale"², a lounge bar (see Figure 1) located in Milan, Italy, designed by one of the most famous Italian designers, Matteo Ragni, to celebrate the Italian liqueur Campari. The requirements for the solution were to define a Natural User Interface (NUI) which constrains users neither to maintain a fixed distance from the display nor to wear an external device.

Figure 1: Matteo Ragni's "Camparitivo in Triennale"

To achieve our task, we designed a remote PoG estimator for wall-sized displays on which 360° virtual environments are rendered. A further novel element of the proposal is the exploitation of another implicit communication channel of the visitor, i.e. his/her attention towards the image represented on the display. To this aim, we remotely monitor pupil size variations, as they are significantly correlated with the arousal level of users while performing a task. This information can be useful in the first place to the artist, as pupils dilate when visitors are looking at pleasant images [9]. Moreover, logging the pupil dilation (mydriasis) during an interaction session can be a reliable source of information, useful also to analyze the usability level of the interface, since pupils dilate when users are required to perform difficult tasks, too [11][3].

In this paper we describe both the navigation with the remote PoG estimator and the solution for logging the mydriasis, together with a preliminary case study. More in detail, the rest of the paper is structured as follows: in Section 2, we explain the navigation paradigm for cultural content with the gaze, detailing the steps we performed to detect and track the PoG. In Section 3, we explain how mydriasis detection can be a useful strategy to investigate the emotional reactions of users enjoying cultural content, and we detail the steps we perform to get the pupil dilation. In Section 4, we present the case study, Matteo Ragni's Camparitivo in Triennale, showing how we allow visitors to navigate the digital rendering of the lounge bar on a wall-sized display, and reporting some preliminary usability results. Section 5 concludes the paper, also presenting future research directions.

¹ http://tianjindesignweek.com/
² http://www.matteoragni.com/project/camparitivo-in-triennale/
© 2016 Copyright for this paper by its authors. Copying permitted for private and academic purposes.
2. NAVIGATING WITH THE GAZE
Even if wearable eye trackers are becoming smaller and more comfortable, they still have an impact on the quality of a cultural visit. We believe that the user experience strongly depends on the capability of the user to establish a direct connection with the artworks, without the mediation of a device. For this reason, in order to allow the user to explore a 360° cultural heritage environment using only his/her point of gaze, we focused on developing a remote head pose estimator for wall-sized displays, which requires users neither to wear any external device nor to perform any prior calibration.

The contents that we aim to navigate are 360° virtual environments, expressed as a sequence of 360 frames with a step size of 1°. Thus, navigating the content to the left (right) means showing the previous (next) frame of the sequence. As we want visitors to feel the sensation of enjoying an authentic large environment, the wall-sized display represents the content with real proportions. If on one side this choice improves the quality of the fruition, because it reduces the gap between real and virtual environments, on the other side an entire façade of a building cannot realistically be represented in one frame. This requires additional complexity, since we also have to support a vertical scroll of the content, to show the parts of the frame that are not visible (see the sketch below).
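As an illustration of this content model, the following minimal sketch (in Python; the class name, the MAX_V_OFFSET parameter and the unit step granularity are our assumptions, not part of the deployed system) shows the circular 360-frame sequence and the clamped vertical scroll:

```python
# Minimal sketch of the panorama content model described above:
# 360 pre-rendered frames, one per degree, navigated circularly,
# plus a clamped vertical offset for the parts that do not fit.
MAX_V_OFFSET = 200  # pixels of vertical scroll allowed (assumption)

class Panorama:
    def __init__(self):
        self.frame_index = 0  # 0..359, one frame per degree
        self.v_offset = 0     # vertical scroll within the frame

    def rotate(self, step):
        # Navigating left/right shows the previous/next frame;
        # the modulo keeps the 360-frame sequence circular.
        self.frame_index = (self.frame_index + step) % 360

    def scroll(self, step):
        # Vertical scroll is clamped to the frame boundaries.
        self.v_offset = max(-MAX_V_OFFSET,
                            min(MAX_V_OFFSET, self.v_offset + step))
```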
More in detail, the development of NUIs to explore the content of wall-sized displays with the gaze requires two subtasks:

1. Defining techniques to estimate the PoG of the user while he/she is looking at the display, and
2. Defining a navigation logic associated to the PoG.

In the following, we provide technical details on how we faced these two tasks.

2.1 Point of gaze estimation
Head poses are usually computed by considering 3 degrees of freedom (DoF) [17], i.e. the rotations along the 3 axes of symmetry in space, x, y and z, shown in Figure 2.

Figure 2: Head movements.

Once the head pose in the space is known, the pupil center position can optionally refine the PoG estimation. For example, in the medical diagnosis scenario, to estimate the PoG, patients are usually not allowed to move their head [7], or they have to wear head-mounted cameras pointed towards their eyes [12]. In these cases, estimating the PoG means computing the pupil center position with respect to the ellipse formed by the eyelids, while the head position, when considered, is detected through IR sensors mounted on the head of the subjects. These systems grant an error threshold lower than 5 pixels [12], achievable thanks to strict constraints on the set-up, such as a fixed distance between eye and camera; on the other hand, they have a very high level of invasiveness for the users. In other scenarios, the PoG is estimated by means of remote trackers, such as the ones presented in [6], which determine the gaze direction from the head orientation. These systems do not limit users' movements and do not require them to wear any device.

In the cultural heritage context, gaze detection is mainly used for two tasks. The first one is related to artistic fruition: according to the "The More You Look The More You Get" paradigm [16], users focusing their gaze on a specific work of art, or on a part of it, may be interested in receiving additional content about that specific item. This usage of the gaze direction can be extremely useful in terms of improving the accessibility of cultural heritage information and enhancing the quality of the visit experience. The second task is related to understanding how people take decisions while visiting a museum: on which areas they focus and for how long; the outputs of gaze detectors are then gathered and analyzed [18].

Starting from an approach we already developed for small displays (between 50 x 30 cm and 180 x 75 cm) [4], we propose an extension for wall-sized ones, based on a combined exploitation of the head pose and the pupil size to explore digital environments. The general setting of the display is presented in Figure 3. In particular, the exhibition set-up includes a PC (the machine on which the software runs), a webcam W which acquires the input stream, and a projector P which beams the cultural content on the wall-sized display D. We assume the user to stand almost centrally with respect to D and with a frontal position of the head with respect to the body.

Figure 3: Gaze detection: experimental settings.

In the previous work with small displays [4], we used an eye-tracking technology to estimate the gaze, since we experienced that, for limited sizes, users just move their eyes in order to visually explore the surface of the artwork. On the other hand, in the case of wall-sized displays, users have to move their head as well, thus performing limited ocular movements. Therefore, a head pose estimator is needed. To this end, in accordance with related work [8], we developed a solution aimed at tracking the nose tip of the user in 3 Degrees of Freedom (DoF). Indeed, the nose tip is easy to detect and, since it can be considered a good approximation of the head centroid, given the precision required by our domain, it is a useful indicator of the head position in three-dimensional space.

2.1.1 Nose Tip detection
The first step in the processing pipeline is to detect, within the video stream from the webcam, the face of the user. According to the literature, this task can be executed with different strategies, which can be grouped into two main sets: the image-based ones, such as skin detection [10], and the feature-based ones. In our approach, the detection of the face is based on a solution from the second group, namely the Haar feature-based Viola-Jones algorithm [24]. In a first implementation, we scanned the entire image to locate the face; subsequently, this search was improved by providing as input the range of sizes for a valid face, depending on the distance between user and camera.

Within the area of the face, the nose tip search is also performed by means of the Viola-Jones algorithm, in terms of its OpenCV implementation, which returns the nasal area centered on its tip. Initially, we searched for the nose by scanning the entire face; then, we considered that the search could be improved by taking advantage of the facial geometric constraints [13], to increase both precision and computational efficiency. In particular, the nose can be easily found starting from the facial axis for the y coordinate, and from the middle point of the face for the x and z coordinates. We performed the search on images of size 1280 x 960 pixels, processed on an Intel Core i7 at 2.2 GHz; initially, the detection time was about 100 ms. The optimizations on the face and nose search allowed us to locate the face and the nose in 35 ms on average, reducing the computation time by about 65%.
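The following sketch illustrates a plausible shape of this two-stage detection with the OpenCV Python bindings. The paper specifies Viola-Jones [24], the size constraints and the geometric restriction of the nose search [13]; the nose cascade file, the concrete size ranges and the ROI proportions used here are illustrative assumptions:

```python
import cv2

# The frontal face cascade ships with OpenCV; a nose cascade
# (e.g. haarcascade_mcs_nose.xml) is assumed to be available locally.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
nose_cascade = cv2.CascadeClassifier("haarcascade_mcs_nose.xml")

def detect_nose_tip(gray):
    """Return the (x, y) nose tip in image coordinates, or None."""
    # Constrain the face size to values plausible for a user standing
    # about 1 m from the camera (illustrative numbers, not the paper's).
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5,
                                          minSize=(200, 200),
                                          maxSize=(600, 600))
    if len(faces) == 0:
        return None
    fx, fy, fw, fh = faces[0]
    # Facial geometric constraints [13]: search the nose only in the
    # central band of the face instead of scanning the whole face.
    roi = gray[fy + fh // 3: fy + 2 * fh // 3,
               fx + fw // 4: fx + 3 * fw // 4]
    noses = nose_cascade.detectMultiScale(roi, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(noses) == 0:
        return None
    nx, ny, nw, nh = noses[0]
    # The detector returns the nasal area centered on its tip.
    return (fx + fw // 4 + nx + nw // 2, fy + fh // 3 + ny + nh // 2)
```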
2.1.2 Nose Tip tracking
The previously described features are searched for either the first time a user is detected or when the tracking is lost. In all the other frames, the nose tip is simply tracked.

Several strategies have been proposed to track motion; they can be categorized into three groups: feature-based, model-based and optical flow-based. Generally speaking, the feature-based strategies involve the extraction of templates from a reference image and the identification of their counterparts in the subsequent images of the sequence. Some feature-based algorithms need to be trained, for example those based on Hidden Markov Models (HMM) [21] or Artificial Neural Networks (ANN) [14], while others are unsupervised, like for instance the Mean Shift Tracking algorithms [28]. Although the model-based strategies could be considered a specific branch of the feature-based ones, they require some a-priori knowledge about the investigated models [27]. The optical flow is the vector field which describes how the image changes over time; it can be computed with different strategies, for example from the image gradient.

In our approach, we adopted an unsupervised feature-based algorithm. Thus, we first store the image region containing the feature (i.e. the nose tip), to be used as a template. Then, we apply the OpenCV template matching method to find a match between the current frame and the template. The method scans the current frame, comparing the template pixels against the source frame, and stores each comparison result in the resulting matrix. The source frame is not scanned in its entirety: only a Region of Interest (ROI) is taken into account, corresponding to the area around the template coordinates in the source image. The resulting matrix is then analysed to find the best similarity value, depending on the matching criterion given as input. We used the Normalized Sum of Squared Differences (NSSD) as the matching criterion, whose formula is reported in Equation 1:

R(x,y) = \frac{\sum_{x',y'} \left( T(x',y') - I(x+x',\, y+y') \right)^2}{\sqrt{\sum_{x',y'} T(x',y')^2 \cdot \sum_{x',y'} I(x+x',\, y+y')^2}}    (1)

In Equation 1, T is the template image and I is the input frame in which we expect to find a match. The coordinates (x, y) represent the generic location in the input image whose content is being compared to the corresponding pixel of the template, located at (x', y'). R is the resulting matrix, and each location (x, y) in R contains the corresponding matching result. The minimum values in R represent the minimum differences between the input image and the template, indicating the most likely position of the feature in the image. Thus, while a perfect match will have a value of zero, a mismatch will have a larger sum of squared differences. When the mismatch value exceeds the confidence level [19], the tracking is lost.
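For illustration, the tracking step can be sketched as follows with OpenCV, whose TM_SQDIFF_NORMED matching criterion corresponds to Equation 1. The ROI margin and the numeric confidence level are illustrative assumptions; the paper only states that the ROI surrounds the previous template coordinates and that exceeding the confidence level [19] means losing the tracking:

```python
import cv2

LOST_THRESHOLD = 0.2  # assumed confidence level [19]; 0 = perfect match

def track_nose_tip(gray, template, last_xy, margin=40):
    """Return the new (x, y) of the template center, or None if lost."""
    th, tw = template.shape
    x, y = last_xy
    # Restrict the search to a ROI around the previous position,
    # clipped to the frame borders.
    x0, y0 = max(0, x - tw // 2 - margin), max(0, y - th // 2 - margin)
    x1 = min(gray.shape[1], x + tw // 2 + margin)
    y1 = min(gray.shape[0], y + th // 2 + margin)
    roi = gray[y0:y1, x0:x1]
    # TM_SQDIFF_NORMED is the normalized sum of squared differences
    # of Equation 1; the best match is the minimum of the result.
    result = cv2.matchTemplate(roi, template, cv2.TM_SQDIFF_NORMED)
    min_val, _, min_loc, _ = cv2.minMaxLoc(result)
    if min_val > LOST_THRESHOLD:
        return None  # tracking lost: fall back to detection
    return (x0 + min_loc[0] + tw // 2, y0 + min_loc[1] + th // 2)
```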
2.2 Projecting the Nose Tip for Navigation
Our second task is to associate an action to the gaze. To this aim, we have to understand where the user is looking on the wall-sized display. Since we can approximately interpret the nose tip as the centroid of the head, in order to provide a coherent PoG estimation we have to solve the proportion that transposes the nose tip coordinates into the display reference system. To this aim, we geometrically project its coordinates onto the reference system of the observed wall-sized display. These new coordinates are calculated and then tracked with respect to the shown frame. The area of the wall-sized display is considered as a 3x3 matrix, as shown in Figure 4. What we do in the current implementation is to indicate in which cell of the matrix the gaze falls.

Figure 4: Matrix Model of the Wall-Sized Display.

When the user stands in front of the display with the head centered in a frontal position, the geometric projection of his/her nose tip falls into cell #5 of the matrix (2nd row, 2nd column). We defined the size of the central cell so as to obtain a kind of "comfort zone", where minor movements of the head do not trigger any movement of the rendered image. In detail, head rotations up to 15 degrees on the x axis and up to 8 degrees on both the y and z axes do not affect the gaze position. With wider rotations, the projection of the nose falls into another cell, and the digital image is shifted accordingly.

According to the event-condition-action paradigm [22], the event is the identification of a fixation point; the condition is marked by the index of the cell in the 3x3 matrix, and the corresponding action is defined in Figure 5. In particular, as explained in Figure 5, when the PoG falls in cell #4 or #6, we associate the action of navigating the content to the left or to the right side, respectively. When the user observes section #2 or #8, the content is navigated upwards or downwards; section #5 is interpreted as the area in which no action is executed. When the PoG falls in the remaining cells, the content is navigated in the respective diagonal directions.

Figure 5: Input actions associated with the gaze directions.

In the current implementation, since we just associate a cell of the matrix to the PoG, the speed of the scroll is fixed and independent of the position of the PoG within a lateral cell of the matrix. We are currently implementing a new version of the navigation paradigm, where this 3x3 matrix will be replaced by a continuous function, and the speed of the scroll will be proportional to the distance of the PoG from the center of the display. A sketch of the current cell-based logic follows.
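The following minimal sketch puts together the cell mapping of Figure 4 and the actions of Figure 5, reusing the Panorama class sketched earlier in this section. The row-major cell numbering and the unit steps are our reading of the figures; the degree thresholds of the comfort zone (15° on x, 8° on y and z) are assumed to be applied before the geometric projection:

```python
def cell_of(px, py, width, height):
    """Map projected display coordinates to a cell index in 1..9
    (row-major, as in Figure 4; cell 5 is the comfort zone)."""
    col = min(2, int(3 * px / width))
    row = min(2, int(3 * py / height))
    return 3 * row + col + 1

def apply_action(cell, panorama):
    """Execute the navigation action of Figure 5 for the given cell."""
    # Horizontal component: cells 4/6 navigate left/right; the corner
    # cells combine it with the vertical scroll (diagonal directions).
    if cell in (1, 4, 7):
        panorama.rotate(-1)
    elif cell in (3, 6, 9):
        panorama.rotate(+1)
    # Vertical component: cells 2/8 scroll up/down.
    if cell in (1, 2, 3):
        panorama.scroll(-1)
    elif cell in (7, 8, 9):
        panorama.scroll(+1)
    # Cell 5 is the comfort zone: no action is executed.
```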
3. THE EMOTIONAL CONTRIBUTION
One of the problems with NUIs based on PoG estimation is that it is difficult to understand the reaction of the user in terms of interest towards the shown content [26]. To address this issue, we developed a further video processing module, intended as a complement to the system presented in the previous section and able to detect implicit user feedback. The output of this module can be used for a twofold objective: it can trigger real-time reactions from the system, and/or it can provide a powerful post-visit tool to the curator of the exhibition, with a log of the reactions of the visitors to the shown digital content. In this way, the curator can get a better insight into the content which sparks the highest interest in the visitors. In the following we provide some technical details on how we faced this issue.

3.1 The Mydriasis
A wide range of medical studies proved that the brain reacts to emotional arousal with involuntary actions performed by the sympathetic nervous system (e.g. [9][11]). These changes manifest themselves in a number of ways, such as an increased heart rate, a higher body temperature, muscular tension and pupil dilation (or mydriasis). Thus, it can be interesting to monitor one or more of these involuntary activities to discover the emotional reactions of the visitors while they are enjoying the cultural contents, in order to understand which details arouse pleasure.

In the age of wearable devices, there are many sensors with health-oriented capabilities, like for instance armbands or smartwatches, that can monitor some of these involuntary actions of our body. For instance, information about the heart rate or the body temperature can be obtained by means of sensors which retrieve electric signals once they are applied on the body. If on one side these techniques grant an effective level of reliability, on the other side they could influence the expected results of the experiments, as users tend to change their reactions when they feel under examination [11]. Moreover, they would require the visitors to wear some special device (which also entails high costs for the exhibition), which could be a non-viable solution in many contexts. For these reasons, we again looked for a remote solution, able to get an insight into the emotional arousal of the visitors without requiring them to wear any device.

Given the set-up described in Section 2.1, we tried to exploit the additional information we can get from the video stream collected by the webcam. In particular, we tried to remotely monitor the behaviour of the pupils during the interaction with the wall-sized display. Let us note that, as both pupils react to stimuli in the same way, we studied the behaviour of one pupil only.

Pupils are larger in children and smaller in adults, and the normal size varies from 2 to 4 mm in diameter in bright light, and from 4 to 8 mm in the dark [25]. Moreover, pupils react to stimuli in 0.2 s, with the response peaking in 0.5 to 1.0 s [15]. Hess presented 5 visual stimuli to male and female subjects and observed that the increase in pupil size varied between 5% and 25% [9].

3.2 Pupil detection
Before detecting the pupil, we have to locate and track the eye in the video stream coming from the webcam. The detection is performed by means of the Haar feature-based Viola-Jones algorithm [24], already cited in Section 2.1.1, while the tracking of the pupil is done with the template matching technique, as described in Section 2.1.2.

The detected ocular region contains eyelids, eyelashes, shadows and light reflexes. These represent noise for pupil detection, as they could interfere with the correctness of the results. Thus, the eye image has to be pre-processed before searching for the pupil size. We developed a solution including the following steps to perform the pre-processing:

1. The gray-scaled image (Figure 6a) is blurred by means of a median filter, in order to highlight well-defined contours;
2. The Sobel partial derivative on the x axis reveals the significant changes in intensity, allowing us to isolate the eyelids;
3. A threshold filter identifies the sclera.

As a result, these steps produce a mask which allows us to isolate the eyeball from the source image (a sketch of this masking step follows).
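A possible rendering of the three pre-processing steps with OpenCV is sketched below. The kernel sizes, the threshold values and the way the partial results are combined into a single mask are illustrative assumptions, as the paper does not report them:

```python
import cv2

def eye_mask(eye_gray):
    """Build a mask that isolates the eyeball in the ocular region."""
    # Step 1: median filter to smooth noise while keeping contours.
    blurred = cv2.medianBlur(eye_gray, 5)
    # Step 2: Sobel partial derivative on x to reveal the eyelid edges.
    sobel_x = cv2.Sobel(blurred, cv2.CV_8U, 1, 0, ksize=3)
    _, edges = cv2.threshold(sobel_x, 60, 255, cv2.THRESH_BINARY)
    # Step 3: a plain threshold picks out the bright sclera.
    _, sclera = cv2.threshold(blurred, 160, 255, cv2.THRESH_BINARY)
    # Combined mask: everything that is neither eyelid edge nor sclera.
    return cv2.bitwise_not(cv2.bitwise_or(edges, sclera))
```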
Pupil detection is then performed on the source image as follows:

1. We drop down to zero (black) all the pixels having a cumulative distribution function value greater than a certain threshold [1] (Figure 6b);
2. We morphologically transform the resulting binary image by means of a dilation process, to remove the light reflexes on the pupil;
3. A contour detection operation identifies some neighbourhoods (Figure 6c);
4. The pupillary area is found by selecting the region having maximum area (Figure 6d);
5. The center of the ellipse (Figure 6e) best fitting the pupillary area approximates the pupil center (Figure 6f).

Figure 6: Pupil processing steps.

Once we have detected the pupil, to calculate the mydriasis we store the first computed radius and, frame by frame, we compare the radii calculated during the following iterations against the first one: according to Hess, when the increase exceeds 5%, a mydriasis is signaled.
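The five detection steps and the 5% dilation criterion can be sketched as follows; the fixed CDF threshold is an illustrative value (whereas [1] derives it adaptively), and the radius is approximated from the fitted ellipse axes:

```python
import cv2
import numpy as np

def detect_pupil(eye_gray, mask, cdf_thresh=0.05):
    """Return (center, radius) of the pupil, or None if not found."""
    # Step 1: keep only the darkest pixels, i.e. those whose cumulative
    # distribution function value is below the threshold [1].
    pixels = eye_gray[mask > 0]
    hist = np.bincount(pixels.ravel(), minlength=256)
    cdf = np.cumsum(hist) / max(1, hist.sum())
    level = int(np.searchsorted(cdf, cdf_thresh))
    binary = np.where((eye_gray <= level) & (mask > 0),
                      255, 0).astype(np.uint8)
    # Step 2: dilation removes the light reflexes inside the pupil blob.
    binary = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=2)
    # Steps 3-4: detect the candidate contours and keep the largest.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x
    contours = [c for c in contours if len(c) >= 5]  # fitEllipse needs 5
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    # Step 5: the center of the best-fitting ellipse approximates the
    # pupil center; a quarter of the summed axes approximates the radius.
    (cx, cy), (ax1, ax2), _ = cv2.fitEllipse(largest)
    return (cx, cy), (ax1 + ax2) / 4.0

def is_mydriatic(radius, baseline_radius):
    # According to Hess [9], a dilation above 5% of the first measured
    # radius is signaled as a mydriasis.
    return (radius - baseline_radius) / baseline_radius > 0.05
```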
To log all these implicit feedbacks, during the interaction a parallel thread keeps track of the observed sections and the related emotional reactions. In particular, at fixed steps of 200 ms, the thread saves the current timestamp, the index of the observed section and an integer value representing the pupil status. If the pupil has a normal size, the pupil status is 0, otherwise it is 1. If the system does not detect a face for a given time (10 seconds, specifically), the interaction session is considered terminated and the collected information is stored in an XML document. The structure of the XML document is shown in Listing 1.

Listing 1: A snippet of the logging file
<reportCollection>
  <report id="0">
    <track idTs="1402674690300" section="1" mydriasis="0" />
    <track idTs="1402674690500" section="1" mydriasis="0" />
    <track idTs="1402674690700" section="1" mydriasis="0" />
    <track idTs="1402674690900" section="1" mydriasis="0" />
    <track idTs="1402674691100" section="1" mydriasis="0" />
  </report>
  <report id="1">
    <track idTs="1402675341320" section="1" mydriasis="0" />
    <track idTs="1402675341520" section="0" mydriasis="0" />
    <track idTs="1402675341720" section="0" mydriasis="0" />
    <track idTs="1402675341920" section="0" mydriasis="0" />
  </report>
</reportCollection>

The XML document is created and initialized with an empty <reportCollection> when the application starts; then, when each interaction session ends, a new <report> subtree is created. The timestamp values univocally identify the respective <track> elements. Given this simple structure, it is easy to perform subsequent analyses of the interaction sessions of the visitors.
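A minimal sketch of a logger producing the structure of Listing 1, using only the Python standard library; the class name, the method names and the output file name are our assumptions:

```python
import time
import xml.etree.ElementTree as ET

class InteractionLogger:
    def __init__(self):
        # The document starts as an empty <reportCollection>.
        self.root = ET.Element("reportCollection")
        self.report = None
        self.next_id = 0

    def track(self, section, mydriasis):
        """Called every 200 ms while a face is detected."""
        if self.report is None:  # a new interaction session begins
            self.report = ET.SubElement(self.root, "report",
                                        id=str(self.next_id))
            self.next_id += 1
        ET.SubElement(self.report, "track",
                      idTs=str(int(time.time() * 1000)),  # ms timestamp
                      section=str(section),
                      mydriasis=str(int(mydriasis)))

    def end_session(self, path="log.xml"):
        """Called when no face has been detected for 10 seconds."""
        self.report = None
        ET.ElementTree(self.root).write(path, encoding="utf-8",
                                        xml_declaration=True)
```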
4. THE CASE STUDY
The system we developed was shown at the Tianjin International Design Week 2015, at the personal exposition dedicated to the Italian designer Matteo Ragni. In particular, the software was used to let the visitors navigate with the gaze the 360° virtual reconstruction of Matteo Ragni's Camparitivo in Triennale, on a wall-sized display. In order to implement the case study, we started from the design model of the Camparitivo in Triennale, in Rhino3D format³, including the textures obtained from photos, and we placed a virtual camera into the center of the model, to have the point of view of a visitor inside the Camparitivo. With these settings, we rendered a complete rotation of the camera around a fixed vertical axis, corresponding to the imaginary neck of the visitor, in order to obtain photorealistic, ray-traced reflections on the mirrors. With this setup, we obtained 360 images with a step size of 1 degree. An illustrative frame is shown in Figure 7. We considered each frame as divided according to the matrix in Figure 4. Once the system indicated the observed section of the matrix, the respective action of Figure 5 was executed and the related frame was shown.

³ www.rhino3d.com

Figure 7: A frame of the rendered model.

4.1 The experiments
Basically, motor tasks such as "look over there" are performed in video games by hand-controlled operations, because they are usually executed with classical input devices such as joysticks, joypads, keyboards or mice. Our work represents an attempt to improve the naturalness of this kind of interaction, by associating the task with its implicitly corresponding interface. We left the users free to interact with the application, without giving them any kind of instruction or support. The only source of information for them was the panel shown in Figure 8, explaining that the input was given by the head movements and not by the eyes.

During the exposition, more than 150 visitors experienced our stand, standing at 1 meter from a webcam mounted at 160 cm of height, as shown in Figure 8. Among all the visitors, 51 English-speaking ones agreed to answer a quick oral interview, as we could not submit written questionnaires during the public event for logistic reasons.

Figure 8: The experimental setting.

After we asked users whether it was the first time they experienced a gaze-based application, we submitted the following questions to them:

1. Do you think this kind of application is useful to improve the museum fruition?
2. Did you find the application easy to understand?
3. Did you find any difficulties during the interaction?
4. How old are you?

Participating subjects were grouped into three subsets according to their age, where all the subsets have the same number of subjects. Group A has users whose age is between 18 and 35 years; group B corresponds to people from 36 to 65 years old; group C is composed of users older than 65 years. We did not make a distinction between male and female subjects. For all of them, it was the first time they tried a gaze-based IT solution.

4.2 Results
The results of this very preliminary evaluation of the proposal are reported in Figure 9, where the histograms represent the percentage of positive answers given by the subjects over the total of answers. Please note that for Q1 and Q2 the higher the result, the better the feedback, while for Q3 the lower, the better.

Figure 9: Cumulative Results of the Interviews

Interpreting the comments of the users, as for Q1, we see that the vast majority of the subjects believe the proposed interface is useful to improve the cultural experience. People older than 65 are less enthusiastic, but this is somewhat an expected result. As for Q2, an even higher percentage of subjects found the application easy to understand, with less difference among the three groups. Finally, as for Q3, we found that some of the subjects encountered difficulties in interacting with the software, with a significant difference for group C with respect to the other two groups. In general, problems arose when visitors performed rapid or wide head movements. In both cases, this led to a failure of the nose tip tracker. In particular, when the users performed wide rotations, the template matching results exceeded the confidence level, causing the loss of the tracking. Similarly, rapid head movements caused a sudden reduction of the similarity between frame and template, causing the tracker to fail.

An objective survey of the user experience was conducted by analyzing the collected log data. In particular, we used the stored timestamps and the indexes of the observed Regions Of Interest to determine the duration of each interaction and the regions on which users concentrated their gaze. The data showed that 45% of the users performed a complete interaction, observing all 9 ROIs. According to the matrix in Figure 4, the most observed ROI was #4, observed by 88% of the users. The average duration of an interaction was 95 seconds per user.
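For illustration, statistics like the ones above can be derived from the log of Listing 1 with a few lines of code; this sketch assumes the file name used in the logger sketch of Section 3:

```python
import xml.etree.ElementTree as ET

def analyse(path="log.xml"):
    """Per-session duration and ROI coverage from the XML log."""
    root = ET.parse(path).getroot()
    complete, durations = 0, []
    for report in root.iter("report"):
        tracks = list(report.iter("track"))
        if not tracks:
            continue
        # Session duration from the first and last timestamps (ms).
        ts = [int(t.get("idTs")) for t in tracks]
        durations.append((max(ts) - min(ts)) / 1000.0)
        # A complete interaction touches all 9 ROIs of the matrix.
        if len({t.get("section") for t in tracks}) == 9:
            complete += 1
    n = max(1, len(durations))
    print("complete interactions: %.0f%%" % (100.0 * complete / n))
    print("average duration: %.1f s" % (sum(durations) / n))
```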
All in all, we can see from this very preliminary investigation that visitors largely enjoyed the experience with the gaze-based interaction.

As for the mydriatic reactions of the users, the picture is more problematic. We analyzed the logs of the exhibition, and we found that mydriatic reactions occurred in:

- 65% of cases for group A;
- 40% of cases for group B;
- 20% of cases for group C.

There are two considerations to draw from these numbers. The first is that, in general, the technological solution is not mature enough for a wide public. This is particularly true for Asian people, as all of the subjects had dark eyes, which makes the identification of the pupil more problematic. Some internal investigations we did with Caucasian subjects led to better results. The other conclusion is that there is a well-known difference in the mydriatic reactions with respect to the age of the subjects: the older they are, the smaller the differences in pupil size between the relaxed and the aroused states. So, it is clear that the emotional module requires further research efforts.

5. CONCLUSIONS
Wall-sized displays represent a viable solution to present artworks that are difficult or impossible to move. In this paper, we proposed a Natural User Interface to explore 360° digital artworks shown on wall-sized displays, which allows visitors to look around and explore virtual worlds using only their gaze, stepping away from the boundaries and limitations of the keyboard and mouse. We chose to accomplish the task by means of a remote head pose detector. As it does not require calibration, it represents an immediately usable solution for supporting digital environment navigation. Moreover, we developed a solution to monitor the mydriatic reactions of the subjects while they were using the system, to get implicit feedback on their interest in the represented digital content. A preliminary investigation we performed at the Tianjin International Design Week 2015 with 51 subjects gave us the feedback that gaze-based navigation can be well accepted by visitors, as it is felt as a way to improve the fruition of Cultural Heritage. Nevertheless, the monitoring of mydriatic reactions should still be improved, especially for people with dark eyes.

Anyhow, from the results we collected, there are still many potential research directions for this topic. First of all, we are currently developing a new version of the system where the display is no longer divided into a matrix; instead, there will be a smooth feedback from the system, whose rapidity of response will be proportional to the amount of movement performed by the head of the user. The second main research field is to extend this approach towards freely explorable 3D environments, so as to also support forward and backward navigation. The idea of enriching gaze with forward and backward navigation has been approached in different works. One solution is fly-where-you-look [2], in which the authors associate the interest of users in flying towards an area with the action of looking at it. This approach finds its basis in cognitive activities: in particular, some studies prove that more fixations on a particular area indicate that it is more noticeable, or more important, to the viewer than other areas [20]; Duchowski [5] estimates a mean fixation duration of 1079 ms. This approach represents a natural and simple solution to the task "look forward", but the activation time forces the user to wait for the operation to start, without doing anything, and it may feel like a waste of time. Finally, voice commands could also be a natural input to perform this task; thus, our current research direction is oriented towards providing better support for multimodal interaction.

6. ACKNOWLEDGEMENT
This work has been partly supported by the European Community and by the Italian Ministry of University and Research (MIUR) under the PON Or.C.He.S.T.R.A. (ORganization of Cultural HEritage and Smart Tourism and Real-time Accessibility) project.

7. REFERENCES
[1] M. Asadifard and J. Shanbezadeh. Automatic adaptive center of pupil detection using face detection and CDF analysis. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, page 3, 2010.
[2] R. Bates, H. Istance, M. Donegan, and L. Oosthuizen. Fly where you look: enhancing gaze based interaction in 3D environments. In Proc. COGAIN-05, pages 30-32, 2005.
[3] D. M. Calandra, A. Caso, F. Cutugno, A. Origlia, and S. Rossi. Cowme: a general framework to evaluate cognitive workload during multimodal interaction. In Proceedings of the 15th ACM International Conference on Multimodal Interaction, pages 111-118. ACM, 2013.
[4] D. M. Calandra, D. Di Mauro, D. D'Auria, and F. Cutugno. Eyecu: an emotional eye tracker for cultural heritage support. In Empowering Organizations, pages 161-172. Springer, 2016.
[5] A. T. Duchowski. Eye Tracking Methodology: Theory and Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[6] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 617-624. IEEE, 2011.
[7] E. S. Gómez and A. S. S. Sánchez. Biomedical instrumentation to analyze pupillary responses in white-chromatic stimulation and its influence on diagnosis and surgical evaluation. 2012.
[8] D. Gorodnichy. On importance of nose for face tracking. 2002.
[9] E. H. Hess and J. M. Polt. Pupil size as related to interest value of visual stimuli. Science, 132:349-350, Aug. 1960.
[10] M. J. Jones and J. M. Rehg. Statistical color models with application to skin detection. International Journal of Computer Vision, 46(1):81-96, 2002.
[11] D. Kahneman and J. Beatty. Pupil diameter and load on memory. Science, 154(3756):1583-1585, 1966.
[12] M. Kassner, W. Patera, and A. Bulling. Pupil: An open source platform for pervasive eye tracking and mobile gaze-based interaction. April 2014.
[13] T. T. Le, L. G. Farkas, R. C. Ngim, L. S. Levin, and C. R. Forrest. Proportionality in Asian and North American Caucasian faces using neoclassical facial canons as criteria. Aesthetic Plastic Surgery, 26(1):64-69, 2002.
[14] H. Li, D. Doermann, and O. Kia. Automatic text detection and tracking in digital video. Image Processing, IEEE Transactions on, 9(1):147-156, 2000.
[15] O. Lowenstein and I. E. Loewenfeld. The pupil. The Eye, 3:231-267, 1962.
[16] S. Milekic. The more you look the more you get: Intention-based interface using gaze-tracking. 2003.
[17] E. Murphy-Chutorian and M. M. Trivedi. Head pose estimation in computer vision: A survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(4):607-626, 2009.
[18] R. Netek. Implementation of RIA concept and eye tracking system for cultural heritage. Opgeroepen op september, 9:2012, 2011.
[19] K. Nickels and S. Hutchinson. Estimating uncertainty in SSD-based feature tracking. Image and Vision Computing, 20(1):47-58, 2002.
[20] A. Poole, L. J. Ball, and P. Phillips. In search of salience: A response-time and eye-movement analysis of bookmark recognition. In People and Computers XVIII: Design for Life, pages 363-378. Springer, 2005.
[21] L. R. Rabiner and B.-H. Juang. An introduction to hidden Markov models. ASSP Magazine, IEEE, 3(1):4-16, 1986.
[22] B. Shneiderman. Designing the User Interface. Pearson Education India, 2003.
[23] R. Valenti, N. Sebe, and T. Gevers. Combining head pose and eye location information for gaze estimation. IEEE Transactions on Image Processing, 21(2):802-815, 2012.
[24] P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), pages 511-518, 2001.
[25] C. VL and K. JA. Clinical methods: The history, physical, and laboratory examinations. JAMA, 264(21):2808-2809, 1990.
[26] D. Wigdor and D. Wixon. Brave NUI World: Designing Natural User Interfaces for Touch and Gesture. Elsevier, 2011.
[27] P. Wunsch and G. Hirzinger. Real-time visual tracking of 3D objects with dynamic handling of occlusion. In Robotics and Automation, 1997 IEEE International Conference on, volume 4, pages 2868-2873. IEEE, 1997.
[28] C. Yang, R. Duraiswami, and L. Davis. Efficient mean-shift tracking via a new similarity measure. In Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, volume 1, pages 176-183. IEEE, 2005.