Navigating Wall-sized Displays with the Gaze: a Proposal for Cultural Heritage

Davide Maria Calandra, Dario Di Mauro, Francesco Cutugno, Sergio Di Martino
Department of Electrical Engineering and Information Technology
University of Naples "Federico II"
80127, Naples, Italy
{davidemaria.calandra, dario.dimauro, cutugno, sergio.dimartino}@unina.it

ABSTRACT
New technologies for innovative interactive experiences represent a powerful medium to deliver cultural heritage content to a wider range of users. Among them, Natural User Interfaces (NUIs), i.e. non-intrusive technologies that require the user neither to wear devices nor to use external hardware (e.g. keys or trackballs), are considered a promising way to broaden the audience of specific cultural heritage domains, like the navigation of and interaction with digital artworks presented on wall-sized displays.

Starting from a collaboration with a world-famous Italian designer, we defined a NUI to explore 360° panoramic artworks presented on wall-sized displays, such as virtual reconstructions of ancient cultural sites or renderings of imaginary places. Specifically, we let the user "move the head" as a natural way to explore and navigate these large digital artworks. To this aim, we developed a system including a remote head pose estimator that captures the movements of users standing in front of the wall-sized display: starting from a central comfort zone, as users move their head in any direction, the virtual camera rotates accordingly. With NUIs, it is difficult to get feedback from users about their interest in the point of the artwork they are looking at. To solve this issue, we complemented the gaze estimator with a preliminary emotional analysis solution, able to implicitly infer the interest of the user in the shown content from his/her pupil size.
A sample of 150 subjects was invited to experience the proposed interface at an International Design Week. Preliminary results show that most of the subjects were able to properly interact with the system from the very first use, and that the emotional module is an interesting solution, even if further work must be devoted to addressing specific situations.

Categories and Subject Descriptors
H.5.2 [User Interfaces]: Interaction styles

1. INTRODUCTION
Wall-sized displays represent a viable and common way to present digital content on large projection surfaces. They are applied in many contexts, like advertisement, medical diagnosis, Business Intelligence, etc. In the Cultural Heritage field, too, this type of display is highly appreciated, since it is particularly suited to showing visitors artworks that are difficult or impossible to move, being a way to explore the digital counterpart of real/virtual environments. On the other hand, the problem with these displays is how to mediate the interaction with the user. Many solutions have been proposed, with different trade-offs among intrusiveness, calibration and achievable precision. Recently, some proposals have been developed that exploit the direction of the gaze of the visitor in front of the display as a medium to interact with the system. The simple assumption is that if the user looks towards an edge of the screen, he/she is interested in discovering more content in that direction, and the digital scenario should be updated accordingly. In this way, there is no need to wear a device, which makes it easier for a heterogeneous public to enjoy the digital content.

Detecting the gaze is nevertheless a challenging task, still with some open issues. To estimate the Point of Gaze (PoG), it is possible to exploit the eye movements, the head pose or both [23], and either to require special hardware to be worn (e.g. [12]) or to develop remote trackers (e.g. [6]). The latter are not able to provide high accuracy, but this is an acceptable compromise in many scenarios, like Cultural Heritage, where requiring special hardware of the visitors is usually not feasible.

For the Tianjin International Design Week 2015¹, we were asked to develop a set of technological solutions to improve the fruition of a 360° digital reconstruction, projected on a wall-sized display, of the "Camparitivo in Triennale"², a lounge bar (see Figure 1) located in Milan, Italy, designed by one of the most famous Italian designers, Matteo Ragni, to celebrate the Italian liqueur Campari. The requirements for the solution were to define a Natural User Interface (NUI) which constrains users neither to maintain a fixed distance from the display nor to wear an external device.

Figure 1: Matteo Ragni's "Camparitivo in Triennale"

To achieve our task, we designed a remote PoG estimator for wall-sized displays on which 360° virtual environments are rendered. A further novel element of the proposal is the exploitation of another implicit communication channel of the visitor, i.e. his/her attention towards the image represented on the display. To this aim, we remotely monitor pupil size variations, as they are significantly correlated with the arousal level of users while performing a task. This information can be useful in the first place to the artist, as pupils dilate when visitors are looking at pleasant images [9]. Moreover, logging the pupil dilation (mydriasis) during an interaction session can be a reliable source of information, useful also to analyze the usability level of the interface, since pupils dilate when users are required to perform difficult tasks, too [11][3].

In this paper we describe both the navigation with the remote PoG estimator and the solution for logging the mydriasis, together with a preliminary case study. More in detail, the rest of the paper is structured as follows: in Section 2, we explain the navigation paradigm for cultural content with the gaze, detailing the steps we performed to detect and track the PoG. In Section 3, we explain how mydriasis detection can be a useful strategy to investigate the emotional reactions of users enjoying cultural content, and we detail the steps we perform to get the pupil dilation. In Section 4, we present the case study, Matteo Ragni's Camparitivo in Triennale, showing how we allow visitors to navigate the digital rendering of the lounge bar on a wall-sized display, and reporting some preliminary usability results. Section 5 concludes the paper, also presenting future research directions.

¹ http://tianjindesignweek.com/
² http://www.matteoragni.com/project/camparitivo-in-triennale/
© 2016 Copyright for this paper by its authors. Copying permitted for private and academic purposes.
2. NAVIGATING WITH THE GAZE
Even if wearable eye trackers are becoming smaller and more comfortable, they still have an impact on the quality of a cultural visit. We believe that the user experience strongly depends on the capability of the user to establish a direct connection with the artworks, without the mediation of a device. For this reason, in order to allow the user to explore a 360° cultural heritage environment using only his/her point of gaze, we focused on developing a remote head pose estimator for wall-sized displays, which requires users neither to wear any external device nor to perform any prior calibration.

The contents that we aim to navigate are 360° virtual environments, expressed as a sequence of 360 frames with a step size of 1°. Thus, navigating the content to the left (right) means showing the previous (next) frame of the sequence. As we want visitors to feel the sensation of enjoying an authentic large environment, the wall-sized display represents the content with real proportions. If on one side this choice improves the quality of the fruition, because it reduces the gap between real and virtual environments, on the other side an entire façade of a building cannot realistically be represented in one frame. This requires additional complexity, since we also have to support a vertical scroll of the content, to show the parts of the frame that are not visible (see the sketch below).
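As an illustration of this content model, the following minimal sketch (in Python; the class name, the MAX_V_OFFSET parameter and the unit step granularity are our assumptions, not part of the deployed system) shows the circular 360-frame sequence and the clamped vertical scroll:

```python
# Minimal sketch of the panorama content model described above:
# 360 pre-rendered frames, one per degree, navigated circularly,
# plus a clamped vertical offset for the parts that do not fit.
MAX_V_OFFSET = 200  # pixels of vertical scroll allowed (assumption)

class Panorama:
    def __init__(self):
        self.frame_index = 0  # 0..359, one frame per degree
        self.v_offset = 0     # vertical scroll within the frame

    def rotate(self, step):
        # Navigating left/right shows the previous/next frame;
        # the modulo keeps the 360-frame sequence circular.
        self.frame_index = (self.frame_index + step) % 360

    def scroll(self, step):
        # Vertical scroll is clamped to the frame boundaries.
        self.v_offset = max(-MAX_V_OFFSET,
                            min(MAX_V_OFFSET, self.v_offset + step))
```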
More in detail, the development of NUIs to explore the content of wall-sized displays with the gaze requires two subtasks:

1. Defining techniques to estimate the PoG of the user while he/she is looking at the display, and
2. Defining a navigation logic associated to the PoG.

In the following, we provide technical details on how we faced these two tasks.

2.1 Point of gaze estimation
Head poses are usually computed by considering 3 degrees of freedom (DoF) [17], i.e. the rotations along the 3 axes of symmetry in space, x, y and z, shown in Figure 2.

Figure 2: Head movements.

Once the head pose in the space is known, the pupil center position can optionally refine the PoG estimation. For example, in the medical diagnosis scenario, to estimate the PoG, patients are usually not allowed to move their head [7], or they have to wear head-mounted cameras pointed towards their eyes [12]. In these cases, estimating the PoG means computing the pupil center position with respect to the ellipse formed by the eyelids, while the head position, when considered, is detected through IR sensors mounted on the head of the subjects. These systems grant an error threshold lower than 5 pixels [12], achievable thanks to strict constraints on the set-up, such as a fixed distance between eye and camera; on the other hand, they have a very high level of invasiveness for the users. In other scenarios, the PoG is estimated by means of remote trackers, such as the ones presented in [6], which determine the gaze direction from the head orientation. These systems do not limit users' movements and do not require them to wear any device.

In the cultural heritage context, gaze detection is mainly used for two tasks. The first one is related to artistic fruition: according to the "The More You Look The More You Get" paradigm [16], users focusing their gaze on a specific work of art, or on a part of it, may be interested in receiving additional content about that specific item. This usage of the gaze direction can be extremely useful in terms of improving the accessibility of cultural heritage information and enhancing the quality of the visit experience. The second task is related to understanding how people take decisions while visiting a museum: on which areas they focus and for how long; the outputs of gaze detectors are then gathered and analyzed [18].

Starting from an approach we already developed for small displays (between 50 x 30 cm and 180 x 75 cm) [4], we propose an extension for wall-sized ones, based on a combined exploitation of the head pose and the pupil size to explore digital environments. The general setting of the display is presented in Figure 3. In particular, the exhibition set-up includes a PC (the machine on which the software runs), a webcam W which acquires the input stream, and a projector P which beams the cultural content on the wall-sized display D. We assume the user to stand almost centrally with respect to D and with a frontal position of the head with respect to the body.

Figure 3: Gaze detection: experimental settings.

In the previous work with small displays [4], we used an eye-tracking technology to estimate the gaze, since we experienced that, for limited sizes, users just move their eyes in order to visually explore the surface of the artwork. On the other hand, in the case of wall-sized displays, users have to move their head as well, thus performing limited ocular movements. Therefore, a head pose estimator is needed. To this end, in accordance with related work [8], we developed a solution aimed at tracking the nose tip of the user in 3 Degrees of Freedom (DoF). Indeed, the nose tip is easy to detect and, since it can be considered a good approximation of the head centroid, given the precision required by our domain, it is a useful indicator of the head position in three-dimensional space.

2.1.1 Nose Tip detection
The first step in the processing pipeline is to detect, within the video stream from the webcam, the face of the user. According to the literature, this task can be executed with different strategies, which can be grouped into two main sets: the image-based ones, such as skin detection [10], and the feature-based ones. In our approach, the detection of the face is based on a solution from the second group, namely the Haar feature-based Viola-Jones algorithm [24]. In a first implementation, we scanned the entire image to locate the face; subsequently, this search was improved by providing as input the range of sizes for a valid face, depending on the distance between user and camera.

Within the area of the face, the nose tip search is also performed by means of the Viola-Jones algorithm, in terms of its OpenCV implementation, which returns the nasal area centered on its tip. Initially, we searched for the nose by scanning the entire face; then, we considered that the search could be improved by taking advantage of the facial geometric constraints [13], to increase both precision and computational efficiency. In particular, the nose can be easily found starting from the facial axis for the y coordinate, and from the middle point of the face for the x and z coordinates. We performed the search on images of size 1280 x 960 pixels, processed on an Intel Core i7 at 2.2 GHz; initially, the detection time was about 100 ms. The optimizations on the face and nose search allowed us to locate the face and the nose in 35 ms on average, reducing the computation time by about 65%.
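The following sketch illustrates a plausible shape of this two-stage detection with the OpenCV Python bindings. The paper specifies Viola-Jones [24], the size constraints and the geometric restriction of the nose search [13]; the nose cascade file, the concrete size ranges and the ROI proportions used here are illustrative assumptions:

```python
import cv2

# The frontal face cascade ships with OpenCV; a nose cascade
# (e.g. haarcascade_mcs_nose.xml) is assumed to be available locally.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
nose_cascade = cv2.CascadeClassifier("haarcascade_mcs_nose.xml")

def detect_nose_tip(gray):
    """Return the (x, y) nose tip in image coordinates, or None."""
    # Constrain the face size to values plausible for a user standing
    # about 1 m from the camera (illustrative numbers, not the paper's).
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5,
                                          minSize=(200, 200),
                                          maxSize=(600, 600))
    if len(faces) == 0:
        return None
    fx, fy, fw, fh = faces[0]
    # Facial geometric constraints [13]: search the nose only in the
    # central band of the face instead of scanning the whole face.
    roi = gray[fy + fh // 3: fy + 2 * fh // 3,
               fx + fw // 4: fx + 3 * fw // 4]
    noses = nose_cascade.detectMultiScale(roi, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(noses) == 0:
        return None
    nx, ny, nw, nh = noses[0]
    # The detector returns the nasal area centered on its tip.
    return (fx + fw // 4 + nx + nw // 2, fy + fh // 3 + ny + nh // 2)
```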
2.1.2 Nose Tip tracking
The previously described features are searched for either the first time a user is detected or when the tracking is lost. In all the other frames, the nose tip is simply tracked.

Several strategies have been proposed to track motion; they can be categorized into three groups: feature-based, model-based and optical flow-based. Generally speaking, the feature-based strategies involve the extraction of templates from a reference image and the identification of their counterparts in the subsequent images of the sequence. Some feature-based algorithms need to be trained, for example those based on Hidden Markov Models (HMM) [21] or Artificial Neural Networks (ANN) [14], while others are unsupervised, like for instance the Mean Shift Tracking algorithms [28]. Although the model-based strategies could be considered a specific branch of the feature-based ones, they require some a-priori knowledge about the investigated models [27]. The optical flow is the vector field which describes how the image changes over time; it can be computed with different strategies, for example from the image gradient.

In our approach, we adopted an unsupervised feature-based algorithm. Thus, we first store the image region containing the feature (i.e. the nose tip), to be used as a template. Then, we apply the OpenCV template matching method to find a match between the current frame and the template. The method scans the current frame, comparing the template pixels against the source frame, and stores each comparison result in the resulting matrix. The source frame is not scanned in its entirety: only a Region of Interest (ROI) is taken into account, corresponding to the area around the template coordinates in the source image. The resulting matrix is then analysed to find the best similarity value, depending on the matching criterion given as input. We used the Normalized Sum of Squared Differences (NSSD) as the matching criterion, whose formula is reported in Equation 1:

R(x,y) = \frac{\sum_{x',y'} \left( T(x',y') - I(x+x',\, y+y') \right)^2}{\sqrt{\sum_{x',y'} T(x',y')^2 \cdot \sum_{x',y'} I(x+x',\, y+y')^2}}    (1)

In Equation 1, T is the template image and I is the input frame in which we expect to find a match. The coordinates (x, y) represent the generic location in the input image whose content is being compared to the corresponding pixel of the template, located at (x', y'). R is the resulting matrix, and each location (x, y) in R contains the corresponding matching result. The minimum values in R represent the minimum differences between the input image and the template, indicating the most likely position of the feature in the image. Thus, while a perfect match will have a value of zero, a mismatch will have a larger sum of squared differences. When the mismatch value exceeds the confidence level [19], the tracking is lost.
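For illustration, the tracking step can be sketched as follows with OpenCV, whose TM_SQDIFF_NORMED matching criterion corresponds to Equation 1. The ROI margin and the numeric confidence level are illustrative assumptions; the paper only states that the ROI surrounds the previous template coordinates and that exceeding the confidence level [19] means losing the tracking:

```python
import cv2

LOST_THRESHOLD = 0.2  # assumed confidence level [19]; 0 = perfect match

def track_nose_tip(gray, template, last_xy, margin=40):
    """Return the new (x, y) of the template center, or None if lost."""
    th, tw = template.shape
    x, y = last_xy
    # Restrict the search to a ROI around the previous position,
    # clipped to the frame borders.
    x0, y0 = max(0, x - tw // 2 - margin), max(0, y - th // 2 - margin)
    x1 = min(gray.shape[1], x + tw // 2 + margin)
    y1 = min(gray.shape[0], y + th // 2 + margin)
    roi = gray[y0:y1, x0:x1]
    # TM_SQDIFF_NORMED is the normalized sum of squared differences
    # of Equation 1; the best match is the minimum of the result.
    result = cv2.matchTemplate(roi, template, cv2.TM_SQDIFF_NORMED)
    min_val, _, min_loc, _ = cv2.minMaxLoc(result)
    if min_val > LOST_THRESHOLD:
        return None  # tracking lost: fall back to detection
    return (x0 + min_loc[0] + tw // 2, y0 + min_loc[1] + th // 2)
```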
2.2 Projecting the Nose Tip for Navigation
Our second task is to associate an action to the gaze. To this aim, we have to understand where the user is looking on the wall-sized display. Since we can approximately interpret the nose tip as the centroid of the head, in order to provide a coherent PoG estimation we have to solve the proportion that transposes the nose tip coordinates into the display reference system. To this aim, we geometrically project its coordinates onto the reference system of the observed wall-sized display. These new coordinates are calculated and then tracked with respect to the shown frame. The area of the wall-sized display is considered as a 3x3 matrix, as shown in Figure 4. What we do in the current implementation is to indicate in which cell of the matrix the gaze falls.

Figure 4: Matrix Model of the Wall-Sized Display.

When the user stands in front of the display with the head centered in a frontal position, the geometric projection of his/her nose tip falls into cell #5 of the matrix (2nd row, 2nd column). We defined the size of the central cell so as to obtain a kind of "comfort zone", where minor movements of the head do not trigger any movement of the rendered image. In detail, head rotations up to 15 degrees on the x axis and up to 8 degrees on both the y and z axes do not affect the gaze position. With wider rotations, the projection of the nose falls into another cell, and the digital image is shifted accordingly.

According to the event-condition-action paradigm [22], the event is the identification of a fixation point; the condition is marked by the index of the cell in the 3x3 matrix, and the corresponding action is defined in Figure 5. In particular, as explained in Figure 5, when the PoG falls in cell #4 or #6, we associate the action of navigating the content to the left or to the right side, respectively. When the user observes section #2 or #8, the content is navigated upwards or downwards; section #5 is interpreted as the area in which no action is executed. When the PoG falls in the remaining cells, the content is navigated in the respective diagonal directions.

Figure 5: Input actions associated with the gaze directions.

In the current implementation, since we just associate a cell of the matrix to the PoG, the speed of the scroll is fixed and independent of the position of the PoG within a lateral cell of the matrix. We are currently implementing a new version of the navigation paradigm, where this 3x3 matrix will be replaced by a continuous function, and the speed of the scroll will be proportional to the distance of the PoG from the center of the display. A sketch of the current cell-based logic follows.
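The following minimal sketch puts together the cell mapping of Figure 4 and the actions of Figure 5, reusing the Panorama class sketched earlier in this section. The row-major cell numbering and the unit steps are our reading of the figures; the degree thresholds of the comfort zone (15° on x, 8° on y and z) are assumed to be applied before the geometric projection:

```python
def cell_of(px, py, width, height):
    """Map projected display coordinates to a cell index in 1..9
    (row-major, as in Figure 4; cell 5 is the comfort zone)."""
    col = min(2, int(3 * px / width))
    row = min(2, int(3 * py / height))
    return 3 * row + col + 1

def apply_action(cell, panorama):
    """Execute the navigation action of Figure 5 for the given cell."""
    # Horizontal component: cells 4/6 navigate left/right; the corner
    # cells combine it with the vertical scroll (diagonal directions).
    if cell in (1, 4, 7):
        panorama.rotate(-1)
    elif cell in (3, 6, 9):
        panorama.rotate(+1)
    # Vertical component: cells 2/8 scroll up/down.
    if cell in (1, 2, 3):
        panorama.scroll(-1)
    elif cell in (7, 8, 9):
        panorama.scroll(+1)
    # Cell 5 is the comfort zone: no action is executed.
```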
3. THE EMOTIONAL CONTRIBUTION
One of the problems with NUIs based on PoG estimation is that it is difficult to understand the reaction of the user in terms of interest towards the shown content [26]. To address this issue, we developed a further video processing module, intended as a complement to the system presented in the previous section and able to detect implicit user feedback. The output of this module can be used for a twofold objective: it can trigger real-time reactions from the system, and/or it can provide a powerful post-visit tool to the curator of the exhibition, with a log of the reactions of the visitors to the shown digital content. In this way, the curator can get a better insight into the content which sparks the highest interest in the visitors. In the following we provide some technical details on how we faced this issue.

3.1 The Mydriasis
A wide range of medical studies proved that the brain reacts to emotional arousal with involuntary actions performed by the sympathetic nervous system (e.g. [9][11]). These changes manifest themselves in a number of ways, such as an increased heart rate, a higher body temperature, muscular tension and pupil dilation (or mydriasis). Thus, it can be interesting to monitor one or more of these involuntary activities to discover the emotional reactions of the visitors while they are enjoying the cultural contents, in order to understand which details arouse pleasure.

In the age of wearable devices, there are many sensors with health-oriented capabilities, like for instance armbands or smartwatches, that can monitor some of these involuntary actions of our body. For instance, information about the heart rate or the body temperature can be obtained by means of sensors which retrieve electric signals once they are applied on the body. If on one side these techniques grant an effective level of reliability, on the other side they could influence the expected results of the experiments, as users tend to change their reactions when they feel under examination [11]. Moreover, they would require the visitors to wear some special device (which also entails high costs for the exhibition), which could be a non-viable solution in many contexts. For these reasons, we again looked for a remote solution, able to get an insight into the emotional arousal of the visitors without requiring them to wear any device.

Given the set-up described in Section 2.1, we tried to exploit the additional information we can get from the video stream collected by the webcam. In particular, we tried to remotely monitor the behaviour of the pupils during the interaction with the wall-sized display. Let us note that, as both pupils react to stimuli in the same way, we studied the behaviour of one pupil only.

Pupils are larger in children and smaller in adults, and the normal size varies from 2 to 4 mm in diameter in bright light, and from 4 to 8 mm in the dark [25]. Moreover, pupils react to stimuli in 0.2 s, with the response peaking in 0.5 to 1.0 s [15]. Hess presented 5 visual stimuli to male and female subjects and observed that the increase in pupil size varied between 5% and 25% [9].

3.2 Pupil detection
Before detecting the pupil, we have to locate and track the eye in the video stream coming from the webcam. The detection is performed by means of the Haar feature-based Viola-Jones algorithm [24], already cited in Section 2.1.1, while the tracking of the pupil is done with the template matching technique, as described in Section 2.1.2.

The detected ocular region contains eyelids, eyelashes, shadows and light reflexes. These represent noise for pupil detection, as they could interfere with the correctness of the results. Thus, the eye image has to be pre-processed before searching for the pupil size. We developed a solution including the following steps to perform the pre-processing:

1. The gray-scaled image (Figure 6a) is blurred by means of a median filter, in order to highlight well-defined contours;
2. The Sobel partial derivative on the x axis reveals the significant changes in intensity, allowing us to isolate the eyelids;
3. A threshold filter identifies the sclera.

As a result, these steps produce a mask which allows us to isolate the eyeball from the source image (a sketch of this masking step follows).
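A possible rendering of the three pre-processing steps with OpenCV is sketched below. The kernel sizes, the threshold values and the way the partial results are combined into a single mask are illustrative assumptions, as the paper does not report them:

```python
import cv2

def eye_mask(eye_gray):
    """Build a mask that isolates the eyeball in the ocular region."""
    # Step 1: median filter to smooth noise while keeping contours.
    blurred = cv2.medianBlur(eye_gray, 5)
    # Step 2: Sobel partial derivative on x to reveal the eyelid edges.
    sobel_x = cv2.Sobel(blurred, cv2.CV_8U, 1, 0, ksize=3)
    _, edges = cv2.threshold(sobel_x, 60, 255, cv2.THRESH_BINARY)
    # Step 3: a plain threshold picks out the bright sclera.
    _, sclera = cv2.threshold(blurred, 160, 255, cv2.THRESH_BINARY)
    # Combined mask: everything that is neither eyelid edge nor sclera.
    return cv2.bitwise_not(cv2.bitwise_or(edges, sclera))
```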
Pupil detection is then performed on the source image as follows:

1. We drop down to zero (black) all the pixels having a cumulative distribution function value greater than a certain threshold [1] (Figure 6b);
2. We morphologically transform the resulting binary image by means of a dilation process, to remove the light reflexes on the pupil;
3. A contour detection operation identifies some neighbourhoods (Figure 6c);
4. The pupillary area is found by selecting the region having maximum area (Figure 6d);
5. The center of the ellipse (Figure 6e) best fitting the pupillary area approximates the pupil center (Figure 6f).

Figure 6: Pupil processing steps.

Once we have detected the pupil, to calculate the mydriasis we store the first computed radius and, frame by frame, we compare the radii calculated during the following iterations against the first one: according to Hess, when the increase exceeds 5%, a mydriasis is signaled.
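The five detection steps and the 5% dilation criterion can be sketched as follows; the fixed CDF threshold is an illustrative value (whereas [1] derives it adaptively), and the radius is approximated from the fitted ellipse axes:

```python
import cv2
import numpy as np

def detect_pupil(eye_gray, mask, cdf_thresh=0.05):
    """Return (center, radius) of the pupil, or None if not found."""
    # Step 1: keep only the darkest pixels, i.e. those whose cumulative
    # distribution function value is below the threshold [1].
    pixels = eye_gray[mask > 0]
    hist = np.bincount(pixels.ravel(), minlength=256)
    cdf = np.cumsum(hist) / max(1, hist.sum())
    level = int(np.searchsorted(cdf, cdf_thresh))
    binary = np.where((eye_gray <= level) & (mask > 0),
                      255, 0).astype(np.uint8)
    # Step 2: dilation removes the light reflexes inside the pupil blob.
    binary = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=2)
    # Steps 3-4: detect the candidate contours and keep the largest.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x
    contours = [c for c in contours if len(c) >= 5]  # fitEllipse needs 5
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    # Step 5: the center of the best-fitting ellipse approximates the
    # pupil center; a quarter of the summed axes approximates the radius.
    (cx, cy), (ax1, ax2), _ = cv2.fitEllipse(largest)
    return (cx, cy), (ax1 + ax2) / 4.0

def is_mydriatic(radius, baseline_radius):
    # According to Hess [9], a dilation above 5% of the first measured
    # radius is signaled as a mydriasis.
    return (radius - baseline_radius) / baseline_radius > 0.05
```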
To log all these implicit feedbacks, during the interaction a parallel thread keeps track of the observed sections and the related emotional reactions. In particular, at fixed steps of 200 ms, the thread saves the current timestamp, the index of the observed section and an integer value representing the pupil status. If the pupil has a normal size, the pupil status is 0, otherwise it is 1. If the system does not detect a face for a given time (10 seconds, specifically), the interaction session is considered terminated and the collected information is stored in an XML document. The structure of the XML document is shown in Listing 1.

Listing 1: A snippet of the logging file
<reportCollection>
  <report id="0">
    <track idTs="1402674690300" section="1" mydriasis="0" />
    <track idTs="1402674690500" section="1" mydriasis="0" />
    <track idTs="1402674690700" section="1" mydriasis="0" />
    <track idTs="1402674690900" section="1" mydriasis="0" />
    <track idTs="1402674691100" section="1" mydriasis="0" />
  </report>
  <report id="1">
    <track idTs="1402675341320" section="1" mydriasis="0" />
    <track idTs="1402675341520" section="0" mydriasis="0" />
    <track idTs="1402675341720" section="0" mydriasis="0" />
    <track idTs="1402675341920" section="0" mydriasis="0" />
  </report>
</reportCollection>

The XML document is created and initialized with an empty <reportCollection> when the application starts; then, when each interaction session ends, a new <report> subtree is created. The timestamp values univocally identify the respective <track> elements. Given this simple structure, it is easy to perform subsequent analyses of the interaction sessions of the visitors.
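A minimal sketch of a logger producing the structure of Listing 1, using only the Python standard library; the class name, the method names and the output file name are our assumptions:

```python
import time
import xml.etree.ElementTree as ET

class InteractionLogger:
    def __init__(self):
        # The document starts as an empty <reportCollection>.
        self.root = ET.Element("reportCollection")
        self.report = None
        self.next_id = 0

    def track(self, section, mydriasis):
        """Called every 200 ms while a face is detected."""
        if self.report is None:  # a new interaction session begins
            self.report = ET.SubElement(self.root, "report",
                                        id=str(self.next_id))
            self.next_id += 1
        ET.SubElement(self.report, "track",
                      idTs=str(int(time.time() * 1000)),  # ms timestamp
                      section=str(section),
                      mydriasis=str(int(mydriasis)))

    def end_session(self, path="log.xml"):
        """Called when no face has been detected for 10 seconds."""
        self.report = None
        ET.ElementTree(self.root).write(path, encoding="utf-8",
                                        xml_declaration=True)
```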
4. THE CASE STUDY
The system we developed was shown at the Tianjin International Design Week 2015, at the personal exposition dedicated to the Italian designer Matteo Ragni. In particular, the software was used to let the visitors navigate with the gaze the 360° virtual reconstruction of Matteo Ragni's Camparitivo in Triennale, on a wall-sized display. In order to implement the case study, we started from the design model of the Camparitivo in Triennale, in Rhino3D format³, including the textures obtained from photos, and we placed a virtual camera into the center of the model, to have the point of view of a visitor inside the Camparitivo. With these settings, we rendered a complete rotation of the camera around a fixed vertical axis, corresponding to the imaginary neck of the visitor, in order to obtain photorealistic, ray-traced reflections on the mirrors. With this setup, we obtained 360 images with a step size of 1 degree. An illustrative frame is shown in Figure 7. We considered each frame as divided according to the matrix in Figure 4. Once the system indicated the observed section of the matrix, the respective action of Figure 5 was executed and the related frame was shown.

³ www.rhino3d.com

Figure 7: A frame of the rendered model.

4.1 The experiments
Basically, motor tasks such as "look over there" are performed in video games by hand-controlled operations, because they are usually executed with classical input devices such as joysticks, joypads, keyboards or mice. Our work represents an attempt to improve the naturalness of this kind of interaction, by associating the task with its implicitly corresponding interface. We left the users free to interact with the application, without giving them any kind of instruction or support. The only source of information for them was the panel shown in Figure 8, explaining that the input was given by the head movements and not by the eyes.

During the exposition, more than 150 visitors experienced our stand, standing at 1 meter from a webcam mounted at 160 cm of height, as shown in Figure 8. Among all the visitors, 51 English-speaking ones agreed to answer a quick oral interview, as we could not submit written questionnaires during the public event for logistic reasons.

Figure 8: The experimental setting.

After we asked users whether it was the first time they experienced a gaze-based application, we submitted the following questions to them:

1. Do you think this kind of application is useful to improve the museum fruition?
2. Did you find the application easy to understand?
3. Did you find any difficulties during the interaction?
4. How old are you?

Participating subjects were grouped into three subsets according to their age, where all the subsets have the same number of subjects. Group A has users whose age is between 18 and 35 years; group B corresponds to people from 36 to 65 years old; group C is composed of users older than 65 years. We did not make a distinction between male and female subjects. For all of them, it was the first time they tried a gaze-based IT solution.

4.2 Results
The results of this very preliminary evaluation of the proposal are reported in Figure 9, where the histograms represent the percentage of positive answers given by the subjects over the total of answers. Please note that for Q1 and Q2 the higher the result, the better the feedback, while for Q3 the lower, the better.

Figure 9: Cumulative Results of the Interviews

Interpreting the comments of the users, as for Q1, we see that the vast majority of the subjects believe the proposed interface is useful to improve the cultural experience. People older than 65 are less enthusiastic, but this is somewhat an expected result. As for Q2, an even higher percentage of subjects found the application easy to understand, with less difference among the three groups. Finally, as for Q3, we found that some of the subjects encountered difficulties in interacting with the software, with a significant difference for group C with respect to the other two groups. In general, problems arose when visitors performed rapid or wide head movements. In both cases, this led to a failure of the nose tip tracker. In particular, when the users performed wide rotations, the template matching results exceeded the confidence level, causing the loss of the tracking. Similarly, rapid head movements caused a sudden reduction of the similarity between frame and template, causing the tracker to fail.

An objective survey of the user experience was conducted by analyzing the collected log data. In particular, we used the stored timestamps and the indexes of the observed Regions Of Interest to determine the duration of each interaction and the regions on which users concentrated their gaze. The data showed that 45% of the users performed a complete interaction, observing all 9 ROIs. According to the matrix in Figure 4, the most observed ROI was #4, observed by 88% of the users. The average duration of an interaction was 95 seconds per user.
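For illustration, statistics like the ones above can be derived from the log of Listing 1 with a few lines of code; this sketch assumes the file name used in the logger sketch of Section 3:

```python
import xml.etree.ElementTree as ET

def analyse(path="log.xml"):
    """Per-session duration and ROI coverage from the XML log."""
    root = ET.parse(path).getroot()
    complete, durations = 0, []
    for report in root.iter("report"):
        tracks = list(report.iter("track"))
        if not tracks:
            continue
        # Session duration from the first and last timestamps (ms).
        ts = [int(t.get("idTs")) for t in tracks]
        durations.append((max(ts) - min(ts)) / 1000.0)
        # A complete interaction touches all 9 ROIs of the matrix.
        if len({t.get("section") for t in tracks}) == 9:
            complete += 1
    n = max(1, len(durations))
    print("complete interactions: %.0f%%" % (100.0 * complete / n))
    print("average duration: %.1f s" % (sum(durations) / n))
```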
All in all, we can see from this very preliminary investigation that visitors largely enjoyed the experience with the gaze-based interaction.

As for the mydriatic reactions of the users, the picture is more problematic. We analyzed the logs of the exhibition, and we found that mydriatic reactions occurred in:

- 65% of cases for group A;
- 40% of cases for group B;
- 20% of cases for group C.

There are two considerations to draw from these numbers. The first is that, in general, the technological solution is not mature enough for a wide public. This is particularly true for Asian people, as all of the subjects had dark eyes, which makes the identification of the pupil more problematic. Some internal investigations we did with Caucasian subjects led to better results. The other conclusion is that there is a well-known difference in the mydriatic reactions with respect to the age of the subjects: the older they are, the smaller the differences in pupil size between the relaxed and the aroused states. So, it is clear that the emotional module requires further research efforts.

5. CONCLUSIONS
Wall-sized displays represent a viable solution to present artworks that are difficult or impossible to move. In this paper, we proposed a Natural User Interface to explore 360° digital artworks shown on wall-sized displays, which allows visitors to look around and explore virtual worlds using only their gaze, stepping away from the boundaries and limitations of the keyboard and mouse. We chose to accomplish the task by means of a remote head pose detector. As it does not require calibration, it represents an immediately usable solution for supporting digital environment navigation. Moreover, we developed a solution to monitor the mydriatic reactions of the subjects while they were using the system, to get implicit feedback on their interest in the represented digital content. A preliminary investigation we performed at the Tianjin International Design Week 2015 with 51 subjects gave us the feedback that gaze-based navigation can be well accepted by visitors, as it is felt as a way to improve the fruition of Cultural Heritage. Nevertheless, the monitoring of mydriatic reactions should still be improved, especially for people with dark eyes.

Anyhow, from the results we collected, there are still many potential research directions for this topic. First of all, we are currently developing a new version of the system where the display is no longer divided into a matrix; instead, there will be a smooth feedback from the system, whose rapidity of response will be proportional to the amount of movement performed by the head of the user. The second main research field is to extend this approach towards freely explorable 3D environments, so as to also support forward and backward navigation. The idea of enriching gaze with forward and backward navigation has been approached in different works. One solution is fly-where-you-look [2], in which the authors associate the interest of users in flying towards an area with the action of looking at it. This approach finds its basis in cognitive activities: in particular, some studies prove that more fixations on a particular area indicate that it is more noticeable, or more important, to the viewer than other areas [20]; Duchowski [5] estimates a mean fixation duration of 1079 ms. This approach represents a natural and simple solution to the task "look forward", but the activation time forces the user to wait for the operation to start, without doing anything, and it may feel like a waste of time. Finally, voice commands could also be a natural input to perform this task; thus, our current research direction is oriented towards providing better support for multimodal interaction.

6. ACKNOWLEDGEMENT
This work has been partly supported by the European Community and by the Italian Ministry of University and Research (MIUR) under the PON Or.C.He.S.T.R.A. (ORganization of Cultural HEritage and Smart Tourism and Real-time Accessibility) project.

7. REFERENCES
[1] M. Asadifard and J. Shanbezadeh. Automatic adaptive center of pupil detection using face detection and CDF analysis. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, page 3, 2010.
[2] R. Bates, H. Istance, M. Donegan, and L. Oosthuizen. Fly where you look: enhancing gaze based interaction in 3D environments. In Proc. COGAIN-05, pages 30-32, 2005.
[3] D. M. Calandra, A. Caso, F. Cutugno, A. Origlia, and S. Rossi. Cowme: a general framework to evaluate cognitive workload during multimodal interaction. In Proceedings of the 15th ACM International Conference on Multimodal Interaction, pages 111-118. ACM, 2013.
[4] D. M. Calandra, D. Di Mauro, D. D'Auria, and F. Cutugno. Eyecu: an emotional eye tracker for cultural heritage support. In Empowering Organizations, pages 161-172. Springer, 2016.
[5] A. T. Duchowski. Eye Tracking Methodology: Theory and Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[6] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 617-624. IEEE, 2011.
[7] E. S. Gómez and A. S. S. Sánchez. Biomedical instrumentation to analyze pupillary responses in white-chromatic stimulation and its influence on diagnosis and surgical evaluation. 2012.
[8] D. Gorodnichy. On importance of nose for face tracking. 2002.
[9] E. H. Hess and J. M. Polt. Pupil size as related to interest value of visual stimuli. Science, 132:349-350, Aug. 1960.
[10] M. J. Jones and J. M. Rehg. Statistical color models with application to skin detection. International Journal of Computer Vision, 46(1):81-96, 2002.
[11] D. Kahneman and J. Beatty. Pupil diameter and load on memory. Science, 154(3756):1583-1585, 1966.
[12] M. Kassner, W. Patera, and A. Bulling. Pupil: An open source platform for pervasive eye tracking and mobile gaze-based interaction. April 2014.
[13] T. T. Le, L. G. Farkas, R. C. Ngim, L. S. Levin, and C. R. Forrest. Proportionality in Asian and North American Caucasian faces using neoclassical facial canons as criteria. Aesthetic Plastic Surgery, 26(1):64-69, 2002.
[14] H. Li, D. Doermann, and O. Kia. Automatic text detection and tracking in digital video. Image Processing, IEEE Transactions on, 9(1):147-156, 2000.
[15] O. Lowenstein and I. E. Loewenfeld. The pupil. The Eye, 3:231-267, 1962.
[16] S. Milekic. The more you look the more you get: Intention-based interface using gaze-tracking. 2003.
[17] E. Murphy-Chutorian and M. M. Trivedi. Head pose estimation in computer vision: A survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(4):607-626, 2009.
[18] R. Netek. Implementation of RIA concept and eye tracking system for cultural heritage. Opgeroepen op september, 9:2012, 2011.
[19] K. Nickels and S. Hutchinson. Estimating uncertainty in SSD-based feature tracking. Image and Vision Computing, 20(1):47-58, 2002.
[20] A. Poole, L. J. Ball, and P. Phillips. In search of salience: A response-time and eye-movement analysis of bookmark recognition. In People and Computers XVIII: Design for Life, pages 363-378. Springer, 2005.
[21] L. R. Rabiner and B.-H. Juang. An introduction to hidden Markov models. ASSP Magazine, IEEE, 3(1):4-16, 1986.
[22] B. Shneiderman. Designing the User Interface. Pearson Education India, 2003.
[23] R. Valenti, N. Sebe, and T. Gevers. Combining head pose and eye location information for gaze estimation. IEEE Transactions on Image Processing, 21(2):802-815, 2012.
[24] P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), pages 511-518, 2001.
[25] C. VL and K. JA. Clinical methods: The history, physical, and laboratory examinations. JAMA, 264(21):2808-2809, 1990.
[26] D. Wigdor and D. Wixon. Brave NUI World: Designing Natural User Interfaces for Touch and Gesture. Elsevier, 2011.
[27] P. Wunsch and G. Hirzinger. Real-time visual tracking of 3D objects with dynamic handling of occlusion. In Robotics and Automation, 1997 IEEE International Conference on, volume 4, pages 2868-2873. IEEE, 1997.
[28] C. Yang, R. Duraiswami, and L. Davis. Efficient mean-shift tracking via a new similarity measure. In Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, volume 1, pages 176-183. IEEE, 2005.