=Paper=
{{Paper
|id=Vol-1621/paper7
|storemode=property
|title= Navigating Wall-sized Displays with the Gaze: a Proposal for Cultural Heritage
|pdfUrl=https://ceur-ws.org/Vol-1621/paper7.pdf
|volume=Vol-1621
|authors=Davide Maria Calandra,Dario Di Mauro,Franco Cutugno,Sergio Di Martino
|dblpUrl=https://dblp.org/rec/conf/avi/CalandraMCM16
}}
== Navigating Wall-sized Displays with the Gaze: a Proposal for Cultural Heritage ==
Navigating Wall-sized Displays with the Gaze: a Proposal for Cultural Heritage

Davide Maria Calandra, Dario Di Mauro, Francesco Cutugno, Sergio Di Martino
Department of Electrical Engineering and Information Technology, University of Naples "Federico II", 80127, Naples, Italy
{davidemaria.calandra, dario.dimauro, cutugno, sergio.dimartino}@unina.it

ABSTRACT

New technologies for innovative interactive experiences represent a powerful medium to deliver cultural heritage content to a wider range of users. Among them, Natural User Interfaces (NUI), i.e. non-intrusive technologies that do not require the user to wear devices or to use external hardware (e.g. keys or trackballs), are considered a promising way to broaden the audience of specific cultural heritage domains, like the navigation of and interaction with digital artworks presented on wall-sized displays.

Starting from a collaboration with a world-famous Italian designer, we defined a NUI to explore 360° panoramic artworks presented on wall-sized displays, such as virtual reconstructions of ancient cultural sites or renderings of imaginary places. Specifically, we let the user "move the head" as a natural way to explore and navigate these large digital artworks. To this aim, we developed a system including a remote head pose estimator that captures the movements of users standing in front of the wall-sized display: starting from a central comfort zone, as users move their head in any direction, the virtual camera rotates accordingly. With NUIs, it is difficult to get feedback from users about their interest in the point of the artwork they are looking at. To address this issue, we complemented the gaze estimator with a preliminary emotional analysis solution, able to implicitly infer the interest of the user in the shown content from his/her pupil size.

A sample of 150 subjects was invited to experience the proposed interface at an International Design Week. Preliminary results show that most of the subjects were able to properly interact with the system from the very first use, and that the emotional module is an interesting solution, even if further work must be devoted to addressing specific situations.

Categories and Subject Descriptors

H.5.2 [User Interfaces]: Interaction styles

Copyright 2016 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION

Wall-sized displays represent a viable and common way to present digital content on large projection surfaces. They are applied in many contexts, like advertisement, medical diagnosis, Business Intelligence, etc. Also in the Cultural Heritage field, this type of display is highly appreciated, since it is particularly suited to show visitors artworks that are difficult or impossible to move, being a way to explore the digital counterpart of real or virtual environments. On the other hand, the problem with these displays is how to mediate the interaction with the user. Many solutions have been proposed, with different trade-offs among intrusiveness, calibration and achievable precision. Recently, some proposals have been developed that exploit the direction of the gaze of the visitor in front of the display as a medium to interact with the system. The simple assumption is that if the user looks towards an edge of the screen, he/she is interested in discovering more content in that direction, and the digital scenario should be updated accordingly. In this way, there is no need to wear a device, making it easier for a heterogeneous public to enjoy the digital content.

Detecting the gaze is anyhow a challenging task, still with some open issues. To estimate the Point of Gaze (PoG), it is possible to exploit the eye movements, the head pose or both [23], and either to require special hardware to wear (e.g. [12]) or to develop remote trackers (e.g. [6]). The latter are not able to provide a high accuracy, but this is an acceptable compromise in many scenarios, like Cultural Heritage, where the use of special hardware by the visitors is usually difficult.

For the Tianjin International Design Week 2015 (http://tianjindesignweek.com/), we were asked to develop a set of technological solutions to improve the fruition of a 360° digital reconstruction, projected on a wall-sized display, of the "Camparitivo in Triennale" (http://www.matteoragni.com/project/camparitivo-in-triennale/), a lounge bar (see Figure 1) located in Milan, Italy, designed by one of the most famous Italian designers, Matteo Ragni, to celebrate the Italian liqueur Campari. The requirements for the solution were to define a Natural User Interface (NUI) which does not constrain users to maintain a fixed distance from the display, nor to wear an external device.

Figure 1: Matteo Ragni's "Camparitivo in Triennale"

To achieve our task, we designed a remote PoG estimator for wall-sized displays where 360° virtual environments are rendered. A further novel element of the proposal is the exploitation of another implicit communication channel of the visitor, i.e. his/her attention towards the image represented on the display. To this aim, we remotely monitor pupil size variations, as they are significantly correlated with the arousal level of users while performing a task. This information can first of all be useful to the artist, as pupils dilate when visitors are looking at pleasant images [9]. Moreover, logging the pupil dilation (mydriasis) during an interaction session can be a reliable source of information, useful also to analyze the usability level of the interface, since pupils dilate when users are required to perform difficult tasks, too [11] [3].

In this paper we describe both the navigation with the remote PoG estimator and the solution for logging the mydriasis, together with a preliminary case study. More in detail, the rest of the paper is structured as follows: in Section 2, we explain the navigation paradigm for cultural content with the gaze, detailing the steps we performed to detect and track the PoG. In Section 3, we explain how mydriasis detection can be a useful strategy to investigate the emotional reactions of users enjoying cultural content, and we detail the steps we take to obtain the pupil dilation. In Section 4, we present the case study, Matteo Ragni's Camparitivo in Triennale, showing how we allow visitors to navigate the digital rendering of the lounge bar on a wall-sized display, and reporting some preliminary usability results. Section 5 concludes the paper, also presenting future research directions.
2. NAVIGATING WITH THE GAZE

Even if wearable eye trackers are becoming smaller and more comfortable, they still have an impact on the quality of a cultural visit. We believe that the user experience strongly depends on the capability of the user to establish a direct connection with the artworks, without the mediation of a device. For this reason, in order to allow the user to explore a 360° cultural heritage environment using only his/her point of gaze, we focused on developing a remote head pose estimator for wall-sized displays, which does not require users to wear any external device or to perform any prior calibration.

The contents that we aim to navigate are 360° virtual environments, expressed as a sequence of 360 frames whose step size is 1°. Thus, navigating the content to the left (right) means showing the previous (next) frame of the sequence. As we want visitors to feel the sensation of enjoying an authentic large environment, the wall-sized display is used to represent the content with real proportions. If on one side this choice improves the quality of the fruition, because it reduces the gap between real and virtual environments, on the other hand representing an entire façade of a building in one frame is not realistic. Thus, it requires additional complexity, since we also have to define support for a vertical scroll of the content, to show the parts of the frame that are not visible (a minimal sketch of this frame-stepping logic is given below).
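To make the content model above concrete, the following sketch (ours, not taken from the paper) keeps a circular index over the 360 one-degree frames and a clamped vertical offset for the scroll; class, function and constant names, as well as the vertical bound, are illustrative assumptions.

<pre>
# Minimal sketch of the navigation model described above: 360 frames, one per
# degree, stepped circularly left/right, plus a clamped vertical scroll offset.
# Names and bounds are illustrative, not taken from the paper.

NUM_FRAMES = 360            # one frame per degree of the panoramic environment
MAX_VERTICAL_OFFSET = 500   # vertical scroll range in pixels (placeholder value)

class PanoramaState:
    def __init__(self):
        self.frame = 0       # current 1-degree frame index
        self.v_offset = 0    # current vertical scroll, in pixels

    def step_horizontal(self, direction):
        """direction = -1 shows the previous frame (left), +1 the next one (right)."""
        self.frame = (self.frame + direction) % NUM_FRAMES

    def step_vertical(self, delta_px):
        """Scroll the visible window up or down, clamped to the frame height."""
        self.v_offset = max(0, min(MAX_VERTICAL_OFFSET, self.v_offset + delta_px))
</pre>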
More in detail, the development of NUIs to explore the content of wall-sized displays with the gaze requires two subtasks:

1. Defining techniques to estimate the PoG of the user while he/she is looking at the display, and
2. Defining a navigation logic associated to the PoG.

In the following, we provide technical details on how we faced these two tasks.

2.1 Point of gaze estimation

Head poses are usually computed by considering 3 degrees of freedom (DoF) [17], i.e. the rotations around the 3 axes of symmetry in space, x, y, and z, shown in Figure 2.

Figure 2: Head movements.

Once the head pose in space is known, the pupil center position can optionally refine the PoG estimation. For example, in the medical diagnosis scenario, to estimate the PoG, patients are usually not allowed to move their head [7] or they have to wear head-mounted cameras pointed towards their eyes [12]. In these cases, estimating the PoG means computing the pupil center position with respect to the ellipse formed by the eyelids, while the head position, when considered, is detected through IR sensors mounted on the head of the subjects. These systems grant an error threshold lower than 5 pixels [12], achievable thanks to strict constraints on the set-up, such as the fixed distance between eye and camera; on the other hand, they have a very high level of invasiveness for the users. In other scenarios, the PoG is estimated by means of remote trackers, such as the ones presented in [6], which determine the gaze direction from the head orientation. These systems do not limit users' movements and do not require them to wear any device.

In the cultural heritage context, gaze detection is mainly used for two tasks. The first one is related to the artistic fruition: according to "The More You Look The More You Get" paradigm [16], users focusing their gaze on a specific work of art, or part of it, can be interested in receiving some additional content about that specific item. This usage of the gaze direction can be extremely useful in terms of improving the accessibility of cultural heritage information and enhancing the quality of the visit experience. The second task is related to understanding how people take decisions while visiting a museum: which areas they focus on and for how long; outputs from gaze detectors are then gathered and analyzed [18].

Starting from an approach we already developed for small displays (between 50 x 30 cm and 180 x 75 cm) [4], we propose an extension for wall-sized ones, based on a combined exploitation of the head pose and pupil size to explore digital environments. The general setting of the display is presented in Figure 3. In particular, the exhibition set-up includes a PC (the machine on which the software runs), a webcam W which acquires the input stream, and a projector P which beams the cultural content on the wall-sized display D. We assume the user to stand almost centrally with respect to D and with a frontal position of the head with respect to the body.

Figure 3: Gaze detection: experimental settings.

In the previous work with small displays [4], we used an eye-tracking technology to estimate the gaze, since we experienced that, for limited sizes, users just move their eyes in order to visually explore the surface of the artwork. On the other hand, in the case of wall-sized displays, users also have to move their head, thus performing limited ocular movements. Therefore, a head pose estimator is needed. To this end, in line with related work [8], we developed a solution aimed at tracking the nose tip of the user in 3 Degrees of Freedom (DoF). Indeed, the nose tip is easy to detect and, since it can be considered a good approximation of the head centroid, given the precision required by our domain, it is a useful indicator of the head position in the three-dimensional space.

2.1.1 Nose Tip detection

The first step in the processing pipeline is to detect, within the video stream from the webcam, the face of the user. According to the literature, this task can be executed with different strategies, which can be grouped in two main sets: image-based, such as skin detection [10], and feature-based. In our approach, the detection of the face is based on a solution from the second group, namely the Haar feature-based Viola-Jones algorithm [24]. In a first implementation, we scanned the entire image to locate the face; subsequently this search was improved by providing as input the range of sizes for a valid face, depending on the distance between user and camera.

Within the area of the face, the nose tip search is also performed by means of the Viola-Jones algorithm, in terms of its OpenCV implementation, which returns the nasal area centered on the tip. Initially, we searched for the nose by scanning the entire face; then, we considered that the search could be improved by taking advantage of the facial geometric constraints [13], to increase both precision and computational efficiency. In particular, the nose can easily be found starting from the facial axis for the y axis, and from the middle point of the face for both the x and z axes. We performed the search on images of size 1280 x 960 pixels, processed on an Intel Core i7 at 2.2 GHz; initially, the detection time was about 100 ms. The optimizations on the face and nose search allowed us to locate the face and the nose in 35 ms on average, reducing the computation time by about 65%.
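A minimal sketch of this detection step is given below, assuming OpenCV's Python bindings and stock Haar cascades (a frontal face model and a nose model such as haarcascade_mcs_nose.xml); the cascade files, size bounds and geometric ROI are illustrative choices, not the authors' exact configuration.

<pre>
import cv2

# Hypothetical cascade files; the paper does not name the exact models used.
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
nose_cascade = cv2.CascadeClassifier("haarcascade_mcs_nose.xml")

def detect_nose_tip(frame_gray):
    """Return the (x, y) image coordinates of the nose tip, or None if not found."""
    # Constrain the admissible face size to the expected user-camera distance
    # (Section 2.1.1); the bounds below are placeholders.
    faces = face_cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5,
                                          minSize=(120, 120), maxSize=(400, 400))
    if len(faces) == 0:
        return None
    fx, fy, fw, fh = faces[0]
    # Facial geometric constraints: search the nose only in the central part of the face.
    x0, y0 = fx + fw // 4, fy + fh // 3
    roi = frame_gray[y0: fy + 2 * fh // 3, x0: fx + 3 * fw // 4]
    noses = nose_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
    if len(noses) == 0:
        return None
    nx, ny, nw, nh = noses[0]
    # The detector returns the nasal area centred on the tip.
    return (x0 + nx + nw // 2, y0 + ny + nh // 2)
</pre>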
2.1.2 Nose Tip tracking

The previously described features are searched for either the first time a user is detected or when the tracking is lost. In all the other frames, the nose tip is simply tracked.

Several strategies have been proposed to track motion; they can be categorized into three groups: feature-based, model-based and optical flow-based. Generally speaking, the feature-based strategies involve the extraction of templates from a reference image and the identification of their counterparts in the following images of the sequence. Some feature-based algorithms need to be trained, for example those based on Hidden Markov Models (HMM) [21] or Artificial Neural Networks (ANN) [14], while others are unsupervised, like for instance the Mean Shift tracking algorithms [28]. Although the model-based strategies could be considered a specific branch of the feature-based ones, they require some a priori knowledge about the investigated models [27]. The optical flow is the vector field which describes how the image changes over time; it can be computed with different strategies, such as the gradient.

In our approach, we adopted an unsupervised feature-based algorithm. Thus, we first store the image region containing the feature (i.e. the nose tip), to be used as a template. Then, we apply the OpenCV method to find a match between the current frame and the template. The method scans the current frame, comparing the template image pixels against the source frame, and stores each comparison result in the resulting matrix. The source frame is not scanned in its entirety, but only a Region of Interest (ROI) is taken into account; the ROI corresponds to the area around the template coordinates in the source image. The resulting matrix is then analysed to find the best similarity value, depending on the matching criterion given as input. We used the Normalized Sum of Squared Differences (NSSD) as matching criterion, whose formula is reported in equation 1.

<math>R(x,y) = \frac{\sum_{x',y'} \left( T(x',y') - I(x+x',\, y+y') \right)^2}{\sqrt{\sum_{x',y'} T(x',y')^2 \cdot \sum_{x',y'} I(x+x',\, y+y')^2}}</math>  (1)

In equation 1, T is the template image and I is the input frame in which we expect to find a match. The coordinates (x,y) represent the generic location in the input image whose content is being compared to the corresponding pixel of the template, located at (x',y'). R is the resulting matrix, and each location (x,y) in R contains the corresponding matching result. The minimum values in R represent the minimum differences between input image and template, indicating the most likely position of the feature in the image. Thus, while a perfect match will have a value of zero, a mismatch will have a larger sum of squared differences. When the mismatch value exceeds the confidence level [19], the tracking is lost.
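The following sketch shows this tracking step with OpenCV's cv2.matchTemplate, whose cv2.TM_SQDIFF_NORMED method computes exactly the normalized SSD of equation 1; the search margin and the tracking-loss threshold are placeholder values, not the confidence level used by the authors.

<pre>
import cv2

MAX_MISMATCH = 0.25   # tracking-loss threshold on the normalized SSD score (placeholder)
SEARCH_MARGIN = 40    # half-size of the ROI around the previous position, in pixels

def track_nose_tip(frame_gray, template, prev_xy):
    """Match the stored nose-tip template in a ROI around its previous position.

    Returns the new (x, y) top-left position of the match, or None when the mismatch
    exceeds the threshold (tracking lost, so detection must be run again)."""
    th, tw = template.shape
    px, py = prev_xy
    x0, y0 = max(px - SEARCH_MARGIN, 0), max(py - SEARCH_MARGIN, 0)
    roi = frame_gray[y0: py + SEARCH_MARGIN + th, x0: px + SEARCH_MARGIN + tw]
    # TM_SQDIFF_NORMED implements equation 1: a perfect match scores 0,
    # a mismatch scores closer to 1.
    result = cv2.matchTemplate(roi, template, cv2.TM_SQDIFF_NORMED)
    min_val, _, min_loc, _ = cv2.minMaxLoc(result)
    if min_val > MAX_MISMATCH:
        return None   # tracking lost
    return (x0 + min_loc[0], y0 + min_loc[1])
</pre>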
2.2 Projecting the Nose Tip for Navigation

Our second task is associating an action to the gaze. To this aim, we have to understand where the user is looking on the wall-sized display. Since we can approximately interpret the nose tip as the centroid of the head, in order to provide a coherent PoG estimation we have to solve the proportion that transposes the nose tip coordinates into the display reference system. To this aim, we geometrically project its coordinates onto the observed wall-sized display reference system. These new coordinates are calculated and then tracked with respect to the shown frame. The area of the wall-sized display is considered as a 3x3 matrix, as shown in Figure 4. What we do in the current implementation is to indicate in which cell of the matrix the gaze is falling.

Figure 4: Matrix Model of the Wall-Sized Display.

When the user stands in front of the display with the head centered in a frontal position, the geometric projection of his/her nose tip falls into cell #5 of the matrix (2nd row, 2nd column). We defined the size of the central row to obtain a kind of "comfort zone", where minor movements of the head do not trigger any movement of the rendered image. In detail, head rotations up to 15 degrees on the x axis and up to 8 degrees on both the y and the z axes do not affect the gaze position. With wider rotations, the projection of the nose falls in another cell, and the digital image will be shifted accordingly.

Figure 5: Input actions associated with the gaze directions.

According to the event-condition-action paradigm [22], the event is the identification of a fixation point; the condition is marked by the index of the cell in the 3x3 matrix, and the corresponding action is defined in Figure 5. In particular, as explained in Figure 5, when the PoG falls in cells #4 or #6, we associate the action of navigating the content to the left side or to the right side, respectively. When the user observes sections #2 or #8, the content will be navigated upwards or downwards; section #5 will be interpreted as the area in which no action will be executed. When the PoG falls in the remaining cells, the content will be navigated in the respective diagonal directions.

In the current implementation, since we are just associating a cell of the matrix to the PoG, the speed of the scroll is fixed and independent of the position of the PoG within a lateral cell of the matrix. We are currently implementing a new version of the navigation paradigm, where this 3x3 matrix will be replaced by a continuous function, in which the speed of the scroll will be proportional to the distance of the PoG from the center of the display.
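The mapping from head rotation to cell and action can be sketched as below; the comfort-zone thresholds come from the text, while the assumption that the rotation around the x axis selects the row and the rotation around the y axis selects the column (and all names) is ours.

<pre>
# Sketch of the navigation logic of Section 2.2: head rotation -> cell of the 3x3
# matrix (Figure 4) -> action (Figure 5). Axis-to-direction mapping is an assumption.

ACTIONS = {1: "up-left",   2: "up",   3: "up-right",
           4: "left",      5: "none", 6: "right",
           7: "down-left", 8: "down", 9: "down-right"}

def cell_index(rot_x_deg, rot_y_deg, x_limit=15.0, y_limit=8.0):
    """Return the 1-based cell index of the 3x3 matrix (cell 5 = comfort zone)."""
    row = 0 if rot_x_deg > x_limit else (2 if rot_x_deg < -x_limit else 1)
    col = 0 if rot_y_deg < -y_limit else (2 if rot_y_deg > y_limit else 1)
    return row * 3 + col + 1

def action_for(rot_x_deg, rot_y_deg):
    return ACTIONS[cell_index(rot_x_deg, rot_y_deg)]

# Inside the comfort zone no action is triggered; wider rotations scroll the content.
assert action_for(0.0, 0.0) == "none"
assert action_for(0.0, 12.0) == "right"
assert action_for(20.0, 0.0) == "up"
</pre>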
3. THE EMOTIONAL CONTRIBUTION

One of the problems with NUIs based on PoG estimation is that it is difficult to understand the reaction of the user in terms of interest towards the shown content [26].

To address this issue, we developed a further video processing module, intended as a complement to the system presented in the previous section and able to detect implicit user feedback. The output of this module can be used for a twofold objective: it could trigger real-time reactions from the system, and/or it can provide a powerful post-visit tool to the curator of the exhibition, with a log of the reactions of the visitors to the shown digital content. In this way, the curator could get a better insight into the content which is sparking the highest interest in the visitors. In the following we provide some technical details on how we faced this issue.

3.1 The Mydriasis

A wide range of medical studies proved that the brain reacts to emotional arousal with involuntary actions performed by the sympathetic nervous system (e.g. [9] [11]). These changes manifest themselves in a number of ways, such as increased heart beat, higher body temperature, muscular tension and pupil dilation (or mydriasis). Thus, it could be interesting to monitor one or more of these involuntary activities to discover the emotional reactions of the visitors while they are enjoying cultural content, in order to understand which details arouse pleasure.

In the age of wearable devices, there are many sensors with health-oriented capabilities, like for instance armbands or smartwatches, that could monitor some of these involuntary actions of our body. For instance, information about the heart beat or the body temperature can be obtained by means of sensors which retrieve electric signals once they are applied to the body. If on one side these techniques grant an effective level of reliability, on the other side they could influence the expected results of the experiments, as users tend to change their reactions when they feel under examination [11]. Moreover, they would require the visitors to wear some special device (which also implies high costs for the exhibition), which could be a non-viable solution in many contexts. For these reasons, we again looked for a remote solution, able to get an insight into the emotional arousal of the visitors without requiring them to wear any device.
Given the set-up described in Section 2.1, we tried to exploit additional information we can get from the video stream collected by the webcam. In particular, we tried to remotely monitor the pupils' behaviour during the interaction with the wall-sized display. Let us note that, as both pupils react to stimuli in the same way, we studied the behaviour of one pupil only.

Pupils are larger in children and smaller in adults, and the normal size varies from 2 to 4 mm in diameter in bright light, and from 4 to 8 mm in the dark [25]. Moreover, pupils react to stimuli in 0.2 s, with the response peaking in 0.5 to 1.0 s [15]. Hess presented 5 visual stimuli to male and female subjects and observed that the increase in pupil size varied between 5% and 25% [9].

3.2 Pupil detection

Before detecting the pupil, we have to locate and track the eye on the video stream coming from the webcam. The detection is performed by means of the Haar feature-based Viola-Jones algorithm [24], already cited in Section 2.1.1, while the tracking of the pupil is done with the template matching technique, as described in Section 2.1.2.

The detected ocular region contains eyelids, eyelashes, shadows and light reflexes. These represent noise for pupil detection, as they could interfere with the correctness of the results. Thus, the eye image has to be pre-processed before searching for the pupil size. We developed a solution including the following steps, in order to perform the pre-processing:

1. The gray-scaled image (Figure 6a) is blurred by means of a median filter, in order to highlight well defined contours;
2. The Sobel partial derivative on the x axis reveals the significant changes in color, allowing us to isolate the eyelids;
3. A threshold filter identifies the sclera.

As a result, these steps produce a mask which allows us to isolate the eye ball from the source image. Pupil detection is then performed on the source image as follows:

1. We drop down to zero (black) all the pixels having a cumulative distribution function value greater than a certain threshold [1] (Figure 6b);
2. We morphologically transform the resulting binary image by means of a dilation process, to remove the light reflexes on the pupil;
3. A contour detection operation identifies some neighbourhoods (Figure 6c);
4. The pupillary area is found by selecting the region having maximum area (Figure 6d);
5. The center of the ellipse (Figure 6e) best fitting the pupillary area approximates the pupil center (Figure 6f).
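A condensed sketch of this pipeline, assuming OpenCV 4's Python bindings, is given below; the blur kernel, the sclera threshold and the cumulative-distribution cut-off are placeholder values, and the eyelid-isolation step based on the Sobel derivative is collapsed into a single mask for brevity.

<pre>
import cv2
import numpy as np

def pupil_center(eye_gray):
    """Estimate the pupil center (cx, cy) in a grayscale eye image, or return None.
    Parameter values below are illustrative, not the authors' tuned settings."""
    # Pre-processing: median blur, then a coarse eye-ball mask.
    blurred = cv2.medianBlur(eye_gray, 5)
    # (The paper additionally uses a Sobel derivative along x to isolate the eyelids
    # and a threshold on the sclera; here one inverse threshold stands in for that mask.)
    _, mask = cv2.threshold(blurred, 180, 255, cv2.THRESH_BINARY_INV)

    # Keep only the darkest pixels of the masked region (low cumulative-distribution values).
    hist = cv2.calcHist([eye_gray], [0], mask, [256], [0, 256]).ravel()
    cdf = np.cumsum(hist) / max(hist.sum(), 1.0)
    cut = int(np.searchsorted(cdf, 0.05))          # darkest ~5% of the masked pixels
    binary = np.where((eye_gray <= cut) & (mask > 0), 255, 0).astype(np.uint8)

    # Dilation removes the light reflexes inside the pupil, then contours are extracted.
    binary = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=2)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)   # region with maximum area
    if len(largest) < 5:                           # cv2.fitEllipse needs >= 5 points
        return None
    (cx, cy), _axes, _angle = cv2.fitEllipse(largest)
    return (cx, cy)
</pre>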
Listing 1: A snippet of the logging file

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<reportCollection>
  <report id="0">
    <track idTs="1402674690300" section="1" mydriasis="0" />
    <track idTs="1402674690500" section="1" mydriasis="0" />
    <track idTs="1402674690700" section="1" mydriasis="0" />
    <track idTs="1402674690900" section="1" mydriasis="0" />
    <track idTs="1402674691100" section="1" mydriasis="0" />
  </report>
  <report id="1">
    <track idTs="1402675341320" section="1" mydriasis="0" />
    <track idTs="1402675341520" section="0" mydriasis="0" />
    <track idTs="1402675341720" section="0" mydriasis="0" />
    <track idTs="1402675341920" section="0" mydriasis="0" />
  </report>
</reportCollection>
</pre>

A thread saves the current timestamp, the index of the observed section and an integer value representing the pupil status. If the pupil has normal size, the pupil status is 0, otherwise it is 1. If the system does not detect a face for a given time (10 seconds, specifically), the interaction session is considered terminated and the collected information is stored in an XML document. The structure of the XML document is shown in Listing 1. The XML document is created and initialized with an empty reportCollection element when the application starts; then, when each interaction session ends, a new report subtree is created. The timestamp values univocally identify the respective
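A small sketch of such a session logger, following the layout of Listing 1, is given below; only the element and attribute names come from the listing, while the class, its methods and the file path are illustrative assumptions.

<pre>
import time
import xml.etree.ElementTree as ET

class SessionLogger:
    """Collect track samples per interaction session and persist them as in Listing 1."""

    def __init__(self, path="gaze_log.xml"):
        self.path = path
        self.root = ET.Element("reportCollection")   # empty root, created at start-up
        self.report = None
        self.next_id = 0

    def start_session(self):
        # Open a new report subtree when a visitor starts interacting.
        self.report = ET.SubElement(self.root, "report", id=str(self.next_id))
        self.next_id += 1

    def track(self, section, mydriasis):
        # Append one sample: timestamp in ms, observed cell index, pupil status (0/1).
        ET.SubElement(self.report, "track",
                      idTs=str(int(time.time() * 1000)),
                      section=str(section),
                      mydriasis=str(mydriasis))

    def end_session(self):
        # Called when no face has been detected for 10 seconds: persist the whole log.
        ET.ElementTree(self.root).write(self.path, encoding="UTF-8", xml_declaration=True)
        self.report = None
</pre>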