=Paper=
{{Paper
|id=Vol-1772/paper1
|storemode=property
|title=Listen to What You Look at: Combining an Audio Guide with a Mobile Eye Tracker on the Go
|pdfUrl=https://ceur-ws.org/Vol-1772/paper1.pdf
|volume=Vol-1772
|authors=Moayad Mokatren,Tsvi Kuflik,Ilan Shimshoni
|dblpUrl=https://dblp.org/rec/conf/aiia/MokatrenKS16
}}
==Listen to What You Look at: Combining an Audio Guide with a Mobile Eye Tracker on the Go==
Moayad Mokatren, Tsvi Kuflik and Ilan Shimshoni
The University of Haifa, Mount Carmel, Haifa, 31905
mmokat03@campus.haifa.ac.il, tsvikak@is.haifa.ac.il, ishimshoni@mis.haifa.ac.il

Abstract. This paper presents work in progress on integrating a mobile eye tracker into a museum visitors' guide system, so as to relieve the visitor from explicitly requesting information about objects of interest. The novel and most challenging aspects of the study are image-based positioning and identification of the visitor's focus of attention while using a commercially available mobile eye tracker. A prototype system has been developed and will be evaluated in a user study in a realistic setting. The focus of this paper is on possible solutions for real-time, efficient image-based positioning, the overall system design and the planned evaluation.

1 Introduction

Vision is our main sense for gathering information. When we want to gather information about something in our environment, we first look at it. Moreover, when we express interest in something, we look at it. However, the only information we get in this way is what we see: size, shape, color, distance, etc. Nowadays, a lot of additional information about the objects that we see is available online and can easily be accessed when one searches for it. Theoretically, it is available just a click or a query away: one only has to activate the mobile device, write the query, submit it, scroll through the results list, select the relevant result and access the relevant page. This is a rather complicated sequence of actions in a mobile scenario, where immediate, personalized and context-aware information is desired.

Current technology offers a variety of ways to deliver information to mobile users. Context awareness is the general term describing the attempt to deliver relevant information to the user at the relevant time and place. What is common to most context-aware services nowadays is that they make use of the communication and computational power (and sensors) of the users' mobile devices, mostly smartphones. In addition, they interact with their users mainly through the devices' touch screens, which have major limitations: they are small, the users have to look at them during the interaction, and they require typing on a keyboard or selecting icons. Even though voice commands can be used for activating applications, this option is still very limited.

A major challenge in the mobile scenario is to know exactly what the user is interested in. In classical human-computer interaction, users point with a dedicated device, most commonly a mouse, or by touching a touch screen. This becomes a major challenge in the mobile setting, as noted by Calvo and Perugini [2014], who surveyed novel pointing approaches for wearable computing. The user's position is the best hint, accompanied by the user's orientation. Still, there are many possibly interesting objects near and around the user. If we know what the user is looking at, and what the specific user's gazing profile is, then we can narrow down the possibly relevant objects of interest and better serve the user with relevant services and information when needed.
As we move towards "cognition-aware computing" [Bulling and Zander 2014], it becomes clearer that eye-gaze based interaction should and will play a major role in human-computer interaction before (or until) brain-computer interaction methods become a reality [Bulling et al. 2012]. With the advent of mobile and ubiquitous computing, it is time to explore the potential of mobile eye-tracking technology for natural, intelligent interaction of users with their smart environment, not only in specific tasks and uses, but for the more ambitious goal of integrating eye tracking into the process of inferring mobile users' interests and preferences in order to provide them with relevant services and information, an area that has received little attention so far.

Cultural heritage (CH) is a traditional domain for experimentation with novel computing technology. An intelligent mobile museum visitors' guide is a canonical case of a context-aware mobile system. Museum visitors move around the museum, looking for interesting exhibits, and wish to acquire information to deepen their knowledge and satisfy their interests. A smart context-aware mobile guide may provide the visitor with personalized, relevant information from the vast amount of content available at the museum, adapted to his or her personal needs. Mokatren et al. [2016] already presented a novel image-based positioning technique using a mobile eye tracker for a museum visit, where the position of the visitor is identified in a predefined museum layout and, once it is determined, an object of interest can be inferred. In this work we aim at developing an audio guide system that uses a mobile eye tracker on the go as a positioning system and as an implicit pointing device for natural interaction with the system, together with gesture recognition.

2 Background and Related Work

2.1 Requirements for a Museum Audio Visitor's Guide

The museum environment imposes many limitations, such as the restrictions not to make noise, not to talk loudly and not to touch anything. It is obvious that museum visitors' mobile guides should not be a replacement for traditional interpretation methods, but rather complement them [Economou, 1998]. Under these limitations, Cheverst et al. [2000] mentioned two key requirements for such guides. The first is flexibility: the system is expected to be sufficiently flexible to enable visitors to explore, and learn about, a museum in their own way, including controlling their own pace of interaction with the system. The second requirement is context awareness, meaning that the information presented to the visitors should be tailored to their personal context. The personal context includes, among other things, the visitor's interests, the visitor's current location and the exhibits already visited.

2.2 Image Based Positioning

Consider a device consisting of a forward-looking camera and an eye tracker. The device takes a picture while the user is fixating on a certain position within the image. The challenge is to recognize the object in the scene in order to deliver content related to this object to the user. When an image is taken by the front camera of the device, it can be matched against a set of existing images, where the goal is to find which of them shows the same scene as the test image. The matching algorithm should work in cluttered scenes (scenes from which objects have been removed or added), where the images were not taken from the same pose and where the illumination varies.
For this to work, local image features were developed that are unaffected by nearby clutter or partial occlusion. These features are at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, they must also be sufficiently distinctive to identify specific objects among many alternatives. Several types of local features have been developed; the most popular is SIFT [Lowe 1999], but others exist as well. The location-awareness procedure based on image matching works as follows:

1. A set of images of the exhibits is taken; each image may contain one or more objects. Each object that appears in an image is given a distinct label and a rectangular region around it (defined by its width and height).
2. An eye-tracker scene camera frame is captured and an image-to-image matching procedure is applied using SIFT features. The result is an image with labeled regions in the current scene frame. A pair of images is marked as matched if the percentage of matched feature points (as presented by Lowe [1999]) is larger than some threshold value (the threshold is determined by case-study evaluation).
3. Fixation mapping transformation: the fixation point is transformed from the eye-tracker scene camera frame to the corresponding labeled region in the matched dataset image from step 1.

The result of the above procedure is a location id (or an exhibit id in a museum visit) and a point/object of interest (the specific object in the exhibit that the visitor looked at).
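To make the procedure concrete, the following is a minimal sketch of steps 2 and 3 in Python with OpenCV, which provides a SIFT implementation via cv2.SIFT_create in recent versions. The dataset layout, the locate function, the threshold values and the homography-based fixation mapping are illustrative assumptions, not the authors' exact implementation.

```python
import cv2
import numpy as np

# Hypothetical dataset standing in for step 1:
# (image path, exhibit label, labeled rectangle (x, y, w, h)).
DATASET = [
    ("exhibits/oil_lamp.jpg", "oil_lamp", (120, 80, 200, 150)),
    ("exhibits/amphora.jpg", "amphora", (60, 40, 180, 220)),
]

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def locate(scene_img, fixation_xy, ratio=0.75, match_threshold=0.15):
    """Steps 2 and 3: match the scene frame against the dataset and map the fixation."""
    kp_s, des_s = sift.detectAndCompute(scene_img, None)
    if des_s is None:
        return None
    best = None
    for path, label, rect in DATASET:
        ref = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kp_r, des_r = sift.detectAndCompute(ref, None)
        if des_r is None:
            continue
        # Lowe's ratio test keeps only distinctive correspondences.
        pairs = matcher.knnMatch(des_s, des_r, k=2)
        good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        score = len(good) / max(len(kp_s), 1)
        if score > match_threshold and len(good) >= 4 and (best is None or score > best[0]):
            best = (score, label, rect, kp_r, good)
    if best is None:
        return None                      # no exhibit recognized in this frame
    _, label, (x, y, w, h), kp_r, good = best
    # Map the fixation point into the matched dataset image via a homography.
    src = np.float32([kp_s[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    fx, fy = cv2.perspectiveTransform(np.float32([[fixation_xy]]), H)[0][0]
    object_of_interest = label if (x <= fx <= x + w and y <= fy <= y + h) else None
    return label, object_of_interest     # exhibit id and the object the visitor looked at
```

As noted above, the ratio and the match threshold would have to be tuned empirically on the actual museum dataset.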
2.3 Pupil-Dev Mobile Eye Tracker

The Pupil eye tracker [Kassner et al. 2014], shown in Figure 1, is an accessible, affordable, and extensible open-source platform for pervasive eye tracking and gaze-based interaction. It comprises a lightweight eye-tracking headset, an open-source software framework for mobile eye tracking, and a graphical user interface to play back and visualize video and gaze data. Pupil features high-resolution scene and eye cameras for monocular and binocular gaze estimation.

Figure 1. Pupil eye tracker (http://pupil-labs.com/pupil)

2.4 Mobile Eye Tracker as a Pointing Device

Eye tracking is an active area of research in which significant progress has been made over a long time. Recently, Yousefi et al. [2015] surveyed a large variety of mobile eye-tracking applications and technologies, covering aviation, marketing, learning, medicine and more. Furthermore, as predicted (and surveyed) by Yousefi et al. [2015], relatively inexpensive, easy-to-use mobile eye trackers are appearing. Usually, they are experimented with in specific application areas and tasks. Mokatren et al. [2016] proposed a tool for location awareness, interest detection and focus-of-attention identification using computer vision techniques and mobile eye-tracking technology, focusing on the museum visit. The proposed tool is based on an image-based positioning technique, for which a set of images representing the layout of the museum is taken and stored for image-to-image comparison.

3 Research Goal and Questions

Our goal is to examine the potential of integrating eye-tracking technology as a natural interaction device into a mobile audio guide system, i.e., using the eye tracker as a natural pointing device in a smart environment, enabling the system to reason unobtrusively about the user's focus of attention and to suggest relevant information about it as needed. Our focus is on developing a framework for a museum audio guide that extends the work of Mokatren et al. [2016] to deliver information based on eye-gaze detection and image-based positioning. We will answer the following question: how can we integrate the mobile eye tracker as a pointing device in a system that delivers audio information to the visitor? For that purpose we have developed a prototype system that runs on a laptop and uses the Pupil Dev [Kassner et al. 2014] mobile eye tracker for identifying objects of interest and delivering informative content to the users.

In our study we considered several factors and constraints. The lighting conditions of the real environment (scenes vary at different times of day, e.g. under direct sunlight; see Figure 2) can greatly affect the image-based positioning process; for that reason, dataset images were taken at different times to ensure a successful positioning procedure. Another aspect was the position of the objects relative to the eye-tracker wearer, since the eye tracker is head-mounted and this is constrained by the environment layout.

Figure 2. The same exhibit at different times of day.

4 Context-aware, Mobile Audio Guide Framework

A key challenge in using mobile technology for supporting museum visitors is figuring out what they are interested in. This may be achieved by tracking where the visitors are and how much time they spend there [Yalowitz and Bronnenkant, 2009]. A more challenging aspect is finding out what exactly they are looking at [Falk and Dierking, 2000]. Given today's mobile devices, we should be able to gain seamless access to information of interest, without the need to take pictures or to submit queries and look through results, which are the prevailing interaction methods with our mobile devices. Lanir et al. [2013] discussed the influence of a location-aware mobile guide on museum visitors' behavior. Their results indicate that visitors' behavior was altered considerably when using a mobile guide: visitors using a mobile guide stayed in the museum longer and were attracted to, and spent more time at, exhibits where they could get information from the guide. Moreover, they argued that "While having many potential benefits, a mobile guide can also have some disadvantages. It may focus the visitor's attention on the mobile device rather than on the museum artifacts".

In this section we describe the implementation of the audio guide framework that addresses the two challenges mentioned above: it identifies the user's focus of attention accurately, and it does so unobtrusively. The system uses the Pupil Dev [Kassner et al. 2014] mobile eye tracker (as a pointing device for inferring the object of interest), a laptop (providing the computational power) and earphones (for audio information delivery). The system extends the image-based positioning technique presented by Mokatren et al. [2016] to deliver audio information about exhibits in the museum. A visitor wearing the mobile eye tracker, connected to a laptop carried in a backpack, enters the museum; when he or she looks steadily at an exhibit for approximately three seconds, the image-based positioning procedure starts and the location/position and point of interest are identified.
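The three-second "steady look" that triggers the positioning procedure can be illustrated with a simple dispersion-based dwell detector. This is only a sketch under assumed inputs (normalized gaze coordinates streamed from the eye tracker's scene camera); the class, thresholds and callback are hypothetical, and Pupil's actual streaming interface is not shown.

```python
import time

class DwellTrigger:
    """Fire a callback when the gaze stays within a small window for dwell_s seconds."""

    def __init__(self, on_dwell, dwell_s=3.0, max_dispersion=0.05):
        self.on_dwell = on_dwell              # called as on_dwell(x, y) when a dwell is detected
        self.dwell_s = dwell_s                # required fixation duration (seconds)
        self.max_dispersion = max_dispersion  # allowed gaze spread (normalized units)
        self._window = []                     # recent samples as (timestamp, x, y)

    def add_sample(self, x, y, t=None):
        t = time.monotonic() if t is None else t
        self._window.append((t, x, y))
        # Keep only samples from the last dwell_s seconds.
        self._window = [s for s in self._window if t - s[0] <= self.dwell_s]
        if t - self._window[0][0] < self.dwell_s - 0.1:
            return                            # window does not yet cover the dwell duration
        xs = [s[1] for s in self._window]
        ys = [s[2] for s in self._window]
        # "Steady" gaze: spatial dispersion stays below the threshold over the whole window.
        if max(xs) - min(xs) <= self.max_dispersion and max(ys) - min(ys) <= self.max_dispersion:
            cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
            self._window.clear()              # reset so the trigger fires once per fixation
            self.on_dwell(cx, cy)
```

On a detected dwell, the callback would grab the current scene-camera frame and pass it, together with the averaged fixation point, to the matching procedure sketched in Section 2.2.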
We have implemented two versions of the audio mobile guide:

1. Reactive: after identifying the position of the visitor and the point/object of interest, a "beep" sound is played and, immediately afterwards, audio information about the exhibit is delivered (see Figure 3).
2. Proactive: after identifying the position of the visitor and the point/object of interest, a "beep" sound is played and the system waits for a mid-air gestural action (a stop sign). Once the gesture is performed, the audio information is delivered (see Figure 4).

In both versions we added an option to stop the audio information delivery at any time by performing the mid-air gestural action (stop sign).

Figure 3. State machine diagram for the reactive version.

Figure 4. State machine diagram for the proactive version.

5 Experiment Design

The system will be evaluated in user studies; the participants will be students from the University of Haifa. The study will be conducted in the Hecht Museum (http://mushecht.haifa.ac.il/), a small museum located at the University of Haifa that has both archaeological and art collections.

The experiment will have a within-subject design comparing the two versions of the audio guide. The study will include an orientation on using the eye tracker, on the mid-air gestural interaction (one type of gesture, the "stop sign") and on the mobile guide, followed by a tour of the museum with the audio guide. The exhibits will be divided into three categories: small exhibits, large exhibits and showcases (vitrine shelves). Each case study will include exhibits from the three categories; we will differentiate the case studies by choosing different exhibits from the same category, to reduce learning effects as much as possible. Data will be collected as follows: the students will be interviewed and asked about their visit experience, and will be asked to fill in questionnaires covering general questions such as whether this is their first visit to the museum, their gender, their age, and more.

6 Discussion and Conclusions

In the CH setting, visitors' movement in space, time spent, information requested, vocal interaction and orientation have been used for inferring users' interest in museum exhibits. Adding eye gaze as an additional source may greatly enhance the ability to pinpoint the user's focus of attention and interest (e.g. in products or exhibits), and hence improve the ability to model the user and better personalize the service offered to her/him (e.g., exhibit or product information, shopping assistance). In this paper we presented a framework for a context-aware mobile museum audio guide that uses mobile eye-tracking technology to identify the location of the visitor and infer his or her point/object of interest. The audio guide framework consists of two versions, reactive and proactive: in the reactive version, audio information is delivered immediately once the point of interest is identified, whereas in the proactive version the visitor needs to perform a mid-air gestural action to start the audio delivery. The system has not been evaluated yet.

In the image-based positioning technique there is overhead time in matching the scene camera image against every image in the dataset. If the visitor stands at a fixed point and only a little time has passed since the last matching procedure, then we can search for a matching image among the physical nearest neighbors only. For that we need to represent the dataset as a graph, where each node represents an exhibit image/label and each arc value represents the physical distance between exhibits, as sketched below.
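The following is a minimal sketch of this neighbor-restricted search, assuming a hypothetical exhibit graph whose arc values are physical distances in meters; the node names, the search radius and the time window are illustrative only.

```python
import time

# Hypothetical exhibit graph: node -> {neighboring exhibit: physical distance in meters}.
EXHIBIT_GRAPH = {
    "oil_lamp":  {"amphora": 3.0, "coin_case": 5.5},
    "amphora":   {"oil_lamp": 3.0, "mosaic": 4.0},
    "coin_case": {"oil_lamp": 5.5, "mosaic": 6.0},
    "mosaic":    {"amphora": 4.0, "coin_case": 6.0},
}

def candidate_exhibits(last_exhibit, last_match_time, radius=6.0, max_age_s=60.0):
    """Return the exhibits worth matching against the current scene frame.

    If the last successful match is recent, restrict the search to that exhibit
    and its physical neighbors within `radius` meters; otherwise fall back to
    the whole dataset.
    """
    if last_exhibit is None or time.monotonic() - last_match_time > max_age_s:
        return list(EXHIBIT_GRAPH)                      # full search
    neighbors = [n for n, d in EXHIBIT_GRAPH[last_exhibit].items() if d <= radius]
    return [last_exhibit] + neighbors                   # restricted search
```

The restricted candidate list would then be passed to the image-matching step, so that only a handful of dataset images have to be compared with the current scene frame instead of the entire dataset.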
Future work will focus on optimizing the image-based positioning procedure by representing the museum layout as a graph, and then on evaluating the system in an experiment in a museum, in the realistic setting of a museum visit.

References

1. Bulling, A., Dachselt, R., Duchowski, A., Jacob, R., Stellmach, S., & Sundstedt, V. (2012). Gaze interaction in the post-WIMP world. In CHI'12 Extended Abstracts on Human Factors in Computing Systems, 1221-1224. ACM.
2. Bulling, A., & Zander, T. O. (2014). Cognition-aware computing. IEEE Pervasive Computing, 13(3), 80-83.
3. Calvo, A. A., & Perugini, S. (2014). Pointing devices for wearable computers. Advances in Human-Computer Interaction, 2014.
4. Cheverst, K., Davies, N., Mitchell, K., & Friday, A. (2000). Experiences of developing and deploying a context-aware tourist guide: the GUIDE project. In Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, 20-31. ACM.
5. Economou, M. (1998). The evaluation of museum multimedia applications: lessons from research. Museum Management and Curatorship, 17(2), 173-187.
6. Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, 1150-1157.
7. Kassner, M., Patera, W., & Bulling, A. (2014). Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 1151-1160. ACM.
8. Lanir, J., Kuflik, T., Dim, E., Wecker, A. J., & Stock, O. (2013). The influence of a location-aware mobile guide on museum visitors' behavior. Interacting with Computers, 25(6), 443-460.
9. Mokatren, M., Kuflik, T., & Shimshoni, I. (2016). Exploring the potential contribution of mobile eye-tracking technology in enhancing the museum visit experience. Accepted to the Workshop on Advanced Visual Interfaces for Cultural Heritage (AVI-CH 2016), co-located with AVI 2016.
10. Yalowitz, S. S., & Bronnenkant, K. (2009). Timing and tracking: unlocking visitor behavior. Visitor Studies, 12(1), 47-64.
11. Yousefi, M. V., Karan, E. P., Mohammadpour, A., & Asadi, S. (2015). Implementing eye tracking technology in the construction process. In 51st ASC Annual International Conference Proceedings.