Measuring Gaze Orientation for Human-Robot Interaction

R. Brochard∗, B. Burger∗, A. Herbulot∗†, F. Lerasle∗†
∗ CNRS; LAAS; 7 avenue du Colonel Roche, 31077 Toulouse Cedex, France
† Université de Toulouse; UPS; LAAS-CNRS; F-31077 Toulouse, France

1 Introduction

In the context of human-robot interaction, estimating gaze orientation provides useful information about the human focus of attention. It is contextual information: when you point at something, you usually look at it. Estimating gaze orientation requires head pose estimation. Several techniques exist to estimate head pose from images; they are mainly based on training [3, 4] or on the tracking of local face features [6]. The approach described here tracks local face features in image space using online learning. It is a mixed approach, since we track face features with some learning at the feature level. It uses SURF features [2] to guide detection and tracking; such key features can be matched between images and used for object detection or tracking [10]. Training-based techniques typically work on fixed-size, low-resolution images because of their computational cost, whereas approaches based on local feature tracking work on high-resolution images. Tracking face features such as the eyes, nose and mouth is a common problem in many applications, such as facial expression analysis or video conferencing [8], but most of these applications focus on frontal face images [9].

We developed an algorithm based on face feature tracking with a parametric model. First we detect the face, then we detect the face features in the following order: eyes, mouth, nose. In order to achieve detection up to full profile, we use sets of SURF to learn what the eyes, mouth and nose look like once tracking is initialized. Once these sets of SURF are known, they are used to detect and track the face features. A SURF has a descriptor which is often used to identify a key point; here we add global geometric information by using the relative positions between key points. We then use a particle filter to track the face features with these SURF-based detectors, compute the head pose angles from the feature positions, and pass the results through a median filter.

This paper is organized as follows. Section 2 describes our modeling of visual features and Section 3 presents our tracking implementation. Section 4 presents the results obtained with our implementation, and Section 5 concludes with future work.

2 Visual features

We use basic properties of the facial features to initialize our algorithm: the eyes are dark and circular, the mouth is a horizontal dark line with a specific color, etc. Then we use sets of SURF to learn a better description of the visual features. SURF are extracted together with their relative positions in order to keep geometrical information about the features. We use a similarity measure to compare two sets of SURF (see the probability measures in [10]), and we detect a feature as the maximum of this function over image space. This function is defined as:

\sigma(S_0, S_1) = \sum_{P \in S_0} \sum_{Q \in S_1} \exp\!\left(-\frac{\|P - Q\|_{2D}^2}{2\sigma_{2D}^2}\right) \times \exp\!\left(-\frac{\|P - Q\|_{desc}^2}{2\sigma_{desc}^2}\right)    (1)

where S_0 and S_1 are two sets of SURF, ‖·‖_{2D} and ‖·‖_{desc} are the Euclidean norms in image space (2D) and in SURF descriptor space (desc); σ_{desc} defines the local tolerance (descriptor space), whereas σ_{2D} defines the geometrical tolerance (image space).
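As an illustration of Eq. (1), the following minimal Python sketch compares two sets of SURF, each key point being stored as its 2D position relative to the feature center together with its descriptor. The array layout and the values of σ_{2D} and σ_{desc} are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def surf_set_similarity(S0, S1, sigma_2d=5.0, sigma_desc=0.2):
    """Similarity between two sets of SURF, following Eq. (1).

    S0, S1 : arrays of shape (n, 2 + d). The first two columns hold the key
    point position relative to the feature center (image space); the remaining
    d columns hold the SURF descriptor. sigma_2d and sigma_desc are the
    geometrical and descriptor tolerances (illustrative values).
    """
    p_xy, p_desc = S0[:, :2], S0[:, 2:]
    q_xy, q_desc = S1[:, :2], S1[:, 2:]

    # Pairwise squared Euclidean distances in image space and descriptor space.
    d2_xy = ((p_xy[:, None, :] - q_xy[None, :, :]) ** 2).sum(-1)
    d2_desc = ((p_desc[:, None, :] - q_desc[None, :, :]) ** 2).sum(-1)

    # Product of the two Gaussian kernels, summed over all pairs (P, Q).
    return np.sum(np.exp(-d2_xy / (2.0 * sigma_2d ** 2)) *
                  np.exp(-d2_desc / (2.0 * sigma_desc ** 2)))
```

A learned set can then be compared against the SURF extracted around each candidate location, the feature being detected at the maximum of this score over image space.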
Learning a set of SURF is a matter of selecting SURF that are representative of the feature and putting them into the same set with their relative positions (the origin can be set to the center of the feature).

3 Tracking implementation

We use a simple face model described by the positions of the eyes, the center of the mouth and the nose tip, similar to the one in [6]. We use 7 parameters (two absolute coordinates, two angles, eye spacing, scale, relative abscissa of the nose) filtered by an ICONDENSATION particle filter [1]. Likelihood measures and detection are based on the same probability maps, detected features being the maxima of those maps. A particle's likelihood is the product of all feature likelihoods. The maps are fused so that a high probability for one feature decreases the probability of the other features at the same location. The likelihood is computed using both standard properties of the tracked features and the similarity of the SURF sets. We store about 100 SURF for the mouth and 20 for the eyes; the number varies widely for the nose (from 0 to 40), depending on the visibility of the nostrils and on light saturation. Because the SURF similarity can be very small, we add a small constant ε (0.01) so that particles that are not too far off do not get a null likelihood. Since we cannot predict head movement, the dynamic model we use is a random walk; we only expect head motion not to be too fast.

4 Results

We recorded a video sequence of 1140 frames (Fig. 1) to compare the behavior of the algorithm with and without SURF and to study how the number of particles in the particle filter influences the result. This sequence shows a single person making head movements: pan and tilt rotations, fast head movements combining rotations and translations, and a simple "look at" movement. Because of the stochastic nature of particle filters, we ran the algorithm 21 times on the same sequence for each test. We ran 3 tests with this sequence: the first with 250 particles without SURF and only 50 particles with SURF, the second with 50 particles for both versions of the algorithm, and the last with 250 particles both with and without SURF.

Computing particle likelihoods with SURF is slower because of the similarity calculations involved in each likelihood evaluation; this is why the first test does not compare the two versions of the algorithm with the same number of particles. Without optimizing the likelihood calculations, the algorithm runs 200 particles with SURF as fast as 7000 particles without SURF. An interesting point is that the sets of SURF can learn the shape of a feature as it varies; as a direct consequence, the eyes can still be tracked when they are closed, which is useful when the person blinks. The average standard deviation over the 3 tests with SURF is 10.9° for tilt, 14.9° for pan and 4.4° for roll; without SURF we get 10.5° for tilt, 15.8° for pan and 8.09° for roll. We get a smaller standard deviation over several runs with SURF and a small number of particles than without SURF and with many particles. Nose tracking is not working properly yet, and all the results are obtained with an ambiguity on the sign of the angles which is not always resolved properly (the model we use requires the nose position to resolve this sign ambiguity), which greatly decreases the quality of the measures. We ran a few tests with a better nose detector (which also uses SURF) and already obtained much better results with SURF, since the learning is based on the output of the particle filter.
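As an illustration of the likelihood model and random-walk dynamics described in Section 3, the following minimal Python sketch computes a particle's likelihood as the product of per-feature scores (with the small constant ε preventing null likelihoods) and perturbs the 7-parameter state with Gaussian noise. The function names, the feature_score callback and all noise scales are illustrative assumptions, not the exact implementation.

```python
import numpy as np

EPSILON = 0.01  # small constant added to a feature score to avoid null likelihoods

def particle_likelihood(particle, probability_maps, feature_score):
    """Likelihood of one particle: product over the tracked face features.

    particle         -- the 7-parameter state (two coordinates, two angles,
                        eye spacing, scale, relative nose abscissa)
    probability_maps -- per-feature maps built from the SURF-based detectors
    feature_score    -- hypothetical callback returning the map value at the
                        image location the particle predicts for a feature
    """
    likelihood = 1.0
    for feature in ("left_eye", "right_eye", "mouth", "nose"):
        likelihood *= feature_score(particle, feature, probability_maps) + EPSILON
    return likelihood

def predict_random_walk(particles, rng, xy_noise=3.0, angle_noise=0.05):
    """Random-walk dynamics: add independent Gaussian noise to each component.

    particles has shape (N, 7); the per-component noise scales below are
    illustrative and only encode the assumption that head motion between
    consecutive frames is not too fast.
    """
    scales = [xy_noise, xy_noise, angle_noise, angle_noise, 1.0, 0.02, 0.02]
    return particles + rng.normal(scale=scales, size=particles.shape)
```

In the full tracker, such functions would sit inside the ICONDENSATION loop, with resampling and importance sampling from the SURF-based detections handled as in [1].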
We also ran the algorithm on the Yale B face database [7], on images with frontal lighting. The Yale B database contains gray-scale face images of 10 different persons under various lighting conditions and poses. With SURF we get an average error (over the 10 faces) of 3.15° for tilt, 10.3° for pan and 1.41° for roll; without SURF we get 4.06° for tilt, 11.1° for pan and 1.5° for roll.

Our current implementation adds a noticeable overhead per particle, but it can be optimized to have the same per-particle overhead as the implementation without SURF. It would then be only slightly slower, because of the need to extract SURF from the image, which took approximately 8 ms in our tests.

Fig. 1. Video sequences. The last image shows the SURF extracted from an eye.

5 Conclusion and future works

Using sets of SURF to detect and track face features makes the process more robust and stable under various poses, and allows detecting and tracking features whose shape varies. Our current implementation of the SURF similarity calculations is slow; with some optimization, using SURF would add only a small overhead and we could use more SURF. Taking the face geometry (the relative positions of the features) into account could also improve detection robustness. The main idea is to use sets of SURF to learn something to track while learning the changes in its shape as it moves. We are currently integrating our algorithm on a robot and plan to use both the human focus of attention and gesture recognition [5] to achieve better human-robot interaction.

Acknowledgments

This work was partially conducted within the EU STREP project Commrob, funded by the European Commission Division FP6 under contract FP6-045441, and the French ANR project AMORCES.

References

1. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50:174–188, 2001.
2. H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded-Up Robust Features. In European Conference on Computer Vision, pages 404–417, Graz, Austria, 2006.
3. B. Benfold and I. Reid. Colour invariant head pose classification in low resolution video. In British Machine Vision Conference, September 2008.
4. L.M. Brown and Y.-L. Tian. Comparative study of coarse head pose estimation. In Workshop on Motion and Video Computing, pages 125–130, December 2002.
5. B. Burger, G. Infantes, I. Ferrané, and F. Lerasle. DBN versus HMM for gesture recognition in human-robot interaction. In Int. Workshop on Electronics, Control, Modelling, Measurement and Signals, pages 59–65, Mondragon, Spain, July 2009.
6. A.H. Gee and R. Cipolla. Determining the gaze of faces in images. Image and Vision Computing, 12:639–647, 1994.
7. A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
8. J. Luo, C.W. Chen, and K.J. Parker. Face location in wavelet-based video compression for high perceptual quality videoconferencing. In International Conference on Image Processing, volume 2, pages 583–586, October 1995.
9. M. Pantic and L. Rothkrantz. Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1424–1445, December 2000.
10. H. Zhou, Y. Yuan, and C. Shi. Object tracking using SIFT features and mean shift.
Computer Vision and Image Understanding, 113(3):345–352, 2009.