ET4S 2014

Spatiotemporal windows for fixation detection

Tyler Thrash and Iva Barisic
Chair of Cognitive Science, ETH Zurich
{tyler.thrash,iva.barisic}@gess.ethz.ch

Abstract. Eye fixations are periods of relative stability derived from continuous eye position (or eye movement) data. In order to define eye fixations, researchers often assume that the eye(s) will not move beyond a particular spatiotemporal window (i.e., a spatial area towards which the eye is directed within a particular period of time). However, exact specifications of this window vary from field to field and even from one experiment to another. Efforts to standardize these specifications have assumed (either implicitly or explicitly) that there is one appropriate window size for describing eye behavior. The present paper explores an alternative approach. Specifically, we provide a method for determining the most appropriate spatiotemporal window that can vary from participant to participant (or task to task). This approach may also be extended to provide a metric for detection algorithm comparison.

Keywords: eye tracking • fixation detection • scene perception

1 Introduction

In order to be meaningful, eye tracking data needs to be classified into periods of movement (e.g., saccades) and periods of stability (e.g., fixations). During periods of movement, visual stimuli are usually considered inaccessible to the human observer. This phenomenon is called saccadic suppression [1]. Most of visual perception is based on information that is accessible during periods of stability [2]. Fixation detection algorithms attempt to determine what information is perceptually available by inferring which eye tracking data points represent periods of stability [3].

All of these algorithms essentially rely on the definition of what we call a "spatiotemporal window" (i.e., a spatial area towards which the eye is directed within a particular period of time).
Some detection algorithms (e.g., dispersion-based algorithms; [4]) emphasize the two spatial dimensions of this window by evaluating possible fixations in terms of the dispersion of data points around possible foci. However, these algorithms also typically incorporate lower and upper bounds for the "reasonable" duration of a fixation. Other detection algorithms (e.g., velocity-based algorithms; [4]) emphasize the temporal dimension of this window by classifying eye tracking data in terms of velocity and/or acceleration. These algorithms also typically include lower and upper bounds for the size of a fixation along spatial dimensions. Thus, the three-dimensional spatiotemporal window is a critical consideration for the implementation of both dispersion-based and velocity-based algorithms.

ET4S 2014, September 23, 2014, Vienna, Austria. Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

One assumption underlying most efforts to standardize specifications of the spatiotemporal window is that one set of parameters can be used to describe the eye behavior of all healthy adults (e.g., [5]), even though there is a good deal of variability in this behavior both within an individual and across individuals [6] [7] [8]. The variability not described by this set of parameters is typically considered "noise" (e.g., as resulting from the imprecision of the eye tracking equipment). Even algorithms that can be adapted to different noise profiles (e.g., [4]) assume the same spatiotemporal window for defining fixations. In contrast, the current approach allows for variability in the size of the spatiotemporal window across individuals and tasks.
The specification of spatiotemporal windows is especially critical when it is difficult to define the direction of a stimulus from the observer objectively (i.e., without relying on designations by other observers). This scenario is common for investigations of naturalistic scene perception and navigation because of the lack of clear boundaries between objects and/or the dynamic nature of the stimuli [9]. Except for sophisticated computational vision algorithms, there are no established methods for determining the objective "truth" to which a set of detected fixations (e.g., resulting from different detection algorithms) can be compared in these scenarios. The current approach extends a common technique for comparing mathematical models without needing to presuppose any particular objective truth.

2 Current approach

There are two primary applications of the current approach: the specification of the spatiotemporal window for different observers/tasks and the comparison of different detection algorithms.

2.1 Specification of the spatiotemporal window

Our general approach for specifying the most appropriate spatiotemporal window is to calculate error in the data points relative to the nearest detected fixation. Error, in this case, represents variability in the gaze data that is within the defined spatiotemporal window but cannot be explained by the set of fixations detected by a particular algorithm.

At most, six parameters are needed to describe spatiotemporal windows that reflect plausible (and interpretable) fixations. Researchers should start by defining the sizes of the spatial and temporal intervals. The spatial and temporal interval parameters determine which data points are used for calculating the error term of each detected fixation. Data points are only included in the following calculations if they fall within both the spatial and temporal intervals for any detected fixation.
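As a minimal sketch of this inclusion rule, the filtering step can be written as follows. The function names and the representation of samples as (x, y, t) tuples are ours, and we assume a circular spatial interval around each fixation's centroid, which the text does not specify:

```python
import math

def in_window(point, fixation_center, spatial_radius, temporal_radius):
    """Return True if a gaze sample lies inside the spatiotemporal window
    around a fixation. Both point arguments are (x, y, t) tuples."""
    x, y, t = point
    fx, fy, ft = fixation_center
    # Spatial interval: distance in the image plane (circular window assumed).
    within_space = math.hypot(x - fx, y - fy) <= spatial_radius
    # Temporal interval: absolute time difference.
    within_time = abs(t - ft) <= temporal_radius
    return within_space and within_time

def windowed_points(points, fixation_centers, spatial_radius, temporal_radius):
    """Keep only the samples that fall inside the window of at least one
    detected fixation, mirroring the inclusion rule described above."""
    return [p for p in points
            if any(in_window(p, f, spatial_radius, temporal_radius)
                   for f in fixation_centers)]
```

A rectangular (per-axis) spatial interval would work equally well here; only the `in_window` test would change.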
The distance function is calculated using the following equation:

d(p_1, p_2) = [w_1(x_1 − x_2)^m + w_2(y_1 − y_2)^m + (1 − w_1 − w_2)(t_1 − t_2)^m]^(1/m)    (1a)

Here, x_1 and x_2 represent the locations of two points along the horizontal axis, y_1 and y_2 represent the locations of two points along the vertical axis, t_1 and t_2 represent the locations of two points along the temporal dimension, the two w's represent the relative weighting of the two spatial dimensions with respect to the temporal dimension, m represents the type of Minkowski distance metric, and d(p_1, p_2) represents the distance between two points. For most applications, m should be constrained to be either 2 (resulting in a Euclidean distance metric) or 1 (resulting in a city-block distance metric). A city-block distance metric may be appropriate if researchers consider errors along the x and y dimensions as independent of each other. Other values for m are possible but difficult to interpret. The parameters w_1 and w_2 also need to be constrained so that each weight is greater than 0 and their sum is less than 1. Larger values for the w's indicate larger relative contributions of deviations along the corresponding spatial dimensions to the fit of the resulting model. Note that this distance function may need to accommodate differences in visual angle if, for example, two participants are fixed at different distances from the stimulus.

Equation 1a also assumes that the distribution of data points that represent each fixation is uniform rather than Gaussian (see, e.g., [10]).
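Equation 1a translates directly into a small function. This is an illustrative sketch (the function name and tuple representation are our assumptions, not from the paper); absolute differences are used so that the city-block case (m = 1) behaves sensibly:

```python
def minkowski_distance(p1, p2, w1, w2, m=2):
    """Weighted Minkowski distance between two (x, y, t) points, as in
    Equation 1a. m = 2 gives a Euclidean metric, m = 1 a city-block metric."""
    # Each weight must be greater than 0 and their sum less than 1.
    assert 0 < w1 and 0 < w2 and w1 + w2 < 1
    x1, y1, t1 = p1
    x2, y2, t2 = p2
    # Absolute differences keep every term non-negative for odd m (e.g., m = 1).
    return (w1 * abs(x1 - x2) ** m
            + w2 * abs(y1 - y2) ** m
            + (1 - w1 - w2) * abs(t1 - t2) ** m) ** (1 / m)
```

Note that x, y, and t are measured in different units, so the weights also absorb the unit conversion between pixels (or degrees of visual angle) and milliseconds.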
The utility of the uniformly distributed distance function can be compared empirically to the utility of the following normally distributed (and Euclidean) distance function:

d(p_1, p_2) = √( (1/(s√(2π)))^2 { w_1[1 − e^(−(x_1 − x_2)^2 / (2s^2))] + w_2[1 − e^(−(y_1 − y_2)^2 / (2s^2))] + (1 − w_1 − w_2)[1 − e^(−(t_1 − t_2)^2 / (2s^2))] } )    (1b)

Here, the only additional parameter is s, which represents the "steepness" of the normally distributed distance function. Note that s does not necessarily correspond to the standard deviation of the distribution of resulting distances. The w's should be constrained in the same manner as for the uniformly distributed distance function.

In order to determine which of several possible specifications is most appropriate for a particular detection algorithm, we then need to calculate the error term for each fixation:

e(fixation) = ( Σ_i d(p_i, p̄) ) / n_p    (2)

Here, p_i represents a data point with index i, p̄ represents the centroid for all of the data points within the spatiotemporal window, d represents the distance metric from Equation 1a or 1b, n_p represents the number of data points within the spatiotemporal window for a detected fixation, and e(fixation) represents the error term for the detected fixation (i.e., the mean of the distances from the centroid to each data point within the spatiotemporal window).

If researchers are comparing sets of detected fixations with spatiotemporal windows of the same size and shape, then sums of e(fixation) across sets of detected fixations are sufficient for comparing different detection algorithms. Across any range of spatial and temporal intervals, the smallest sum of e(fixation) will reveal the most appropriate spatiotemporal window for any given detection algorithm.
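The error term of Equation 2 can be sketched as follows. The function name and the list-of-tuples representation are illustrative assumptions; any metric of the form of Equation 1a or 1b can be passed in as `distance`:

```python
def fixation_error(points, distance):
    """Mean distance from the centroid to each sample within a fixation's
    spatiotemporal window (Equation 2). `points` is a non-empty list of
    (x, y, t) tuples; `distance` is a two-argument metric such as Equation 1a."""
    n_p = len(points)
    # Centroid of the windowed samples along each dimension (x, y, and t).
    centroid = tuple(sum(c) / n_p for c in zip(*points))
    # Mean distance from the centroid to each windowed sample.
    return sum(distance(p, centroid) for p in points) / n_p
```

Summing this value over all detected fixations then gives the quantity to be minimized when windows of the same size and shape are compared.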
However, in order to compare spatiotemporal windows with different shapes or siz- es, the error term needs to be converted into a measure that accounts for the number of free parameters or the number of detected fixations, respectively. Towards this end, the summed and squared error terms for all of the detected fixations of a given spatio- temporal window can be converted to Bayes’ information criterion (BIC): ∑ 𝑒(𝑓𝑖𝑥𝑎𝑡𝑖𝑜𝑛)2 𝐵𝐼𝐶 = �𝑛𝑓 × ln � �� + �𝑘 × ln�𝑛𝑓 �� (3) 𝑛𝑓 −1 Here, n f represents the number of detected fixations, k represents the number of free parameters, ln represents the natural logarithm function, and e(fixation) represents the error term from Equation 3. We consider each interval as only one parameter because the location of the fixation along a particular dimension and both boundaries of each interval are completely constrained by the determination of the size of the interval and the data. 2.2 Detection algorithm comparison The BIC can also be used in order to compare different fixation detection algo- rithms using Equations 1-3. The primary challenge for comparing different detection algorithms thus becomes determining which parameters are free to vary (see [11]). The BIC should be used to penalize the fit of any parameter that could have changed in order to improve the fit of the model to the data. Notably, this method does not require any assumptions regarding the “true” foci in the stimulus. 3 Future validation studies Future investigations can attempt to validate or invalidate our approach in at least two ways. First, following [5], researchers can direct participants to focus on individ- ual stimuli at known coordinates. This procedure is often used by eye tracking soft- ware for calibrating eye movement data before an experiment [12]. For validation purposes, fixations may be considered the periods of time during which a participant was asked to focus on a particular stimulus. 
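As a minimal illustration of the BIC computation in Equation 3 (the function name and argument layout are ours; the per-fixation error terms are assumed to have been computed beforehand via Equation 2):

```python
import math

def bic(fixation_errors, k):
    """Bayes' information criterion for a set of per-fixation error terms
    (Equation 3). `fixation_errors` holds one e(fixation) value per detected
    fixation; k is the number of free parameters of the window."""
    n_f = len(fixation_errors)
    # Summed squared error, normalised by n_f - 1 as in Equation 3.
    mean_sq_error = sum(e ** 2 for e in fixation_errors) / (n_f - 1)
    return n_f * math.log(mean_sq_error) + k * math.log(n_f)
```

Lower BIC values indicate the preferred window specification (or detection algorithm): the second term penalizes every additional free parameter, so a more flexible window must reduce the error enough to justify its extra parameters.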
The veracity with which the BIC metric determines the most appropriate spatiotemporal window (or best-performing detection algorithm) should then be reflected by similar patterns in other metrics (e.g., the number of detected fixations; [5]).

Second, the mean spatiotemporal window specified across individual participants may approximately correspond to established recommendations already in the literature (e.g., [5] [13]). This may occur if the primary advantage of the current approach is to account for additional variability, but this procedure could also be misleading if the current approach actually produces more accurate fixation detection than previous approaches.

4 Conclusions

The present paper provided a novel approach to the specification of spatiotemporal windows for fixation detection algorithms. This approach may also be applied to the comparison of different detection algorithms. Two future studies for potentially falsifying this approach are also briefly described.

5 References

1. Matin, E. (1974). Saccadic suppression: A review and an analysis. Psychological Bulletin, 81, 899-917.
2. Henderson, J. M. (2003). Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11), 498-504.
3. Salvucci, D. D., & Goldberg, J. H. (2000). Identifying fixations and saccades in eye-tracking protocols. Proceedings of the Eye Tracking Research and Applications Symposium, 71-78.
4. Nyström, M., & Holmqvist, K. (2010). An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data. Behavior Research Methods, 42, 188-204.
5. Komogortsev, O. V., Gobert, D. V., Jayarathna, S., Koh, D. H., & Gowda, S. M. (2010). Standardization of automated analyses of oculomotor fixation and saccadic behaviors. IEEE Transactions on Biomedical Engineering, 57, 2635-2645.
6. Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.
7. Hyönä, J., Lorch, R. F., Jr., & Kaakinen, J. K. (2002). Individual differences in reading to summarize expository text: Evidence from eye fixation patterns. Journal of Educational Psychology, 94(1), 44.
8. Rayner, K., & Raney, G. E. (1996). Eye movement control in reading and visual search: Effects of word frequency. Psychonomic Bulletin & Review, 3(2), 245-248.
9. Henderson, J. M., & Hollingworth, A. (1998). Eye movements during scene viewing: An overview. Eye guidance in reading and scene perception, 11, 269-293.
10. Santella, A., & DeCarlo, D. (2004). Robust clustering of eye movement recordings for quantification of visual interest. Proceedings of the Eye Tracking Research and Applications Symposium, 27-34.
11. Lewandowsky, S., & Farrell, S. (2010). Computational Modeling in Cognition: Principles and Practice. Thousand Oaks, CA: Sage Publications.
12. Hornof, A. J., & Halverson, T. (2002). Cleaning up systematic error in eye-tracking data by using required fixation locations. Behavior Research Methods, Instruments, & Computers, 34(4), 592-604.
13. Salthouse, T. A., & Ellis, C. L. (1980). Determinants of eye-fixation duration. The American Journal of Psychology, 93, 207-234.