<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognition: Enhancing Event-Based Eye Tracking with Motion-Aware Post-Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nuwan Bandara</string-name>
          <email>pmnsbandara@smu.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thivya Kandappu</string-name>
          <email>thivyak@smu.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Archan Misra</string-name>
          <email>archanm@smu.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing and Information Systems, Singapore Management University</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments. Our code implementations can be found at https://github.com/eye-tracking-for-physiological-sensing/EyeLoRiN.</p>
      </abstract>
      <kwd-group>
        <kwd>eye tracking</kwd>
        <kwd>event camera</kwd>
        <kwd>post processing</kwd>
        <kwd>local refinement</kwd>
        <kwd>model-agnostic</kwd>
        <kwd>jitter metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[…] input for real-time micro-expression-based inference systems that aim to detect and respond to users’ unspoken mental states.</p>
      <p>However, traditional camera-based eye tracking systems face well-known limitations in capturing these micro-level dynamics. Frame-based approaches typically operate at 30–1000 Hz and struggle with motion blur during rapid eye movements or transient behaviors such as blinking. In contrast, event-based vision sensors capture pixel-level brightness changes asynchronously with microsecond latency, yielding sparse but high-temporal-resolution data streams. These characteristics make event-based sensors ideally suited for fine-grained eye tracking, especially in contexts where temporal fidelity, motion robustness, and low-latency processing are critical.</p>
      <p>Despite these advantages, effective utilization of event data in eye tracking remains a challenge. The sparse and asynchronous nature of the data presents difficulties in processing and interpretation, requiring specialized models that can handle both the temporal and spatial dimensions of eye movements [10]. Spatio-temporal models, such as Change-Based ConvLSTM [11], graph-based event representations [12], and event binning methods [13], have emerged as promising approaches. These models aim to capture the dynamic and continuous nature of eye movements by encoding both spatial and temporal information from the event streams. This is particularly important because eye movements are inherently continuous, both in space and time, and spatio-temporal models attempt to leverage these properties to improve gaze estimation accuracy.</p>
      <p>However, while these models have shown promise, they suffer from several limitations. A key challenge in event-based eye tracking is the handling of blink artifacts, which cause interruptions in the event data and lead to erroneous gaze predictions [14]. Another limitation is the temporal inconsistency often observed in the predictions: eye movements are physiologically continuous, yet models sometimes fail to enforce this temporal smoothness, leading to abrupt gaze shifts that undermine tracking stability [15]. Additionally, existing models often fail to fully leverage local event distributions, resulting in misaligned gaze predictions. These challenges, coupled with the inherent label sparsity of event datasets, make it difficult to develop a universally robust event-based tracking system.</p>
      <p>To address these challenges, we propose a model-agnostic inference-time post-processing and local
refinement framework to enhance the accuracy and robustness of event-based eye tracking. Our
approach targets the shortcomings of existing spatio-temporal models by introducing lightweight,
post-processing techniques that can be integrated with any model without requiring retraining or
architectural changes. This makes our method flexible and easily applicable to a wide range of existing
models. The post-processing framework consists of two key components: (i) motion-aware median
filtering, which enforces temporal smoothness by taking advantage of the continuous nature of eye
movements, and (ii) optical flow-based local refinement, which improves spatial consistency by aligning
gaze predictions with dominant motion patterns in the local event neighborhood. These refinements
not only mitigate blinking artifacts but also ensure that gaze predictions remain temporally continuous
and spatially accurate, even in the presence of rapid eye movements or motion artifacts.</p>
      <p>By incorporating these post-processing and refinement techniques, our approach improves the
overall performance and robustness of event-based eye tracking. This makes it particularly valuable
in real-world applications where traditional models may fail due to the challenges posed by low-light
environments, high-speed motion, or intermittent artifacts like blinks. Furthermore, because our
framework is model-agnostic, it can be applied to any existing event-based eye-tracking model, offering
a significant boost to accuracy and stability without requiring changes to the core model.</p>
      <p>
        In this paper, we make the following key contributions:
• Model-Agnostic Post-Processing: We propose an inference-time refinement approach that enhances existing event-based pupil estimation models without modifying their architectures or requiring retraining, through (1) Motion-Aware Median Filtering, a median filtering technique that incorporates motion awareness to preserve temporal continuity in gaze predictions and mitigate blinking-induced errors, and (2) Optical Flow-Based Smooth Refinement, a local refinement strategy leveraging optical flow to ensure that predicted gaze positions align with the cumulative local event motion, reducing spatial inconsistencies in tracking results.
• Jitter Metric: We propose a complementary metric for the pupil tracking task to specifically evaluate the temporal smooth continuity of the predictions with respect to the true targets, based on (1) the global statistical distribution of pupil velocities, addressed via comparative velocity entropy, and (2) the local fine-grained frequency content of pupil velocities, addressed via spectral arc length-guided spectral entropy.
• Empirical Validation and Performance Gains: Through extensive experiments, we demonstrate that our proposed methods significantly enhance the robustness and accuracy of state-of-the-art event-based eye-tracking models across diverse conditions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Micro-expression Recognition</title>
        <p>Micro-expressions are rapid, involuntary facial movements that reveal transient emotional states often concealed from conscious awareness. Due to their subtlety and brevity, automatic recognition of micro-expressions remains a challenging task that has garnered significant interest within the computer vision and affective computing communities.</p>
        <p>Early approaches predominantly relied on handcrafted spatio-temporal features extracted from high-frame-rate facial video sequences, such as Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) [16], optical flow-based descriptors [17], and optical strain [18], to capture nuanced facial muscle dynamics. While these methods laid the foundation for subsequent advances, they were often limited by their dependence on feature engineering and sensitivity to noise. More recently, deep learning architectures, including 3D Convolutional Neural Networks [19] and Long Short-Term Memory (LSTM) networks [20], have been employed to automatically learn hierarchical representations from high-frame-rate facial video sequences, significantly improving recognition accuracy. Attention mechanisms have further enhanced model capacity by focusing on discriminative spatial and temporal regions [21]. The availability of specialized datasets such as CASME [22], CASME II [2], SMIC [23], and SAMM [24] has been foundational, providing high-frame-rate facial videos annotated with micro-expression labels under controlled environments.</p>
        <p>While facial cues remain the primary modality for micro-expression analysis, recent studies underscore the complementary value of ocular signals, including pupil dynamics and saccadic eye movements, as indicators of cognitive and affective states [25, 26]. Integrating eye-tracking data can enhance the robustness and granularity of emotion recognition, particularly in applications such as deception detection, clinical assessment of affective disorders, and adaptive human-computer interaction systems that respond to user engagement and mental workload.</p>
        <p>In this paper, we specifically address the challenge of fine-grained eye tracking, an important yet often overlooked component of micro-expression analysis, and propose novel techniques to improve its accuracy and temporal consistency, thereby enhancing the reliability of eye-based cognitive and affective state inference.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Event-based Pupil and Gaze Tracking</title>
        <p>Event-based cameras offer a fundamentally different sensing paradigm compared to conventional frame-based cameras. By asynchronously recording pixel-level brightness changes with microsecond temporal resolution and an extremely high dynamic range (120 dB) while consuming minimal power (milliwatt level), these sensors are uniquely suited for applications requiring precise, low-latency motion tracking [27]. Consequently, the eye tracking community has recently begun to explore event-based approaches for pupil and gaze tracking, motivated by the limitations of traditional RGB and infrared systems in terms of temporal resolution and power efficiency [12]. Current research on event-based eye tracking can be broadly categorized into two main approaches:</p>
        <p>Hybrid event-RGB: Several works have proposed combining event data with RGB frames to leverage the spatial resolution and structural detail of conventional cameras alongside the temporal advantages of event sensors. The approaches presented in [28] exemplify this direction by using RGB frames for initial pupil detection, with event streams employed to refine temporal tracking. While effective in improving robustness, such hybrid methods are inherently constrained by the frame rate of the RGB sensor, typically on the order of tens of milliseconds, which limits the full exploitation of the asynchronous, high-frequency event data. Moreover, these methods rely on RGB imagery and thus may suffer from environmental limitations, such as variable lighting conditions, to which event sensors are inherently more resilient.</p>
        <p>Event-Only Tracking: More recent work focuses exclusively on event streams, aiming to fully harness the unique properties of event cameras for pupil and gaze estimation. Event-only approaches [29, 12, 30] aggregate events into either 2D or 3D representations that are processed by either neural networks or traditional computer vision algorithms, and thus occasionally suffer from label sparsity and inefficient representations.</p>
        <p>Bandara et al. [12] proposed EyeGraph, a novel approach that constructs spatiotemporal graphs
from event data to represent pupil contours, and addressed the issue of label sparsity by proposing an
unsupervised graph-based clustering approach to spatially localise the pupil in a 3D event volume. In
contrast, Sen et al. [29] presented EyeTrAES, an event-based adaptive slicing mechanism that adaptively
adjusts the volume of the event accumulation based on the underlying eye motion.</p>
        <p>Despite these advances, event-only pupil tracking faces several open challenges. Designing
representations that preserve the high temporal fidelity of events without overwhelming computational
resources remains an active research area. Moreover, the limited availability of synchronized ground
truth data hinders large-scale supervised training. Event sensors also produce noise from
environmental artifacts such as illumination flicker or head movement, necessitating robust filtering and model
designs [27].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Spatio-Temporal Processing of Events</title>
        <p>Event-based eye tracking requires models that effectively capture the sparse, asynchronous nature of event streams while preserving high temporal resolution [27]. Unlike frame-based methods, event-based vision demands spatio-temporal representations that can model continuous eye movements with precision. Recent prior works have proposed several spatio-temporal event-processing models to capture the temporal evolution of the events. In this paper, we consider three classes of such models: (a) ConvGRU and ConvLSTM, (b) graph-based representations, and (c) event-binning methods.</p>
        <p>ConvLSTM and ConvGRU networks [11], originally designed for dense sequential data, struggle with the sparsity of event streams. Change-Based ConvLSTM (CB-ConvLSTM) mitigates this by leveraging change-based updates, focusing on local event dynamics rather than static frames [11]. This improves tracking accuracy by ensuring temporal continuity while maintaining event efficiency. Graph-based models encode events as nodes with spatio-temporal edges, preserving fine-grained motion patterns [12]. These approaches enhance tracking by leveraging local event dependencies, making them effective for eye movement estimation. Event binning methods aggregate events over predefined intervals to create structured inputs for deep temporal networks. While simple and efficient, methods like causal event volume binning strike a balance between temporal continuity and real-time feasibility [13].</p>
        <p>While these models enhance spatio-temporal representation, they often produce temporally inconsistent or spatially misaligned predictions. Our proposed model-agnostic inference-time refinement improves accuracy by enforcing temporal smoothness and aligning predictions with local optical flow. This enhancement operates independently of the underlying spatio-temporal model, making event-based eye tracking more robust across different architectures.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Motivation</title>
      <sec id="sec-3-1">
        <title>3.1. Inference-time Post-processing</title>
        <p>Even though current event-based pupil tracking models attempt to incorporate both spatial and temporal learning blocks within their pipelines, the empirical results show that they still suffer from poor performance when handling blink artifacts. Further, even though pupil movements are physiologically continuous and bounded, these models fail to strictly enforce this rule within the learning pipeline, leading to unstable pupil trajectories at inference time. To address these limitations, we propose a motion-aware median filtering technique as a post-processing method which penalizes trajectory outliers, whether due to blinking or tracking instability, based on an adaptive motion profiling mechanism, and thereby achieves a stable pupil trajectory. Additionally, given that the existing models often tend to prioritize global perceptible eye morphology over local heuristics deduced through event distributions, the pupil trajectory predictions often present an unavoidable offset. To circumvent this issue, we propose to utilize the optical flow around the original predictions such that, if the optical flow at the original prediction does not align with the local optical flow, an offset is assumed and subsequently corrected by shifting the prediction by a small defined margin. It is to be noted that both of these proposed techniques work in a model-agnostic fashion and thus can be flexibly applied to any existing model without it being re-trained or modified.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Jitter Metric</title>
        <p>
          Existing metrics for evaluating event-based pupil tracking performance, such as p-accuracy and pixel distances, only capture the per-sample positional accuracy of the methods and thereby neglect an essential aspect of pupil tracking: temporal smooth-continuity, which is critical for stable and user-ergonomic performance in many downstream applications such as foveated rendering, gaze-based human-computer interaction, user authentication, and affective-cognitive modelling [31, 29]. In addition, unlike gaze, which exhibits both smooth and rapid transitions that can sometimes be almost non-continuous with significantly higher angular acceleration [28], pupil movements are bounded and continuous in nature [12], which further validates the need for an explicit evaluation metric for the temporal smooth-continuity of pupil tracking. To this end, we propose a velocity-based metric which considers and weights both (1) the global statistical distribution of velocities, via the Kullback-Leibler divergence, and (2) the local velocity jaggedness, via spectral arc length-guided spectral entropy. Further, it is to be noted that our metric is comparative between the predicted and true pupil trajectories while being less susceptible to measurement or prediction noise in contrast to jerk-based smoothness scores. We theoretically and empirically show the efficacy of the proposed metric in the context of pupil tracking.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Inference-time Post-processing</title>
      <p>
        To address the shortcomings of existing methods at the inference stage, we propose to add two lightweight post-processing techniques specifically targeting the following limitations: (1) motion-aware median filtering (Algorithm 1) to (a) ensure the temporal consistency of the predictions, since eye movements are physiologically bound to be continuous in the spatial domain [12], and (b) reduce blinking artifacts; and (2) optical flow estimation in the local spatial neighbourhood (Algorithm 2) to smoothly shift the original predictions if the flow vector at the original prediction is unaligned with the cumulative local neighbourhood flow direction. Synoptically, (1) is motivated by our empirical observations, which convey the abundance of blinking artifacts within the predictions of the existing models (as shown in Fig. 1), whereas (2) is specifically inspired by our observations that the original predictions tend to neglect the event motion flow in the local neighbourhood, suggesting a lack of attention to the local event distribution in the original models.
      </p>
      <p>More descriptively, in the motion-aware filtering shown in Algorithm 1, we first estimate the local motion variance in the temporal dimension (i.e., within a set time window) using one of a set of alternative methods, including 0th- to 2nd-order kinetics, covariance, and frequency (for which the equations are defined below), and subsequently assign a median-based adaptive filter window for each time window, such that the kernel size for median filtering is adaptive and appropriate to the variability of the underlying pupil movement while also ensuring temporal consistency.</p>
      <p>[Figure 1 (a)–(e): example predictions with blink artifacts, including the global temporal variation of the x coordinates around the presented blinking case.]</p>
      <p>With respect to the local motion variance calculation, more mathematically, if the predicted pupil trajectory is $\mathbf{p}(t) = [x(t), y(t)]$ for discrete time index $t$, then the 0th-order (displacement-based) local motion variance estimate is
$$\bar{\sigma}_d(t) = \frac{1}{W} \sum_{i=-W/2}^{W/2} \| d(t+i) \|, \qquad d_t = \sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2},$$
with $W$ being the window size. In extension, velocity is the first derivative of position, $\mathbf{v}(t) = \frac{d}{dt}\mathbf{p}(t)$, and the smoothed velocity-based local variance can be calculated as $\sigma_{\text{vel}}(t) = \frac{1}{W}\sum_{i=-W/2}^{W/2} \|\mathbf{v}(t+i)\|$. Similarly, acceleration is the second derivative of the trajectory position, $\mathbf{a}(t) = \frac{d^2}{dt^2}\mathbf{p}(t)$, which leads to the local motion variance estimate $\sigma_{\text{acc}}(t) = \frac{1}{W}\sum_{i=-W/2}^{W/2} \|\mathbf{a}(t+i)\|$. With respect to the covariance-based local motion estimation, given a local window $\mathcal{W}_t = \{\mathbf{p}(t-W/2), \ldots, \mathbf{p}(t+W/2)\}$, we compute $\Sigma_t = \mathrm{Cov}(\mathcal{W}_t)$ and subsequently estimate the covariance-based motion variance as
$$\sigma_{\text{cov}}(t) = \|\Sigma_t\|_F, \quad (1)$$
where $\|\cdot\|_F$ denotes the Frobenius norm.</p>
      <p>When estimating the local motion variance through frequency features, we utilize the Short-Time Fourier Transform of the pupil trajectory signal $p(\tau)$, $S_p(t, f) = \int p(\tau)\, w(\tau - t)\, e^{-2\pi i f \tau}\, d\tau$, of which the power spectrum is $P_p(t, f) = |S_p(t, f)|^2$. Then, the frequency-domain motion variance is estimated from the power spectra of the $x$ and $y$ coordinate signals as $\sigma_{\text{freq}}(t) = \sqrt{\mathrm{Var}\big(P_x(t, f)\big) + \mathrm{Var}\big(P_y(t, f)\big)}$.</p>
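      <p>For illustration, a minimal sketch of the covariance-based variant in Eq. 1 is given below, assuming an (N, 2) NumPy array of predicted pupil coordinates; the function name and boundary handling are illustrative rather than the exact implementation.</p>
      <preformat>
# A minimal sketch of the covariance-based local motion variance of Eq. (1), assuming an
# (N, 2) numpy array of predicted pupil coordinates; the boundary handling is illustrative.
import numpy as np

def covariance_motion_variance(p, W=10):
    """sigma_cov(t): Frobenius norm of the covariance of the local window around t."""
    n = len(p)
    sigma = np.zeros(n)
    for t in range(n):
        lo, hi = max(0, t - W // 2), min(n, t + W // 2 + 1)
        window = p[lo:hi]                       # local window {p(t - W/2), ..., p(t + W/2)}
        cov = np.cov(window, rowvar=False)      # 2 x 2 covariance of x and y in the window
        sigma[t] = np.linalg.norm(cov, ord="fro")
    return sigma
      </preformat>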
      <sec id="sec-4-1">
        <title>In optical flow estimation as shown in Algorithm</title>
      </sec>
      <sec id="sec-4-2">
        <title>2, we first estimate the appropriate size for the</title>
        <p>region of interest (ROI) around the filtered prediction using the first order derivatives of
 and  and</p>
      </sec>
      <sec id="sec-4-3">
        <title>Algorithm 1 Motion-aware median filtering</title>
        <p>Require: Original predictions {  ,   }, base window for local motion variance estimation   ,
minimum allowed smoothing window   , maximum allowed smoothing window   , percentile
to determine adaptive window size  , method  (.) ∈ {displacement, velocity, acceleration, covariance,
frequency}
1: Output: filtered predictions { ( ,) ,  ( ,) }
2: local motion variance ⟵  ({  ,   },   )
3: smoothened variance ⟵ rolling mean(  , local motion variance)
4: median window ⟵ clipping(smoothened variance,   ,   )
5: adaptive windows ⟵ clipping(  ,   , rolling(median window,   ,  ))
6: { ( ,) ,  ( ,) } ⟵ rolling median({  ,   }, adaptive windows)
then, if the number of events within the selected ROI exceeds a set threshold, we accumulate and
determine the cumulative vector trajectory of the events within ROI to softly shift the filtered prediction
to further refine its spatial position.</p>
      </sec>
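      <p>A minimal sketch of the motion-aware adaptive median filter of Algorithm 1 follows, assuming NumPy arrays of raw x/y predictions and the displacement-based variance mode; the pandas rolling helpers and the exact window mapping are illustrative approximations of the steps above, not the authors' exact implementation.</p>
      <preformat>
# Motion-aware adaptive median filtering (Algorithm 1), illustrative sketch.
import numpy as np
import pandas as pd

def motion_aware_median_filter(x, y, base_window=10, w_min=5, w_max=20, pct=75):
    # step 2: local motion variance (0th-order, displacement-based) over the base window
    disp = np.hypot(np.diff(x, prepend=x[0]), np.diff(y, prepend=y[0]))
    local_var = pd.Series(disp).rolling(base_window, min_periods=1, center=True).var().fillna(0.0)

    # step 3: smooth the variance profile with a rolling mean
    smoothed = local_var.rolling(base_window, min_periods=1, center=True).mean()

    # steps 4-5: clip to the allowed window range, take a rolling percentile, clip again
    median_window = smoothed.clip(w_min, w_max)
    adaptive = median_window.rolling(base_window, min_periods=1, center=True).quantile(pct / 100.0)
    windows = adaptive.clip(w_min, w_max).round().astype(int).to_numpy()

    # step 6: per-sample rolling median with the adaptive window size
    xf = np.empty(len(x), dtype=float)
    yf = np.empty(len(y), dtype=float)
    for t, w in enumerate(windows):
        lo, hi = max(0, t - w // 2), min(len(x), t + w // 2 + 1)
        xf[t], yf[t] = np.median(x[lo:hi]), np.median(y[lo:hi])
    return xf, yf
      </preformat>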
    </sec>
    <sec id="sec-5">
      <title>5. Jitter Metric</title>
      <sec id="sec-5-1">
        <title>5.1. Background &amp; Definition</title>
        <p>Even though the existing metrics for the task at hand are beneficial to evaluate positional accuracy, none
is effective in evaluating the smooth continuity of the pupil movements, which is critical for the stable
performance in downstream applications such as foveated rendering or gaze-based interaction [31, 29],
especially given the bounded and continuous nature of oculomotor (i.e., pupil movement) activity.
Therefore, as a complementary metric, we propose the following jitter metric (Eq. 2) to specifically
evaluate the temporal smoothness of the predictions while also considering the true targets.</p>
        <p>
          When designing the jitter metric, we postulate two key premises based on pupil velocity, considering both global and local levels: if the predicted trajectory is significantly different in temporal cohesion (i.e., comparative smooth-continuity) from the true trajectory, then (1) the distributions of the motion (i.e., velocity) at the global level are statistically different from each other, suggesting that the predicted trajectory reproduces the true trajectory poorly, and (2) abrupt transitions and local jaggedness are reflected in the disparity of the frequency content, capturing the fine-grained local differences between the two trajectories.
        </p>
        <p>
          To embed premise (1) within our metric, we employ the Kullback-Leibler (KL) divergence due to its ability to measure the information loss with respect to the true velocity distribution, such that if the predicted velocities encompass a highly erratic or excessively smooth velocity distribution compared to the true velocities, the log-normalized comparative velocity entropy, measured through the KL divergence, reflects such a global difference with a higher value of our jitter metric. In contrast, to integrate premise (2) within our jitter metric, we utilize spectral arc length (SPARC)-guided spectral entropy (SPE), inspired by [32], which is a frequency-domain measure known for its lower sensitivity to noise than jerk-based smoothness metrics. Typically, a more complex frequency spectrum relates to a greater arc length and thereby a less locally smooth signal. Further, smoother velocity signals generally concentrate more of their energy at lower frequencies, and vice versa. Motivated by these observations, here we incorporate the ground-truth-anchored SPE difference between the predicted and true velocities as a reflective measure of the fine-grained local differences between the predicted and true trajectories (see Section 5.2 for our derivation of this equation).
        </p>
        <p>
$$JM\big(\text{pred}_{(x,y)}, \text{true}_{(x,y)}\big) = \alpha \cdot \frac{\left|\mathrm{SPE}(v_{\text{pred}}) - \mathrm{SPE}(v_{\text{true}})\right|}{\left|\mathrm{SPE}(v_{\text{true}})\right| + \epsilon} + (1 - \alpha) \cdot \log\Big(1 + D_{KL}\big(H[v(\text{pred})]\,\|\,H[v(\text{true})]\big)\Big) \quad (2)$$
        </p>
        <sec id="sec-5-1-1">
          <title>Algorithm 2: Rule-based optical flow estimation for smooth shifts</title>
          <p>Require: continuous event stream with $N$ events, filtered predictions $\{x_{(t,f)}, y_{(t,f)}\}$, scaling parameter $s$, count threshold $c$, difference threshold $d$.
1: Output: refined predictions $\{x_{(t,r,f)}, y_{(t,r,f)}\}$
2: timestep $\leftarrow \big(t_{(i=N)} - t_{(i=1)}\big) / |\{x_{(t,f)}, y_{(t,f)}\}|$
3: ROI size $r \leftarrow s \times 10$
4: for each filtered prediction, compute the absolute differences between the current prediction and the recent history of filtered predictions in $x$ and $y$
5: if either difference exceeds $d \times s$, collect the events that fall within the ROI of size $r$ around the filtered prediction between the previous and the current timestamp
6: if the number of events within the ROI exceeds the count threshold, accumulate the step-to-step displacements of the events in the ROI into a cumulative flow vector $(\Delta u, \Delta v)$
7: if the cumulative flow is non-zero, softly shift the filtered prediction along the normalized flow direction, i.e., $x_{(t,r,f)} \leftarrow x_{(t,f)} + \Delta u / \|\Delta u, \Delta v\|$ and $y_{(t,r,f)} \leftarrow y_{(t,f)} + \Delta v / \|\Delta u, \Delta v\|$, and update the previous timestamp.</p>
          <p>The two terms of the jitter metric in Eq. 2 are defined as follows. The comparative velocity-entropy term is the KL divergence between the velocity histograms of the predicted and true trajectories,
$$D_{KL}\big(H[v(\text{pred})]\,\|\,H[v(\text{true})]\big) = \sum_{i} H[v(\text{pred})](i)\, \log\!\left(\frac{H[v(\text{pred})](i)}{H[v(\text{true})](i)}\right), \quad (3)$$
whereas the spectral term is the SPARC-guided spectral entropy of a velocity signal $v$,
$$\mathrm{SPE}(v) = -\sum_{k} \log(f_k + \epsilon)\cdot \hat{V}(f_k). \quad (4)$$</p>
          <p>Here, $\text{pred}_{(x,y)}$ and $\text{true}_{(x,y)}$ are the predicted and true pupil trajectories, and $H[\cdot]$ is the function for velocity histogram estimation from the predicted and true trajectories. The input to SPE as in Eq. 4 is the velocity signal of either $\text{pred}_{(x,y)}$ or $\text{true}_{(x,y)}$, and $\hat{V}$ is the normalized Fourier magnitude of the respective velocity signal. The weight hyperparameter balancing the impact of the KL divergence and SPE terms is $\alpha \in [0, 1]$, and $\epsilon$ is a small constant ensuring numerical stability (i.e., avoiding division by zero). In summary, our jitter metric is computed as a weighted sum of the normalized SPE difference and the log-normalized KL divergence of the velocity histograms; by design, a lower value reflects a more similar comparative temporal smoothness between the true and predicted pupil trajectories. Further, a detailed set of theoretical analyses of the proposed jitter metric, including boundedness, formal constraints, continuity, and differentiability, is included in Section 5.3.</p>
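          <p>For clarity, a minimal sketch of the full jitter metric computation (Eqs. 2–4) is given below, assuming uniformly sampled (N, 2) NumPy arrays of predicted and true coordinates; the helper names, histogram binning, and smoothing choices are illustrative, not the authors' exact code.</p>
          <preformat>
# Illustrative sketch of the jitter metric (Eqs. 2-4).
import numpy as np

def spe(vel, fs, eps=1e-6):
    """Eq. 4: SPARC-guided spectral entropy of a 1-D velocity signal."""
    mag = np.abs(np.fft.rfft(vel))
    freqs = np.fft.rfftfreq(len(vel), d=1.0 / fs)
    v_hat = mag / max(mag.sum(), eps)            # spectrum as a distribution over frequency
    return -np.sum(np.log(freqs + eps) * v_hat)

def kl_velocity_histograms(v_pred, v_true, bins=32, eps=1e-6):
    """Eq. 3: KL divergence between eps-smoothed velocity-magnitude histograms."""
    lo = min(v_pred.min(), v_true.min())
    hi = max(v_pred.max(), v_true.max())
    p, _ = np.histogram(v_pred, bins=bins, range=(lo, hi))
    q, _ = np.histogram(v_true, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()              # smoothing keeps both histograms full-support
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def jitter_metric(pred_xy, true_xy, fs, alpha=0.75, eps=1e-6):
    """Eq. 2: weighted sum of the normalized SPE difference and the log-normalized KL term."""
    v_pred = np.linalg.norm(np.diff(pred_xy, axis=0), axis=1) * fs   # speed via finite differences
    v_true = np.linalg.norm(np.diff(true_xy, axis=0), axis=1) * fs
    spe_term = abs(spe(v_pred, fs) - spe(v_true, fs)) / (abs(spe(v_true, fs)) + eps)
    kl_term = np.log(1.0 + kl_velocity_histograms(v_pred, v_true))
    return alpha * spe_term + (1.0 - alpha) * kl_term
          </preformat>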
        </sec>
        <sec id="sec-5-1-2">
          <title>Assuming the frequency bins are uniformly spaced, we define:</title>
          <p>Δ =  +1 −   (is constant),
Δ ′ =</p>
          <p>We then simplify the arc length as with Δ  =  +1 −   and given that the spectrum is smooth
Δ  &lt;  where  (&gt; 0) is infinitesimally small and ∈ ℝ:</p>
          <p>SPARC ≈ −Δ ′ ∑</p>
          <p>1 + (
 −1
 −1
=1
=1 √
= −Δ ′ ∑ (1 +
= −( − 1)Δ ′ −
Δ 
Δ ′</p>
          <p>2
)
(
1 Δ 
2 Δ ′</p>
          <p>1
2Δ ′
 −1
=1</p>
          <p>2
) )
∑ (Δ  )2</p>
          <p>Δ

 −  1
 2
2
(using √1 +  2 ≈ 1 +
when || &lt;&lt; 1 )
detailed set of theoretical analysis on the proposed jitter metric, including the boundedness, continuity,
formal constraints, continuity, and diferentiability, is included in section
5.3.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Derivation of SPE Equation</title>
        <p>Original Discrete SPARC Definition. Let $\hat{V}(f)$ denote the normalized magnitude spectrum of the velocity signal $v(t)$, sampled at frequency $f_s$. Let $\{f_k, \hat{V}_k\}$ be the discrete points of this spectrum, with $f_k$ being the frequency bin and $\hat{V}_k$ the normalized magnitude of the Fourier transform at frequency $f_k$ (the velocity signal can optionally be filtered beforehand using a cutoff frequency and an amplitude threshold). With $K$ the number of frequency bins used, the SPARC metric is given by
$$\mathrm{SPARC} = -\sum_{k=1}^{K-1} \sqrt{\left(\frac{f_{k+1}-f_k}{f_{K}-f_1}\right)^2 + \left(\hat{V}_{k+1}-\hat{V}_k\right)^2},$$
which corresponds to the total normalized negative arc length in the frequency–magnitude plane.</p>
        <p>Step 1: Uniform Frequency Spacing Assumption. Assuming the frequency bins are uniformly spaced, we define $\Delta f = f_{k+1} - f_k$ (constant) and $\Delta f' = \Delta f / (f_{K} - f_1)$. We then simplify the arc length, with $\Delta \hat{V}_k = \hat{V}_{k+1} - \hat{V}_k$ and given that the spectrum is smooth, $\Delta \hat{V}_k &lt; \delta$ where $\delta\,(&gt;0)$ is infinitesimally small and $\in \mathbb{R}$:
$$\mathrm{SPARC} \approx -\Delta f' \sum_{k=1}^{K-1} \sqrt{1 + \left(\frac{\Delta \hat{V}_k}{\Delta f'}\right)^2} \approx -\Delta f' \sum_{k=1}^{K-1} \left(1 + \frac{1}{2}\left(\frac{\Delta \hat{V}_k}{\Delta f'}\right)^2\right) = -(K-1)\,\Delta f' - \frac{1}{2\Delta f'}\sum_{k=1}^{K-1} (\Delta \hat{V}_k)^2$$
(using $\sqrt{1+x^2} \approx 1 + \frac{x^2}{2}$ when $|x| \ll 1$). This shows that SPARC is negatively correlated with the squared variation in spectral magnitude and thus penalizes high spectral variation, which corresponds to non-smooth or jerky movements (Observation A).</p>
        <p>Step 2: Frequency–Magnitude Reinterpretation. We now reinterpret $\hat{V}(f)$ as forming a normalized (discrete) probability distribution over frequency (instead of normalizing by the maximum amplitude), penalizing higher spectral variation, such that $\hat{V}(f_k) \approx |V_k| / \sum_j |V_j|$. Motivated by Observation A, we propose a frequency-weighted sum that emphasizes smoothness by penalizing energy concentrated at high frequencies (note that the frequency-weighting term $\log(f_k + \epsilon)$ pushes the SPE towards a more negative sum for higher frequencies):
$$\mathrm{SPARC} \approx \mathrm{SPE} = -\sum_{k} \log(f_k + \epsilon)\cdot \hat{V}(f_k),$$
where $\epsilon &gt; 0$ is a small constant added for numerical stability. Even though this version drops the explicit arc-length geometry, it retains the original spirit of SPARC by favoring low-frequency spectral energy concentration, and is more computationally tractable and differentiable.</p>
        <p>Summary. We conclude with the following approximate proxy metric: $\mathrm{SPE}(v) = -\sum_{k} \log(f_k + \epsilon)\cdot \hat{V}(f_k)$.</p>
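        <p>To illustrate the proxy, the short sketch below compares the discrete SPARC value and the SPE value defined above on a toy smooth versus tremor-corrupted velocity signal; the sampling rate, the signal shapes, and the absence of a cutoff-frequency filter are illustrative assumptions.</p>
        <preformat>
# Toy comparison of discrete SPARC and the SPE proxy on two synthetic velocity signals.
import numpy as np

def normalized_spectrum(vel, fs):
    mag = np.abs(np.fft.rfft(vel))
    freqs = np.fft.rfftfreq(len(vel), d=1.0 / fs)
    return freqs, mag / max(mag.max(), 1e-12)        # magnitude normalized by its maximum

def sparc(vel, fs):
    """Discrete spectral arc length: more negative means a less smooth spectrum."""
    freqs, v_hat = normalized_spectrum(vel, fs)
    df = np.diff(freqs) / (freqs[-1] - freqs[0])      # normalized frequency increments
    dv = np.diff(v_hat)
    return -float(np.sum(np.sqrt(df ** 2 + dv ** 2)))

def spe(vel, fs, eps=1e-6):
    """SPE proxy: frequency-weighted sum over the spectrum renormalized to sum to one."""
    freqs, v_hat = normalized_spectrum(vel, fs)
    dist = v_hat / max(v_hat.sum(), eps)
    return -float(np.sum(np.log(freqs + eps) * dist))

fs = 100.0
t = np.arange(0.0, 2.0, 1.0 / fs)
smooth = np.sin(2.0 * np.pi * 1.0 * t)                    # slowly varying velocity
jerky = smooth + 0.4 * np.sin(2.0 * np.pi * 20.0 * t)     # added high-frequency tremor
for name, v in (("smooth", smooth), ("jerky", jerky)):
    print(name, round(sparc(v, fs), 3), round(spe(v, fs), 3))
        </preformat>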
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Theoretical Analysis on Jitter Metric</title>
        <sec id="sec-5-3-1">
          <title>5.3.1. Theoretical Justification</title>
          <p>Supplementary to our explanations in Section 5.1 and the derivation in Section 5.2, below we present two lemmas to show why the jitter metric captures smoothness.</p>
          <p>Lemma 5.1. Let $v(t)$ be a continuous-time signal. Then, the smoother $v(t)$ is, the more concentrated its frequency spectrum is in low frequencies. As a result, the spectral entropy is lower.</p>
          <p>Proof. The intuition behind the lemma, in high-level terms, is that a signal with higher smoothness has fewer rapid fluctuations, which in Fourier terms corresponds to the high-frequency components having lower magnitudes. The quantity $\mathrm{SPE}$ in the first term of the jitter metric reflects, in an absolute sense, how spread out the energy is across frequencies; i.e., a highly smooth signal has a low $\mathrm{SPE}$. Formally, in a more general sense, if the velocity function $v \in C^k(\mathbb{R})$ has $k$ continuous derivatives with $v, v^{(k)} \in L^1(\mathbb{R})$ and Fourier transform $\hat{v} \in L^1(\mathbb{R})$, then by the differentiation theorem, $\widehat{v'}(f) = 2\pi i f\, \hat{v}(f)$, and similarly, if $v^{(k)} \in L^1(\mathbb{R})$ for some $k \in \mathbb{N}$, then $\widehat{v^{(k)}}(f) = (2\pi i f)^k\, \hat{v}(f)$, so that $|\hat{v}(f)| \le \|v^{(k)}\|_{L^1} / |2\pi f|^{k}$. Therefore, through this decay property, it is possible to conclude that smoother signals, which admit larger $k$, have spectra that decay faster towards 0 at infinity. On the other hand, the log penalizing term in the first term of the jitter metric ensures that the greater the energy at high frequencies, the greater the contribution to $\mathrm{SPE}$ in an absolute sense, since $\log(f + \epsilon)$ is monotonically increasing for $f + \epsilon &gt; 1$. Therefore, considering the above, a smoother signal $v(t)$ has a faster-decaying spectrum and thus lower weights on high-frequency terms, and hence a lower spectral entropy.</p>
          <p>Lemma 5.2. Let $P$ and $Q$ be the discrete velocity magnitude distributions of the predicted and true trajectories. Then, $D_{KL}(P\|Q)$ quantifies how different the temporal dynamics are between the predicted and true trajectories, with larger divergence implying a greater discrepancy in movement similarity.</p>
          <p>Proof. The intuition behind the lemma, in high-level terms, is that if the predicted trajectory exhibits similar motion variability and regularity as the true trajectory, then less extra information is needed to encode samples from $P$ using a model based on $Q$. Formally, assuming both $P(x), Q(x) &gt; 0$ and $\sum_x P(x) = \sum_x Q(x) = 1$, the KL divergence has the following properties (from Gibbs' inequality): $D_{KL}(P\|Q) \ge 0$ and $D_{KL}(P\|Q) = 0 \iff P = Q$. In addition, as shown in Theorem 5.6, the KL divergence is finite only if the support of $P$ is contained within the support of $Q$; as an example, if $P$ assigns probability to velocities where $Q$ is almost 0, the divergence diverges. Based on the above properties and the log-likelihood ratio $\log\frac{P(x)}{Q(x)}$, it is possible to imply that (i) $P(x) \gg Q(x)$ for some $x$ gives $\log\frac{P(x)}{Q(x)} \gg 0$, which in turn implies that $D_{KL}$ increases, and (ii) similarly, $P(x) \ll Q(x)$ for some $x$ gives $\log\frac{P(x)}{Q(x)} \ll 0$ while its contribution is weighted by $P(x) \approx 0$. As a case study, it is possible to build a simple family of probability distributions parametrized by $\theta$. Let $P_{\theta=0} = Q$, so that $P_\theta$ differs from $Q$ as $\theta$ is modified. If the family of probability distributions is univariate Gaussian, then $Q$ is characterized by $\mathcal{N}(0, \sigma^2)$ and $P_\theta$ by $\mathcal{N}(\theta, \sigma^2)$. From the general Gaussian case, $D_{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$; since in our case $\sigma_1 = \sigma_2 = \sigma$, $\mu_1 = \theta$, and $\mu_2 = 0$, we obtain $D_{KL}(P_\theta\|Q) = \frac{\theta^2}{2\sigma^2}$. Therefore, it is trivial that $D_{KL}$ increases monotonically with $\theta$, i.e., when $P_\theta$ differs more from $Q$. A more generic proof can follow Donsker and Varadhan's variational formula in this context.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Formal Constraints</title>
          <p>The jitter metric is theoretically plausible under the following set of assumptions:
• The predicted and true trajectories must be at least $C^1$-continuous.
• The jitter metric must have the full support of $H[v(\text{true})]$.
• Both predicted and true trajectories must have bounded energies: $\int |\hat{v}(\cdot)|^2 &lt; \infty$.
• Additionally, as required by the task at hand, the velocity trajectories should be non-negative and temporally aligned.</p>
          <p>Lemma 5.3. Let $\text{pred}_{(x,y)}(t) \in C^1(\mathbb{R})$ be the predicted pupil trajectory. Then the velocity $\frac{d}{dt}\text{pred}_{(x,y)}(t)$ is continuous, and the velocity histogram $H_P$ is well-defined as a Radon–Nikodym derivative with respect to the Lebesgue measure.</p>
          <p>Proof. By the definition of $C^1$-continuity, $\text{pred}_{(x,y)}(t)$ is differentiable (on a closed and bounded interval) and $\frac{d}{dt}\text{pred}_{(x,y)}(t)$ is continuous; continuity of the velocity ensures that it is measurable. Since $\text{pred}_{(x,y)}(t)$ is $C^1$, the velocity is bounded on any compact interval $[a, b]$, $a, b \in \mathbb{R}$, by the extreme value theorem. Further, $\frac{d}{dt}\text{pred}_{(x,y)}(t)$ does not diverge since it has finite energy, i.e., $\int |\frac{d}{dt}\text{pred}_{(x,y)}(t)|^2\, dt &lt; \infty$ (as it is a physical trajectory, i.e., a pupil velocity trajectory). Consider the constructed velocity histogram $H_P$ on the time interval $[0, T]$, $T \in \mathbb{R}$, where $\mu$ is the Lebesgue measure and $B \subseteq \mathbb{R}$ is any measurable set:
$$H_P(B) = \frac{1}{T}\, \mu\Big(\big\{ t \in [0, T] : \tfrac{d}{dt}\text{pred}_{(x,y)}(t) \in B \big\}\Big).$$
As the velocity is continuous, hence measurable, the preimage $\{ t \in [0, T] : \frac{d}{dt}\text{pred}_{(x,y)}(t) \in B \}$ is Lebesgue measurable for any Borel set $B$. Further, as the velocity is bounded and $T &lt; \infty$ (due to being characteristics of physical trajectories), $H_P$ is a valid probability measure. Being a physical trajectory, the velocity is piecewise monotonic; therefore, it is possible to further extend that $H_P$ admits a probability density function $h(\nu)$:
$$h(\nu) = \frac{1}{T} \sum_{i} \left| \frac{d}{dt}\, v_{\text{pred}}(t_i) \right|^{-1},$$
where the sum is over the roots $t_i$ of $v_{\text{pred}}(t) - \nu = 0$. As a practical implication, the velocity of a discrete pupil trajectory (of $N$ samples) can be computed via finite differences, $v_k = \frac{\text{pred}_{(x,y)}(t_{k+1}) - \text{pred}_{(x,y)}(t_k)}{t_{k+1} - t_k}$, and if $\text{pred}_{(x,y)}(t)$ is $C^1$, the histogram of $\{v_k\}$ converges to $H_P$ as $N \to \infty$. A similar proof follows for the true pupil trajectory as well.</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>5.3.3. Scale Invariance</title>
          <p>Theorem 5.4. The jitter metric is scale-invariant for all $\lambda &gt; 0$ under negligible $\epsilon$.</p>
          <p>Proof. Consider a linear and consistent scaling factor $\lambda\,(&gt;0)$ applied to both the predicted and true trajectories, i.e., $\text{pred}'_{(x,y)}(t) = \lambda \cdot \text{pred}_{(x,y)}(t)$ and $\text{true}'_{(x,y)}(t) = \lambda \cdot \text{true}_{(x,y)}(t)$. Then the velocities scale as $\frac{d}{dt}\text{pred}'_{(x,y)}(t) = \lambda \cdot \frac{d}{dt}\text{pred}_{(x,y)}(t)$ and $\frac{d}{dt}\text{true}'_{(x,y)}(t) = \lambda \cdot \frac{d}{dt}\text{true}_{(x,y)}(t)$. Considering the first term of the jitter metric, from the scaling (linearity) property of the Fourier transform, $\mathcal{F}(\lambda\, v(t)) = \lambda\, \mathcal{F}(v(t))$, so $|V'_k| = \lambda |V_k|$. Since the spectrum is renormalized before computing SPE, $\mathrm{SPE}(\lambda v) = -\sum_k \log(f_k + \epsilon) \cdot \frac{\lambda V_k}{\sum_j \lambda V_j} = -\sum_k \log(f_k + \epsilon) \cdot \frac{V_k}{\sum_j V_j} = \mathrm{SPE}(v)$ under negligible $\epsilon$. The same can be proved for the true trajectory as well; therefore, the first term is scale-invariant for $\lambda &gt; 0$. Regarding the second term, for the scaled velocity densities $p'(\nu) = \frac{1}{\lambda} p(\nu/\lambda)$ and $q'(\nu) = \frac{1}{\lambda} q(\nu/\lambda)$, adapting the continuous definition of the KL divergence and changing variables $u = \nu/\lambda$:
$$D_{KL}(P'\|Q') = \int p'(\nu) \log\frac{p'(\nu)}{q'(\nu)}\, d\nu = \int \frac{1}{\lambda}\, p\!\left(\frac{\nu}{\lambda}\right) \log\frac{p(\nu/\lambda)}{q(\nu/\lambda)}\, d\nu = \int p(u) \log\frac{p(u)}{q(u)}\, du = D_{KL}(P\|Q).$$
Therefore $D_{KL}$ is scale-invariant, and so is $\log(1 + D_{KL})$. Therefore, from all of the above, the jitter metric is scale-invariant for all $\lambda &gt; 0$ with negligible $\epsilon$.</p>
        </sec>
        <sec id="sec-5-3-4">
          <title>5.3.4. Lower Bound</title>
          <p>Theorem 5.5. The jitter metric is lower bounded. In other words, for any predicted ($\text{pred}_{(x,y)}$) and true ($\text{true}_{(x,y)}$) trajectories, the metric satisfies $JM(\text{pred}_{(x,y)}, \text{true}_{(x,y)}) \ge 0$, with equality if $\text{pred}_{(x,y)} = \text{true}_{(x,y)}$.</p>
          <p>Proof. Both the first and second terms of the jitter metric are strictly non-negative (while $\alpha \in [0, 1]$): the first term is trivially non-negative (i.e., $|\mathrm{SPE}(v_{\text{pred}}) - \mathrm{SPE}(v_{\text{true}})| \ge 0$), and the second term is also non-negative since the KL divergence is always non-negative (Gibbs' inequality, $D_{KL}(P\|Q) \ge 0$, leads to $\log(1 + x) \ge 0$ for $x \ge 0$). Regarding the condition for equality: when the predicted and true trajectories are congruent in the 1D sense (i.e., $\text{pred}_{(x,y)} = \text{true}_{(x,y)}$), then $v(\text{pred}) = v(\text{true})$, so $|\mathrm{SPE}(v_{\text{pred}}) - \mathrm{SPE}(v_{\text{true}})| = 0$ and the first term of the jitter metric vanishes; similarly, $H[v(\text{pred})] = H[v(\text{true})]$ leads to $D_{KL}(\cdot) = 0$, and since $\log(1 + 0) = 0$, the second term also vanishes. Since both terms are strictly non-negative, their weighted sum (i.e., the jitter metric value) is also strictly non-negative. Further, $JM(\cdot) = 0$ is only possible when $\text{pred}_{(x,y)} = \text{true}_{(x,y)}$, which is the lower bound of the metric.</p>
        </sec>
        <sec id="sec-5-3-5">
          <title>5.3.5. Upper Bound</title>
          <p>Theorem 5.6. The jitter metric is not upper-bounded.</p>
          <p>Proof. The first term of the jitter metric is bounded if both $\mathrm{SPE}(v_{\text{true}})$ and $\mathrm{SPE}(v_{\text{pred}})$ are finite and real. Regarding the second term: let $H[v(\text{pred})]$ be a velocity distribution with support disjoint from $H[v(\text{true})]$, i.e., $H[v(\text{pred})]$ assigns non-zero probability to events where $H[v(\text{true})]$ has zero probability. Then $D_{KL}(H[v(\text{pred})]\,\|\,H[v(\text{true})]) \to \infty$, so $JM(\cdot) \ge (1-\alpha)\cdot\log(1 + \infty) = \infty$. Therefore, if $H[v(\text{pred})]$ has support disjoint from $H[v(\text{true})]$, the second term dominates and the metric becomes unbounded above.</p>
          <p>Lemma 5.7. Given that both $H[v(\text{true})]$ and $H[v(\text{pred})]$ have the same support $S$ and $H[v(\text{pred})]$ has a finite upper bound, then $D_{KL}(H[v(\text{pred})]\,\|\,H[v(\text{true})]) &lt; \infty$. This gives a sufficient condition for the KL divergence, and hence the metric, to be upper-bounded.</p>
          <p>Proof. Since $H[v(\text{true})]$ has compact support, $m = \inf_{x \in S} H[v(\text{true})](x) &gt; 0$; similarly, since $H[v(\text{pred})]$ has compact support, $\bar{M} = \sup_{x \in S} H[v(\text{pred})](x) &gt; 0$. Since both are on the same support and $H[v(\text{pred})]$ is bounded, $0 &lt; m \le \bar{M} &lt; \infty$ and $\sup_x \log\!\big(H[v(\text{pred})](x) / H[v(\text{true})](x)\big) \le \log\bar{M} - \log m$. Let $C = \log\bar{M} - \log m$; then
$$D_{KL}\big(H[v(\text{pred})]\,\|\,H[v(\text{true})]\big) \le \big(\log\bar{M} - \log m\big) \sum_{x \in S} H[v(\text{pred})](x) = C &lt; \infty.$$</p>
        </sec>
        <sec id="sec-5-3-6">
          <title>5.3.6. Continuity</title>
          <p>Theorem 5.8. The jitter metric is continuous everywhere except when $H[v(\text{true})]$ has zero-mass bins (i.e., if not smoothed). In other words, with full support, the jitter metric is continuous.</p>
          <p>Proof. Assume that both $\text{pred}_{(x,y)}$ and $\text{true}_{(x,y)}$ are differentiable on a closed and bounded interval (note that these are valid assumptions given the typical oculomotor, i.e., pupil, activity [12]). Then both $v(\text{pred})$ and $v(\text{true})$ are continuous functions, and under the $L^2$ norm, if $\|\text{true} - \text{pred}\| \to 0$ then $\|v(\text{true}) - v(\text{pred})\| \to 0$. Based on the assumption of differentiability on a closed and bounded interval (and thereby continuity, since differentiability implies continuity), both $v(\text{pred})$ and $v(\text{true})$ are Riemann integrable, which in turn implies that they are Lebesgue integrable; therefore, both velocity trajectories are in the $L^1$ space. For $v_{\text{pred}}(t)$, let the Fourier transform of the predicted velocity trajectory be $\hat{v}_{\text{pred}}(f) = \int_{\mathbb{R}} e^{-2\pi i f t}\, v_{\text{pred}}(t)\, dt$. Note that, for simplicity, we only consider the 1D case (the $x$ coordinate), considering one variable at a time in the trajectory; this is easily extendable to the multi-variable case, as the Fourier transform on multiple variables is well-defined. For any sequence $f_n \to f$, define $g_n(t) = e^{-2\pi i f_n t}\, v_{\text{pred}}(t)$; then $g_n(t) \to e^{-2\pi i f t}\, v_{\text{pred}}(t)$ pointwise, $g_n(t)$ is measurable for all $n \in \mathbb{N}$, and $|g_n(t)| \le |v_{\text{pred}}(t)|$, which is integrable. Therefore, by the dominated convergence theorem,
$$\lim_{n \to \infty} \hat{v}_{\text{pred}}(f_n) = \lim_{n \to \infty} \int_{\mathbb{R}} g_n(t)\, dt = \int_{\mathbb{R}} \lim_{n \to \infty} g_n(t)\, dt = \hat{v}_{\text{pred}}(f),$$
so $\hat{v}_{\text{pred}}(f)$ is uniformly continuous. Similarly, we can prove this for the true trajectory under the same set of assumptions. Since taking the absolute value, normalization, multiplication, and the logarithm (for $f &gt; 0$) do not violate continuity, the first term is continuous. Assuming soft histogramming (i.e., soft binning or kernel density estimation), in other words if $H[v(\text{true})]$ has full support ($H[v(\text{true})] &gt; 0$ everywhere), then $H[v(\text{pred})], H[v(\text{true})] \in (0, 1]$ and $H[v(\text{pred})]/H[v(\text{true})]$ is continuous. As the logarithm is continuous over $(0, \infty)$, under the said assumption the second term is also continuous. As the convex sum of continuous functions is also continuous, the jitter metric is continuous with full support.</p>
        </sec>
        <sec id="sec-5-3-7">
          <title>5.3.7. Differentiability</title>
          <p>Theorem 5.9. The jitter metric is differentiable almost everywhere except when (1) $\mathrm{SPE}(v_{\text{pred}}) = \mathrm{SPE}(v_{\text{true}})$, (2) $H_k = 0$ for any bin $k$, (3) $H[v(\text{pred})]$ has support disjoint from $H[v(\text{true})]$, and (4) $H[v(\text{pred})]$ is on the simplex boundary.</p>
          <p>Proof. If $\text{pred}_{(x,y)}$ and $\text{true}_{(x,y)}$ are differentiable on a closed and bounded interval (valid assumptions given the typical oculomotor, i.e., pupil, activity [12] as a physical signal), then, as in Theorem 5.8, both velocity trajectories are in $L^1$ and $\hat{v}_{\text{pred}}(f)$ is uniformly continuous; as before, we consider the 1D case, which extends directly to multiple variables. Assuming $t\, v_{\text{pred}}(t) \in L^1(\mathbb{R})$, define $g(t) = -2\pi i\, t\, v_{\text{pred}}(t)$, which is also integrable, with $\hat{g}$ uniformly continuous. Consider
$$\hat{v}_{\text{pred}}(f) - \hat{v}_{\text{pred}}(0) = \int_{\mathbb{R}} v_{\text{pred}}(t)\left[e^{-2\pi i f t} - 1\right] dt = \int_{0}^{f} \int_{\mathbb{R}} (-2\pi i t)\, e^{-2\pi i \nu t}\, v_{\text{pred}}(t)\, dt\, d\nu = \int_{0}^{f} \hat{g}(\nu)\, d\nu.$$
Therefore, by the fundamental theorem of calculus, $\hat{v}_{\text{pred}}(f)$ is differentiable almost everywhere; similarly, we can prove this result for the true velocity trajectory as well. Upon the immediate result above, the first term of the jitter metric is differentiable almost everywhere except when $|\mathrm{SPE}(v_{\text{pred}}) - \mathrm{SPE}(v_{\text{true}})| = 0$, since the absolute value function on $\mathbb{R}$ is not differentiable at 0; further, since $v_{\text{pred}} = v_{\text{true}}$ leads to a vanishing first term, the differentiability of the jitter metric is not defined when $\text{pred}(\cdot) = \text{true}(\cdot)$. Regarding the second term of the jitter metric, assume that $H[v(\text{pred})]$ and $H[v(\text{true})]$ can be parametrized using parameters $\theta \in \mathbb{R}^{d}$ and $\phi \in \mathbb{R}^{m}$ respectively, i.e., written as $H[v(\text{pred})](x; \theta)$ and $H[v(\text{true})](x; \phi)$, both $\in C^1$ on the space $\mathcal{X}$, with $H[v(\text{true})](x; \phi) &gt; 0$ for all $(x; \phi) \in \mathcal{X}$. For simplicity, we denote them as $P_\theta$ and $Q_\phi$ respectively. Therefore,
$$D_{KL}(P_\theta \| Q_\phi) = \sum_{x} P_\theta(x) \log\!\left(\frac{P_\theta(x)}{Q_\phi(x)}\right),$$
and, for a fixed $Q_\phi$, the partial derivative of the term with respect to $\theta$ is
$$\frac{\partial}{\partial \theta} D_{KL}(P_\theta \| Q_\phi) = \sum_{x} \frac{\partial P_\theta(x)}{\partial \theta} \left( \log\frac{P_\theta(x)}{Q_\phi(x)} + 1 \right).$$
Therefore, under the assumptions of $P_\theta, Q_\phi \in C^1$ in $\theta, \phi$ respectively on $\mathcal{X}$ and $Q_\phi(x) &gt; 0$ for all $x \in \mathcal{X}$ (and given $P_\theta$ is not on the simplex boundary), the above sum uniformly converges and the derivative exists and is continuous. Similarly, the partial derivative of $D_{KL}$ with respect to $\phi$ can be proven to be
$$\frac{\partial}{\partial \phi} D_{KL}(P_\theta \| Q_\phi) = -\sum_{x} P_\theta(x)\, \frac{\partial}{\partial \phi} \log Q_\phi(x),$$
which exists and is continuous under the same set of assumptions. A more detailed proof with gradient derivations can follow as proved in variational Bayes [33].</p>
        </sec>
        <sec id="sec-5-3-8">
          <title>5.3.8. Time Complexity</title>
          <p>The dominant operation in the first term is the Fourier transform. If the fast Fourier transform (FFT) is used on an array of length $n$, the time complexity of the FFT is $O(n \log n)$. As taking the magnitude, normalization, and the weighted sum are in $O(n)$, the time complexity of the first term is $O(n \log n + n) = O(n \log n)$. For the second term, assuming histograms with $B$ bins ($B \ll n$), soft-binning is of order $O(n \cdot B)$ and the (discrete) KL divergence is of order $O(B)$. Subsequently, the time complexity of the second term is $O(n \cdot B + B) = O(n \cdot B)$, and the time complexity of the jitter metric is $O(n \log n + n \cdot B) = O(n \log n)$.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Empirical Justification for Jitter Metric</title>
        <p>To empirically demonstrate the necessity and validity of the proposed jitter metric in the context of
pupil tracking, we present a set of controlled visual demonstrations in Fig. 2. These examples are
designed to contrast gaze prediction trajectories with varying degrees of temporal smoothness and
positional accuracy. In doing so, we highlight the limitations of conventional metrics, such as Mean
Squared Error (MSE), and illustrate how the proposed jitter metric serves as a complementary measure
that captures temporal continuity, a crucial yet often overlooked dimension in eye tracking evaluation.</p>
        <p>Each subfigure (Fig. 2b – Fig. 2g) represents a perturbed version of the predicted pupil trajectory
(Fig. 2a) obtained from the 3ET+ dataset. Perturbations were deliberately designed to simulate typical
error modes encountered in real-world event-based eye-tracking pipelines. These include: (i) low-amplitude random noise, (ii) blink-induced discontinuities, (iii) pixel shift artifacts, and (iv) high-frequency tremor-like oscillations. Each simulated prediction includes at least two of these perturbations
to reflect realistic degradation patterns observed in practice.</p>
        <p>We use the prediction in Fig. 2a as the reference trajectory. It achieves a moderate MSE of 15.83 and a jitter metric score of $JM_{\alpha=0.75} = 0.18$. This example serves as a baseline for comparing the other trajectories in terms of both spatial and temporal quality. In Fig. 2b, although the MSE is lower than that of the reference, the prediction exhibits reduced temporal smoothness. This discrepancy is effectively captured by our jitter metric, which assigns it a higher score, penalizing the temporal instability overlooked by conventional positional metrics. Conversely, the prediction in Fig. 2e demonstrates superior temporal smoothness, despite a slightly higher MSE. The jitter score reflects this improvement, underscoring the metric’s ability to reward smooth predictions even in the presence of minor spatial deviations.</p>
        <p>Figures 2c and 2f present a particularly instructive comparison: both predictions yield nearly identical MSE values but differ significantly in their temporal continuity. The proposed jitter metric distinguishes between the two, accurately assigning a lower score to the smoother prediction. These cases exemplify scenarios where traditional metrics fail to differentiate predictions with similar positional accuracy but markedly different perceptual quality.</p>
        <p>The trajectory in Fig. 2d is characterized by blink-related discontinuities and abrupt pixel
shifts—common artifacts in event-based systems. While its MSE is comparable to the reference, its elevated jitter
score reflects the increased temporal noise. This highlights the metric’s sensitivity to transient
disruptions that can compromise downstream tasks such as attention estimation or gaze-based interaction.</p>
        <p>Finally, Fig. 2g illustrates a case with relatively high MSE but excellent temporal smoothness. The
jitter metric appropriately assigns it a lower score than the reference, reinforcing its utility as a decoupled
measure of temporal fidelity.</p>
        <p>These examples validate the proposed jitter metric as a critical complement to existing positional
accuracy measures. By capturing trajectory-level smoothness, the metric provides a more holistic
evaluation of prediction quality, particularly in applications such as micro-expression recognition,
cognitive state inference, and gaze-based behavioral analytics, where temporal consistency is important.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments &amp; Results</title>
      <sec id="sec-6-1">
        <title>6.1. Datasets &amp; Base Models</title>
        <p>By following the recent challenge on event-based eye tracking [10], we test our method on the 3ET+
dataset [10, 30], since 3ET+ serves as the most prominent benchmark dataset for the task at hand. In
addition, since our proposed method is presented as a post-processing step and works in a model-agnostic
fashion, we select two recent models as base models: CB-ConvLSTM [11], and bigBrains [13], to show
the impact of the proposed pipeline towards improved pupil coordinate predictions in each case. More
descriptively, CB-ConvLSTM is a change-based convolutional long short-term memory architecture
which is specifically designed for efficient spatio-temporal modelling to predict pupil coordinates from
sparse event frames, whereas bigBrains attempts to preserve causality and learn spatial relationships
using a lightweight model consisting of spatial and temporal convolutions.</p>
        <p>Figure 2: Simulated gaze trajectories used to illustrate the jitter metric, with per-panel scores:
(a) MSE 15.83, JM 0.18; (b) MSE 9.82, JM 0.23; (c) MSE 14.40, JM 0.20; (d) MSE 59.41, JM 0.32;
(e) MSE 23.07, JM 0.16; (f) MSE 16.32, JM 0.11; (g) MSE 3.39, JM 0.05.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Implementation Details</title>
        <p>We implement and run all our post-processing blocks on a single V100 GPU machine with
the following settings for each proposed algorithm. For Algorithm 1, we set its two window parameters
to 5 and 20, the percentile used to determine the adaptive window size to 75, and the default
mode of the local motion variance estimation method to be covariance-based. For
Algorithm 2, we set the scaling parameter to 8, the count threshold to 5, and the difference
threshold to 2.</p>
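        <p>For concreteness, the listing below gives a minimal Python sketch of an adaptive (motion-aware) median
filter over predicted pupil coordinates. It assumes the values 5 and 20 above act as lower and upper window
bounds and that a covariance-based estimate of local motion variance gates the window size; the names
motion_aware_median_filter, w_min, w_max, and pct are illustrative, not the released implementation.</p>
        <preformat>
import numpy as np

def motion_aware_median_filter(preds, w_min=5, w_max=20, pct=75):
    """Sketch of an adaptive median filter over (T, 2) pupil predictions.
    A covariance-based estimate of local motion variance selects a small
    window when motion is high (to preserve saccades) and a large window
    when motion is low (to suppress blink-induced spikes)."""
    preds = np.asarray(preds, dtype=float)
    T = len(preds)
    out = preds.copy()
    # covariance-based local motion variance around each timestep
    var = np.array([
        np.trace(np.cov(preds[max(0, t - w_min):t + w_min + 1].T))
        for t in range(T)
    ])
    thresh = np.percentile(var, pct)          # adaptive-window cut-off
    for t in range(T):
        w = w_min if var[t] > thresh else w_max
        lo, hi = max(0, t - w // 2), min(T, t + w // 2 + 1)
        out[t] = np.median(preds[lo:hi], axis=0)
    return out
        </preformat>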
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Baselines</title>
        <p>Trivially, we consider the two base models described in Section 6.1 as baselines against which we
compare the proposed post-processing techniques. In addition, to further extend our analysis, we compare
our method with other recent works in the literature, including EyeGraph [12], MambaPupil [34],
FreeEVs [30], and SEE [35].</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Evaluation Metrics</title>
        <p>Along with the proposed jitter metric, we implement three other metrics: p-accuracy, mean Euclidean
distance (ℓ2) and mean Manhattan distance (ℓ1), which are used in recent works in the literature [30],
to quantitatively evaluate the performance of the proposed post-processing methods. As defined in [11],
p-accuracy, presented in Eq. 50, indicates the pixel-level accuracy of the predictions by checking whether the
Euclidean distance between the predicted and the true pupil coordinates is within a specified pixel
threshold. In this work, we set the pixel thresholds to 10, 5, and 1 following [30]. Further, since pupil
coordinate prediction is a regression task, we also incorporate two well-established regression metrics:
ℓ2 in Eq. 51 and ℓ1 in Eq. 52.</p>
        <sec id="sec-6-4-1">
          <title>Euclidean distance between the predicted coordinates (</title>
          <p>) and true coordinates ( 
 ) is within a
specified pixel threshold (  ℎ ). In this work, we set the pixel thresholds to be 10, 5, and 1 following [30].
Further, since the pupil coordinate prediction is a regression task, we incorporate two well-established
regression metrics:  2 in Eq. 51 and  1 in Eq. 52 as well.</p>
          <p>with  ( 
 ,  
 ,  ℎ) = {
 −</p>
          <p>‖ ≤  ℎ
{ ℎ} =
∑  ( 
 ,</p>
          <p>,  ℎ)

1
 =1
1 if ‖ 
0 otherwise


 2 = 1
 1 = 1
 =1
 =1
∑ ‖   −  
∑ |    −  
‖
 2
 |
(50)
(51)
(52)</p>
        </sec>
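        <p>As a minimal sketch (not the authors’ exact evaluation script), the metrics above can be computed
from arrays of predicted and true coordinates as follows; the function and variable names are illustrative.</p>
        <preformat>
import numpy as np

def evaluate(pred, gt, thresholds=(10, 5, 1)):
    """p-accuracy (Eq. 50), mean Euclidean distance (Eq. 51) and mean
    Manhattan distance (Eq. 52) between predicted and true pupil
    coordinates. pred, gt: (N, 2) arrays of (x, y) positions."""
    diff = pred - gt
    eucl = np.linalg.norm(diff, axis=1)            # per-sample Euclidean distance
    metrics = {"p%d" % th: float(np.mean(th >= eucl)) for th in thresholds}
    metrics["l2"] = float(np.mean(eucl))           # mean Euclidean distance
    metrics["l1"] = float(np.mean(np.abs(diff).sum(axis=1)))  # mean Manhattan distance
    return metrics
        </preformat>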
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Results</title>
        <p>In this section, we present a comprehensive evaluation of our inference-time post-processing framework
across four dimensions: standard positional accuracy metrics, our proposed jitter metric, per-component
ablation analysis, and computational complexity. We also present a set of qualitative results in Fig. 3 for
further demonstration.</p>
        <p>Positional Accuracy: Table 1 reports the performance of our inference-time refinement pipeline
on standard positional accuracy metrics, evaluated on predictions from the bigBrains model [13]. Our
post-processing techniques consistently improve the gaze localization accuracy across all benchmarks.
Notably, when applied to the base model, our method reduces the mean Euclidean distance (ℓ2) error by more than 5.1%
on average (on both validation and test datasets), outperforming all existing event-based eye tracking
approaches. These results demonstrate that our method enhances baseline predictions without any
model retraining or architectural modifications, validating its model-agnostic design.</p>
        <p>Computational Complexity: Table 2 quantifies the computational overhead introduced by our
post-processing modules. As our methods operate entirely at inference time and do not include any
trainable parameters, the additional computational burden is minimal. Specifically, motion-aware
median filtering and optical flow refinement require approximately 172 and 340 FLOPs per event frame,
respectively. Across all tested configurations, the total computational overhead remains below 0.00048%
of the base model’s cost. This confirms the practicality of our approach for real-time deployment on
edge devices with constrained resources.</p>
        <p>Temporal Smoothness via Jitter Metric: Table 3 highlights the benefits of our approach with
respect to temporal smoothness, as measured by the proposed jitter metric. Unlike traditional metrics
that emphasize spatial proximity to ground truth, the jitter metric captures the fine-grained continuity
of gaze predictions over time, an essential attribute for downstream applications such as mind-state
decoding or attention estimation. Our refinement modules yield significantly lower jitter scores
than both the base model and other state-of-the-art systems, indicating smoother and more stable
tracking output. This demonstrates that our method not only improves accuracy but also mitigates
high-frequency noise often induced by sensor sparsity or blinking.</p>
        <p>Ablation Analysis: To isolate the contribution of each refinement component, we conduct an
ablation study summarized in Table 4. We evaluate the individual effects of the motion-aware median
filtering and optical flow-based local refinement modules. The former proves effective in suppressing
transient spikes caused by blinks and sensor noise, while the latter aligns predictions more closely with
the underlying motion cues embedded in the event stream. When combined, these components exhibit
complementary benefits, leading to the highest overall improvements in both accuracy and smoothness.
These results validate the composability and robustness of our design.</p>
        <p>As shown in Tab. 4, both of our post-processing techniques consistently improve the results of the vanilla
predictions of each method. For instance, applying only the motion-aware median filtering improves the
vanilla prediction performance of [13] by reducing the ℓ2 error from 1.500 to 1.466, whereas applying both
motion-aware median filtering and optical flow-based local refinement leads to an ℓ2 of 1.423, an overall
improvement of 5.13% ((1.500 − 1.423)/1.500 ≈ 5.13%). Similarly, when we apply our model-agnostic methods
on [11]’s vanilla predictions, the performance improves from 7.922 to 7.504. These observations confirm
the validity of the proposed model-agnostic post-processing methods as a collective way of improving
existing models while also ensuring the efficacy of the individual blocks.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion &amp; Conclusion</title>
      <p>This work presents a practical, model-agnostic framework for enhancing event-based eye-tracking
pipelines at inference time. A key strength of our approach is its broad applicability: the proposed
post-processing techniques, i.e., Motion-Aware Median Filtering and Optical Flow-Based Local Refinement,
can be seamlessly applied to the output of any existing event-based pupil estimation model,
improving temporal stability and spatial coherence without retraining or modifying the model
architecture. This makes our method especially valuable in resource-constrained settings or when dealing with
black-box models, offering a lightweight way to boost performance across a wide range of real-world
applications. Additionally, the introduction of a dedicated Jitter Metric provides a complementary
measure of model quality, addressing a critical gap in evaluation criteria for time-sensitive behavioral
tracking.</p>
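      <p>Because the refinement operates purely on a model’s output trajectory, it can be expressed as a simple
chain of callables applied after inference; the sketch below is illustrative glue code under that assumption,
with refine_predictions and post_processors as hypothetical names rather than the released API.</p>
      <preformat>
import numpy as np

def refine_predictions(raw_preds, post_processors):
    """Apply a sequence of model-agnostic post-processing callables
    (e.g. a motion-aware median filter followed by an optical-flow-based
    local refinement closing over the event stream) to raw pupil
    predictions of shape (T, 2)."""
    out = np.asarray(raw_preds, dtype=float)
    for step in post_processors:
        out = step(out)     # each step maps (T, 2) predictions to (T, 2)
    return out
      </preformat>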
      <p>Despite these strengths, there are important limitations to acknowledge.</p>
      <p>• Our work currently focuses exclusively on the ocular modality. While eye movements,
particularly micro-saccades and pupil dynamics, are powerful indicators of cognitive states such as
attention, fatigue, or confusion, real-world mind-state inference typically benefits from
multimodal integration, combining gaze with facial micro-expressions, head dynamics, or physiological
signals. Extending our refinement pipeline and temporal metrics to accommodate or complement
such modalities remains an exciting direction for future work.
• Our evaluations are conducted on datasets collected in controlled laboratory settings, where
participants are relatively still, lighting is consistent, and noise in the event stream is minimal.
In contrast, real-world deployments, for example, in wearable settings or during naturalistic
interactions, introduce challenges such as head motion, background clutter, dynamic lighting,
and partial occlusions. These conditions may degrade the assumptions behind our refinement
techniques (e.g., motion coherence in flow estimation), and thus real-world validation is a critical
next step.
• While our approach improves temporal smoothness and reduces spatial jitter, it does not correct
fundamental prediction errors arising from poor base model performance. If a baseline model
consistently mispredicts pupil location due to sensor misalignment, incorrect calibration, or
biased training data, our method may smooth those errors rather than eliminate them. Therefore,
the method is best viewed as an enhancement layer for models that already offer reasonable
accuracy, rather than as a full corrective mechanism.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by both the Ministry of Education (MOE) Academic Research Fund (AcRF)
Tier 1 grant (Grant ID: 22-SIS-SMU-044), and by Singapore Management University’s Lee Kong Chian
Professorship Award. Any opinions, findings and conclusions or recommendations expressed in this
material are those of the author(s).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[11] Q. Chen, Z. Wang, S.-C. Liu, C. Gao, 3ET: efficient event-based eye tracking using a change-based
convlstm network, in: 2023 IEEE Biomedical Circuits and Systems Conference (BioCAS), IEEE, 2023, pp. 1–5.
[12] N. Bandara, T. Kandappu, A. Sen, I. Gokarn, A. Misra, EyeGraph: modularity-aware spatio temporal
graph clustering for continuous event-based eye tracking, Advances in Neural Information Processing
Systems 37 (2024) 120366–120380.
[13] Y. R. Pei, S. Brüers, S. Crouzet, D. McLelland, O. Coenen, A lightweight spatiotemporal network for
online eye tracking with event camera, in: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2024, pp. 5780–5788.
[14] J. W. Grootjen, H. Weingärtner, S. Mayer, Highlighting the challenges of blinks in eye tracking for
interactive systems, in: Proceedings of the 2023 Symposium on Eye Tracking Research and Applications,
2023, pp. 1–7.
[15] P. D. Allopenna, J. S. Magnuson, M. K. Tanenhaus, Tracking the time course of spoken word
recognition using eye movements: Evidence for continuous mapping models, Journal of Memory and
Language 38 (1998) 419–439.
[16] C. Guo, J. Liang, G. Zhan, Z. Liu, M. Pietikäinen, L. Liu, Extended local binary patterns for efficient
and robust spontaneous facial micro-expression recognition, IEEE Access 7 (2019) 174517–174530.
[17] M. Verburg, V. Menkovski, Micro-expression detection in long videos using optical flow and
recurrent neural networks, in: 2019 14th IEEE International Conference on Automatic Face &amp;
Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–6.
[18] S.-T. Liong, J. See, R. C.-W. Phan, A. C. Le Ngo, Y.-H. Oh, K. Wong, Subtle expression recognition
using optical strain weighted features, in: Computer Vision-ACCV 2014 Workshops: Singapore,
Singapore, November 1-2, 2014, Revised Selected Papers, Part II 12, Springer, 2015, pp. 644–657.
[19] Z. Xia, X. Hong, X. Gao, X. Feng, G. Zhao, Spatiotemporal recurrent convolutional networks for
recognizing spontaneous micro-expressions, IEEE Transactions on Multimedia 22 (2019) 626–640.
[20] M. Bai, R. Goecke, Investigating LSTM for micro-expression recognition, in: Companion Publication
of the 2020 International Conference on Multimodal Interaction, 2020, pp. 7–11.
[21] Y. Wang, S. Zheng, X. Sun, D. Guo, J. Lang, Micro-expression recognition with attention mechanism
and region enhancement, Multimedia Systems 29 (2023) 3095–3103.
[22] W.-J. Yan, Q. Wu, Y.-J. Liu, S.-J. Wang, X. Fu, CASME database: A dataset of spontaneous
micro-expressions collected from neutralized faces, in: 2013 10th IEEE International Conference and
Workshops on Automatic Face and Gesture Recognition (FG), IEEE, 2013, pp. 1–7.
[23] X. Li, T. Pfister, X. Huang, G. Zhao, M. Pietikäinen, A spontaneous micro-expression database:
Inducement, collection and baseline, in: 2013 10th IEEE International Conference and Workshops on
Automatic Face and Gesture Recognition (FG), IEEE, 2013, pp. 1–6.
[24] A. K. Davison, C. Lansley, N. Costen, K. Tan, M. H. Yap, SAMM: A spontaneous micro-facial
movement dataset, IEEE Transactions on Affective Computing 9 (2016) 116–129.
[25] E. H. Hess, J. M. Polt, Pupil size as related to interest value of visual stimuli, Science 132 (1960)
349–350.
[26] T. Partala, V. Surakka, Pupil size variation as an indication of affective processing, International
Journal of Human-Computer Studies 59 (2003) 185–198.
[27] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison,
J. Conradt, K. Daniilidis, et al., Event-based vision: A survey, IEEE Transactions on Pattern Analysis
and Machine Intelligence 44 (2020) 154–180.
[28] A. N. Angelopoulos, J. N. Martel, A. P. Kohli, J. Conradt, G. Wetzstein, Event-based near-eye gaze
tracking beyond 10,000 Hz, IEEE Transactions on Visualization and Computer Graphics 27 (2021)
2577–2586.
[29] A. Sen, N. S. Bandara, I. Gokarn, T. Kandappu, A. Misra, EyeTrAES: fine-grained, low-latency eye
tracking via adaptive event slicing, Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 8 (2024) 1–32.
[30] Z. Wang, C. Gao, Z. Wu, M. V. Conde, R. Timofte, S.-C. Liu, Q. Chen, Z.-J. Zha, W. Zhai, H. Han,
et al., Event-based eye tracking. AIS 2024 challenge survey, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2024, pp. 5810–5825.
[31] P. Majaranta, A. Bulling, Eye tracking and eye-based human–computer interaction, in: Advances
in Physiological Computing, Springer, 2014, pp. 39–65.
[32] S. Balasubramanian, A. Melendez-Calderon, A. Roby-Brami, E. Burdet, On the analysis of movement
smoothness, Journal of NeuroEngineering and Rehabilitation 12 (2015) 1–11.
[33] D. P. Kingma, M. Welling, Auto-encoding variational bayes, 2022. URL: https://arxiv.org/abs/1312.6114.
arXiv:1312.6114.
[34] Z. Wang, Z. Wan, H. Han, B. Liao, Y. Wu, W. Zhai, Y. Cao, Z.-J. Zha, MambaPupil: Bidirectional
selective recurrent model for event-based eye tracking, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2024, pp. 5762–5770.
[35] B. Zhang, Y. Gao, J. Li, H. K.-H. So, Co-designing a sub-millisecond latency event-based eye
tracking system with submanifold sparse CNN, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2024, pp. 5771–5779.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          , Emotions revealed,
          <source>Bmj</source>
          <volume>328</volume>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Casme ii: An improved spontaneous micro-expression database and the baseline evaluation</article-title>
          ,
          <source>PloS one 9</source>
          (
          <year>2014</year>
          )
          <article-title>e86041</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <article-title>Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition)</article-title>
          ,
          <source>WW Norton &amp; Company</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wezowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Penton-Voak</surname>
          </string-name>
          ,
          <article-title>An open label pilot study of micro expression recognition training as an intervention for low mood</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bi</surname>
          </string-name>
          , T. Chen,
          <article-title>Micro-attention for micro-expression recognition</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>410</volume>
          (
          <year>2020</year>
          )
          <fpage>354</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alipour</surname>
          </string-name>
          , É. Céret,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dupuy-Chessa</surname>
          </string-name>
          ,
          <article-title>A framework for user interface adaptation to emotions and their temporal aspects</article-title>
          ,
          <source>Proceedings of the ACM on Human-Computer Interaction</source>
          <volume>7</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bixler</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>D'Mello, Automatic gaze-based detection of mind wandering with metacognitive awareness, in: User Modeling, Adaptation</article-title>
          and Personalization: 23rd International Conference,
          <string-name>
            <surname>UMAP</surname>
          </string-name>
          <year>2015</year>
          , Dublin, Ireland, June 29-July 3,
          <year>2015</year>
          . Proceedings 23, Springer,
          <year>2015</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Eckstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guerra-Carrillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T. M.</given-names>
            <surname>Singley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Bunge</surname>
          </string-name>
          ,
          <article-title>Beyond eye gaze: What else can eyetracking reveal about cognition and cognitive development?</article-title>
          ,
          <source>Developmental cognitive neuroscience 25</source>
          (
          <year>2017</year>
          )
          <fpage>69</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Prasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Reich</surname>
          </string-name>
          , S. Makowski,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Jäger</surname>
          </string-name>
          ,
          <article-title>Improving cognitive-state analysis from eye gaze with synthetic eye-movement data</article-title>
          ,
          <source>Computers &amp; Graphics</source>
          <volume>119</volume>
          (
          <year>2024</year>
          )
          <fpage>103901</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Perrone</surname>
          </string-name>
          , et al.,
          <source>Event-Based Eye Tracking</source>
          .
          <year>2025</year>
          event
          <article-title>-based vision workshop</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>