-

SYSYEM

1613-0073

for Real-Time Psychometric and HCI Applications

Emanuele Iacobelli

iacobelli@diag.uniroma1.it 1 4

Davide Pelella

1 4

Valerio Ponzi

ponzi@diag.uniroma1.it 1 3 4

Samuele Russo

samuele.russo@duniroma1.it 2 4

Christian Napoli

cnapoli@diag.uniroma1.it 0 1 3 4

Eye Tracking, Machine Learning, Real-Time Application, Appearance-Based Eye Tracking System, Gaze Laterality Studies

0 Department of Computational Intelligence, Czestochowa University of Technology , 42-201 Czestochowa , Poland 1 Department of Computer, Automatic and Management Engineering, Sapienza University of Rome , 00185 Rome , Italy 2 Department of Psychology, Sapienza University of Rome , 00185 Rome , Italy 3 Institute for Systems Analysis and Computer Science, Italian National Research Council , 00185 Roma , Italy 4 learning , particularly Convolutional Neural Networks

2024

10 0000 0002

Eye-tracking technology has long been a valuable tool across various domains, and recent advancements in neural networks have significantly expanded its versatility and potential. However, real-world applications continue to face challenges such as accommodating users' natural movements, variations in lighting, occlusions of the eyes, and the limited availability of large, open-source datasets for training models. To address these issues, we developed a comprehensive pipeline that produces a lightweight and eficient model, requiring only an RGB camera as external hardware, making it easily deployable on standard PCs. Key input features include facial images, eye regions, head pose angles, the Eye Aspect Ratio (EAR), and a face grid that determines the face's location within the camera's frame. The model was trained using a custom dataset, in which participants were instructed to fixate on both randomly positioned points and the standard 9-point grid commonly employed in eye-tracking calibration. The resulting system was integrated into a real-time application, ofering fast and accessible gaze tracking, making it well-suited for studies requiring rapid gaze assessments across broad regions of the screen, such as psychometric research and Human-Computer Interaction (HCI) tasks. Its design is particularly advantageous for gaze laterality studies, which explore hemispheric dominance and attentional bias in cognitive and emotional processing, key concepts relevant to ADHD and dyslexia. Moreover, the system's capabilities naturally extend to emotional and decision-making tasks, where broad-area gaze tracking can support the analysis of preference formation and attentional patterns without the need for specialized hardware.

CEUR ceur-ws.org

1. Introduction

The human senses gather approximately 11 million bits of information per second, with about 80% being visual and the remainder distributed among the other senses. Due to the dominance of visual perception, AI-based technology [ 1, 2, 3, 4, 5 ] has become a valuable research tool in fields such as psychology [ 6, 7, 8, 9 ], marketing [ 10, 11 ], healthcare [ 12, 13, 14, 15 ], safety [ 16, 17 ], Human-Computer Interaction (HCI) [18, 19, 20, 21], and Virtual Reality (VR) and robotics [22, 23, 24]. This technology is particularly crucial in psychometric applications, facilitating studies on cognitive functions like focus, emotion recognition, and decision-making, as well as in gaze laterality research, where phenomena such as hemispheric dominance and attentional bias are investigated. Historically, professional systems relied on expensive hardware, such as scleral search coils [25], electrooculography [26], EEG (C. Napoli) Similarly, [ 31 ] investigated reading performance in children with ADHD, providing key insights into how the condition afects oculomotor control and reading ability, highlighting its potential for educational and clinical applications. in [ 32 ] a similar approach is used for to diagnose autism spectrum disorder. In addition, these systems have proven efective in detecting dyslexia by capturing distinctive eye movement patterns during reading tasks.

This approach, powered by CNNs, enables early identification of dyslexia, allowing for timely interventions [33]. To summarize, eye-tracking in gaze laterality research

Attribution 4.0 International (CC BY 4.0). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License provides a unique window into cognitive processes, allowing for a deeper understanding of how attentional 2.1.2. Appearance-Based Approach resources are allocated across the visual field. For that reason, the motivation behind developing our lightweight Appearance-based methods aim to learn a direct mapping real-time application is to enable more researchers to between the input image and the eye-gaze direction withstudy gaze movement patterns without the need to in- out relying on camera calibration or geometric models vest in expensive professional eye-tracking systems. By [44]. These methods are highly flexible, but they can be reducing the cost and complexity, we aim to make this sensitive to head movements. Currently, the most efectechnology more available for a wider range of studies tive approaches leverage convolutional neural networks focused on cognitive and neurological research. As eye- (CNNs) and their variants to create mapping functions. tracking becomes more accessible, its application in both While CNNs often achieve high accuracy on benchmark research and clinical environments will continue to grow, datasets, they can struggle to generalize across diferent ofering new avenues for understanding and addressing datasets unless trained on large-scale annotated datasets, these conditions. which are time-consuming and complex to create.

Recent works have made significant eforts to overcome these challenges by creating diverse and compre2. Related Works hensive datasets that improve the training and generalization of CNN models. For example, the MPIIGaze 2.1. Eye-Tracking Approaches dataset [45] is a widely-used resource that contains over 200,000 images of 15 participants captured in real-world In literature is possible to distinguish among 2 possible environments. This dataset helps improve gaze predicapproaches, free from heavy specific instrumentation: tion in unconstrained settings, with variations in lighting, Model-Based, and Appearance Based [ 34 ]. head pose, and other real-world factors. Similarly, ETH-XGaze [46] provides a large dataset 2.1.1. Model-Based Approach with high-quality annotations, including images from 110 The model-based approach utilizes a 3D geometric model subjects captured under a wide range of head poses and to determine the direction of the eye’s gaze. This is done lighting conditions. This dataset addresses the limitations by calculating a vector that connects the 3D positions of smaller datasets and enables CNN models to learn of the eyeball’s center and the pupil’s center. These po- robust gaze estimations in diverse environments. sitions are derived from 2D eye landmarks and the 2D Additionally, the FAZE dataset [47] is designed specifposition of the iris in the image, which are then projected ically to tackle domain generalization problems. FAZE onto the 3D model. Initially, research in this area focused includes a large number of participants and images across on developing accurate geometric models, but more re- diferent devices and environments, aiming to enhance cent advancements have shifted towards improving the the generalization of appearance-based gaze estimation precision of eye landmark detection using machine learn- models by incorporating domain adaptation techniques. ing methods [ 35, 36, 37, 38, 39, 40 ]. For instance, [48] introduced GazeCapture, a dataset of

For example, [ 41 ] describes an eye-tracking system videos recorded using smartphone front cameras under that uses the Kinect v2 sensor. This device, equipped varying lighting conditions and head movements. They with RGB and depth cameras, identifies facial landmarks used this dataset to train a CNN to predict the screen coand computes the 3D gaze vector by combining face ori- ordinates a user is looking at on a smartphone or tablet. entation with eye direction. Another system, presented The input to the CNN includes segmented images of the in [42], employs the Supervised Descent Method (SDM) eyes and face, as well as a mask showing the face’s loto detect 2D facial landmarks, while depth information cation in the image. To enhance real-time performance from the Kinect is used to estimate the user’s 3D head (10–15 FPS on modern mobile devices), the authors appose. The eye regions are further processed using the plied a technique called dark knowledge to reduce model Starburst algorithm to estimate the pupil center for accu- complexity. rate gaze tracking. An alternative approach, proposed by [49], works in a

A more recent approach [43] uses a combination of desktop environment and uses an RGB camera to track Unet and Squeezenet networks to significantly improve eye movement. The system first segments the eye region, the accuracy and memory eficiency of eye-gaze tracking, detects the iris center and the inner eye corner and then making it feasible even on smartphones. Although model- calculates an eye vector representing the eye’s movement. based techniques ofer the advantage of being training- A second-order polynomial mapping function, combined free and adaptable to various conditions, they can still with head pose information, is used to map this eye vecface challenges with the precision of landmark detection tor to screen coordinates while compensating for head and the accurate positioning of the iris. movements.

More recent work [50] shifts the focus from traditional eye-gaze tracking to time-varying signals such as the vertical displacement between the iris and the inner eye corner, which is less afected by head movements. Instead of a direct mapping function, this method uses a CNN to track multiple eye feature points, including the iris center and eyelid positions. These points are then used to generate eye movement signals, which are fed into a specialized CNN for user behavior recognition.

2.2. Challenges and Approach

limitations, a brand-new collection of the dataset was necessary, which was more suitable for the task of interest. To address these challenges a system was designed to record the user’s gaze on a PC’s screen optimizing the data to the task of interest.

Despite notable advancements, real-world applications of eye-tracking technologies continue to face significant challenges. These challenges arise from environmental factors such as varying lighting conditions, reflections in the images (e.g., glare), objects on the face (e.g., eyeglasses), diferences in contrast between the iris and pupil due to varying iris colors, and individual variations in eye anatomy. Additionally, the required computational resources, combined with the limited range of vertical eye movements, further complicate these implementations. Furthermore, the end-to-end approach relies on access to large-scale, publicly available datasets for train- 3.1.1. Recording ing, which presents an additional hurdle. As a result, despite their potential, these methods have not yet been widely adopted, often being overshadowed by specialized eye-tracking equipment designed for specific purposes.

To address these challenges, a comprehensive pipeline has been developed, encompassing dataset collection, model architecture design, and real-time testing. The goal is to utilize Convolutional Neural Networks to create an end-to-end gaze prediction system that uses only images captured from a standard laptop webcam, aiming to achieve real-time performance.

Data collection used 15-inch laptops in various environmental and lighting conditions. To mitigate the potential biases introduced by the use of a single webcam for all the data, multiple webcams from diferent computers were utilized. This strategy ensured the collection of a diverse set of images, simulating possible real-world applications and enhancing the robustness and generalization capability of the model limiting the bias introduction.

The custom dataset was gathered using specially developed software designed to display nine strategically chosen key points on the screen. These points included one at each corner of the screen, one at the center, and 3. Implementation one at each of the four cardinal directions on the screen, Nord, sud, est, and west, as illustrated in Figure 1. ParticiThis section explores the implementation of the entire pants were instructed to fixate on each point sequentially, pipeline, from data collection to the architecture and as they were shown, for a predetermined amount of time. real-time tracking, expanding the key components. This method allowed the collection of data samples for each gaze point while permitting participants to natu3.1. Dataset Collection rally adjust their head orientation and position like in a typical user interaction. Besides these 9 points, a variable To develop a robust eye-gaze tracking system using just a number of random points were also shown on the screen, portable computer’s webcam, the dataset is crucial. The one after the other. actual available ones present many limitations, such as Additionally, the data collection process included sesthe poor amount of data, poor quality data, or less lib- sions where participants were asked to wear glasses, to erty in the disposition and interaction of the user with enrich the dataset with varied and challenging condithe screen and the distance with the camera. Others, tions. with higher volume data, are based on mobile devices, not allowing an easy transition from vertical screens of smartphones to horizontal PC screens, similarly, the proximity and the relative angle of interaction to the device itself are drastically diferent. To overcome these 32–41

3.2. Data extraction and Annotation

3.2.1. Face, mask grid and eyes

Each video is then processed extracting candidate frames.

Each frame is inspected and the cropped face image is extracted if available. Face detection is executed using MediaPipe Face Detection, a lightweight model based on the BlazeFace architecture, which provides state-of-theart techniques optimized for real-time applications. This model also performs well under challenging conditions such as partial occlusions, diverse facial orientations, Figure 2: Facial landmarks provided by Dlib’s 68 model, which and varying lighting conditions. The MediaPipe detector detect the face and then the coordinate (x,y) of the 68 total outputs the coordinates (, , , ℎ) of the bounding box features, providing information about the aperture of mouth, around the detected face, which will be used to generate eye, and the orientation of the head. The points from 37 to the face grid. This grid will provide a spatial map of face 41 and from 43 to 48 will be leveraged for the computation of positioning within the video frame, helping the model to the Eye Aspect Ratio. Points 37 and 46 are leveraged for the understand where the face is positioned relative to the roll pose, points 28, and 9 for the tilt, and the 34, 37, and 46 entire frame. For each detected face, the bounding box for the yaw. coordinates (, , , ℎ) are scaled down to fit a grid of size 25 × 25. The bounding box is then mapped to this grid, marking cells where the face is 1 and all other cells Where 37, 38, … , 48 are the landmarks around the as 0. This binary grid serves as one of the inputs to the left and right eyes, respectively, according to the Figure 2. model, facilitating the learning of spatial relationships This metric facilitates the identification of eyelid position in the gaze estimation tasks. The pipeline proceeds to and blinks and leverages this information to improve the the detection of the eyes, which employs either Haar robustness of the model. cascades or the lib library, depending on which method yields the most accurate results on the specific conditions, 3.2.3. Roll-Pitch-Yaw as determined through a human-in-the-loop evaluation.

While the Haar Cascades already provide a bounding box The head orientation is derived from facial landmarks deto crop the region of interest of the eyes, the dlib uses the tected in each frame. Roll is determined by the tilt of the landmark features of the eyes, considers padding around, line connecting the outer corners of the eyes (landmarks and then crops. The eyes are not automatically included 37 and 46) relative to the horizontal axis, indicating left in the dataset, instead, each pair is inspected to ensure or right head tilt. The pitch measures the vertical tilt of they are successfully recognized and suficiently open. the head and is calculated from the vertical position of This check is crucial for confirming the quality of the the top of the nose bridge (landmark 28) relative to the data and that at least the horizontal position of the pupil chin (landmark 9), showing whether the head is tilted can be discerned, excluding instances where the eyes are upward or downward. Yaw, indicating left or right head fully closed. rotation around the vertical axis, is calculated from the

The face grid together with the face and eye images position of the nose tip (landmark 34) relative to the midare grouped with the gaze point as in Figure 3 and then point between the eyes (average of landmarks 37 and expanded with the additional input features. 46). These three angles provide a comprehensive 3D orientation of the head, enhancing the accuracy of gaze 3.2.2. Eye Aspect Ratio estimation without necessitating a 3D head model or extensive computations, making the system adaptable for real-time applications where the user’s head position varies.

In the end, this information is paired with the corresponding gaze point on the screen, selected from nine possible options.

If both eyes are correctly detected, the pipeline proceeds

to associate the corresponding Eyes Aspect Ratio. The EAR is a geometric measure used to quantify the openness of the eyes. It is computed for each eye using six specific facial landmarks. For the left eye, the EAR is calculated as follows:

EARℎ EAR = = ‖ 38 − 42‖ + ‖ 39 − 41‖

2‖ 37 − 40‖ ‖ 44 − 48‖ + ‖ 45 − 47‖ 2‖ 43 − 46‖

Some preprocessing steps were performed before feeding the data into the model for training to ensure the reliability and robustness of the system.

3.3.1. Image Resizing and Cropping Figure 4: Model Architecture pipeline: The model is organized in 2 parallel pipelines that work on the eye and face. The first, All images, face and eye regions, were resized to a uni- in red, takes as input the cropped eye images and the Eye form dimension of 64 × 64 pixels to maintain consistency Aspect Ratio computed with the facial landmarks. The CNNs across the dataset and to be fed into the model. that take as input the eyes share the parameters. The second, in blue, takes as input the cropped face, the Mask grid, and 3.3.2. Histogram Equalization tthhee hgeaazdeppooisnet.. TThheeni mthaegoeustapruetsblaurrerecdonfcoartpenriavtaecdytroecaosomnpsu.te 3.4.1. Model Architecture Histogram equalization was employed to improve feature extraction. This technique adjusts pixel values in an image to enhance overall contrast. By redistributing the intensity levels, it equalizes the histogram of the output image. This process makes the model more robust in identifying relevant features under varied lighting conditions.

The model architecture draws inspiration from the

iTracker model [48], incorporating modifications to enhance performance. These modifications include additional input features such as head pose angles (yaw, pitch, 3.3.3. Data Augmentation and roll), the Eye Aspect Ratio (EAR), and the reorganization and reduction of the layers, to provide a lighter Several data augmentation techniques were applied to model with faster convergence. The complete pipeline is enhance the robustness of the model. Specifically, a ran- shown in Figure 4. dom crop was used to simulate limited visibility of the The Eye Aspect Ratio incorporation started from the face or eyes, and Gaussian Blur was employed to mimic consideration that, in normal conditions, users will tend poor image quality or focus. Variability in brightness to open their eyes wider when looking at higher points and saturation was introduced, along with random rota- on a screen and as narrow as they are looking downward tions and random erasing of portions and filling it with on the screen. The integration of the EAR information random values. These techniques help reduce overfitting aims to specifically enhance the sensitivity of the model and improve the model’s ability to generalize from the to vertical gaze shifts, improving the performance of the training data to unseen data in real-world applications. model on the vertical axe prediction and better handling

These preprocessing steps, collectively, ensure that the cases in which the pupil is hardly observable by the simdata fed into the model is of high quality, consistent in ple raw image provided by the webcam. size and format, and varied enough to promote robust The integration of head orientation data, along with learning and prediction accuracy. the face grid, aims to provide the model with comprehensive information about the head’s spatial positioning, 3.4. Model without the necessity for computationally demanding external 3D models of the head or the eyes. Leverages In this section will be presented the model, the architec- the advantages of model-based methods while avoiding ture, and the training. The object was the realization of their drawbacks. an eficient model able to provide good performances and The model’s architecture is organized in two distinct run in real-time on a real-world application. The core of semantic pathways for the eyes and face, each consisting the implementation involved developing and training a of several convolutional layers followed by pooling layers, convolutional neural network (CNN) to predict the gaze these layers are designed to capture fine-grained details point based on the processed input features. necessary for accurate gaze estimation. The eye pathway processes separately the eye images with convolutional layers with shared parameters between the right and left eye, then the information is integrated with the EAR of both eyes with a fully connected. The face pathway processes the entire face region through a similar series of convolutional layers, then combines this information with the face grid, and roll pitch yaw angles.

3.5. Loss Function The choice of an appropriate loss function is extremely

important for the efectiveness of model training. During development, two primary loss functions were evaluated: Figure 5: Model Training Plot: The mean absolute error (MAE) Mean Squared Error (MSE) and Huber Loss. for pixel coordinates (x,y) is illustrated, with training data in

Huber Loss was used to mitigate the outlier sensitivity blue and validation data in orange. The green and red lines issue with MSE, and the large scale of pixel predictions. It represent the MAE for the x and y coordinates in the training combines the best properties of MSE and Mean Absolute data, while purple and brown depict these in the validation Error (MAE), behaving like MSE for small errors and like data. Notably, the MAE for the x coordinates is consistently MAE for large errors, reducing the influence of outliers higher than for the y coordinates, likely due to the larger pixel on the model’s training. The Huber Loss is defined as: scale on the laptop screen.

1 ( − )̂ 2 { 2 (| − | ̂− for | − | ̂ ≤ otherwise ( , )̂ =

12 ) toafinthinegucsoernsloisotkenincgy awtitthhet hlaepdtoaptasscert.eeTnhe1s5e-ifnrcahm, emsaairneWhere is a threshold parameter that dictates the transi- captured at a standard video frame rate of 30 frames per tion point between the squared loss and the absolute loss. second, usually provided by commonly available webThis property makes Huber Loss particularly promising cams, which balances between providing smooth video for this application, as it balances the need for robust- and the computational load on the system. Each captured ness with the sensitivity to small errors, critical for the frame undergoes a series of preprocessing steps like in precise prediction of gaze points. As shown in 5, Huber the training phase to maintain consistent data and to Loss provided a significant improvement in model con- enhance the model’s performance. vergence and performance compared to MSE, leading to more stable training and reduced gradient accumulation 3.6.2. Calibration

The application allows to perform an optional calibration

3.5.1. Regularization step to improve the results on the actual user of the eyeTo further increase the robustness of the architecture, tracking. To perform the calibration, the system proceeds were leveraged some regularization techniques. Together to show 9 points on the screen, in the 9 main representawith the already cited data augmentation, working on tive points, for each collects the prediction provided by the data, on the model side leveraged the dropout, with the model and compares it with the actual ground truth. a hyperparameter tuning which led to a successful value Then leverage the diference between the two values to of 0.2. The training loop then incorporated a learning improve further predictions of the actual user. rate scheduler together with an early stopping. 3.6.3. Feature Extraction

3.6. Real-Time Tracking The system ofers flexibility in selecting the method for

The implementation of the real-time tracking functional- extracting eye patches from images, with options includity represents an essential step for practical applications. ing the dlib68 and the eye cascade approaches. Following The following section describes the system’s setup, the this it calculates the Eye Aspect Ratio and head angles. operational flow, and the technologies employed. Then the model uses this information to predict the gaze point in screen coordinates in real-time. This step is com3.6.1. System Setup putationally intensive and is optimized to run eficiently on standard consumer hardware without significant deFor the real-time application, the system uses standard lays. The predicted gaze point is immediately displayed laptop webcams, 1280 x 720p, to capture video frames on the user’s screen, providing real-time feedback.

4. Results

above the other, the main focus was paid to the vertical movement recognition of the eyes, a critical aspect in the The proposed eye-tracking system demonstrates signif- eye-tracking field. icant and promising improvements in gaze estimation The comparison of the model with the MPIIGaze, compared to existing methods, excelling not only in pre- ETHXGaze, and FAZE pointed out a series of considdiction accuracy but also in eficiency—an essential fac- erations about the performances of the models. The comtor for real-time applications. These gains are largely parison focused on the zone classification accuracy in a attributed to model optimizations that resulted in a more grid setup of the screen (Figure 6), lightweight design. Specifically, the system performed smoothly on a laptop GPU, achieving a frame rate of 4.1.1. Four Cell Grid task 50 frames per second under optimal visibility and environmental conditions. On a laptop CPU, the model Our model showed an overall accuracy of 88.5% with prealso maintained commendable performance, consistently cision of 0.887 and recall of 0.885, excelling particularly in delivering 30 frames per second without any drop in accu- the top left grid cell while showing weaker performance racy. This places the system on par with state-of-the-art in the bottom right grid cell. This shows a promising models but with fewer parameters, making it more efi- overall behavior, with room for improvement due to the cient. unbalanced result in the four cells. Interestingly, the

To validate the model’s efectiveness, a real-time appli- proposed architecture, compared to the other models, cation was developed. This application captures the live showed a slightly better understanding of the top left and video feed from the webcam, processes it through the top right cases, rather than the lower one. model to predict the user’s gaze point, and then displays the predicted point on the screen, providing immediate 4.1.2. Vertical Dual-Grid Task feedback. To further assess performance, the model was compared with several state-of-the-art architectures on a common task, where the screen was divided into cells to track accuracy. The system showed promising results in terms of both inference time for real-time deployment and prediction accuracy, performing competitively against the benchmark models.

The second task aimed to inspect the model’s capability of recognizing horizontal movement, and the model demonstrated a good 93% overall accuracy, with 0.935 precision and 0.912 recall, quite struggling with the right section. This result comes from the previous considerations on the 4-grid task, in which, the bottom-right case was shown to be responsible for a drop-down in the prediction performances, making sufering this lack also to this other task when needed to correctly identify the gaze-point into the right part of the screen.

To evaluate the eye-tracking recordings and benchmark

model performance, the real-time eye-tracking system was leveraged to perform a Fixation-Zone task [51]. 4.1.3. Horizontal Dual-Grid Task To maintain consistency, all the experiments were per- The third task focused on evaluating the model’s ability formed on a laptop with an incorporated camera and to identify vertical eye movement accurately. Here, the a 15-inch screen. The approach performs a zone-wise model achieved an accuracy of 91%, precision of 0.925, classification accuracy, aggregated over the participants, and recall of 0.899, in correctly recognizing the vertical where the users are instructed to fixate on specific regions grid cell observed. While it is apparent that other models of the screen, which turn green for a certain amount of experience a significant decline in performance transitime, free to move, as long as their gaze is constrained tioning from horizontal to vertical eye movement tasks, within the boundaries. The experiments instructed to the proposed model exhibited only a slight drop in perperform a total of 3 tasks, where each aimed to enforce formance. It still significantly outperformed the other and study the model performances to specific behavior models, especially in the top cell case. Unfortunately, and compare this information with other SOTA archi- the bottom cell exhibited a slightly lower precision of tectures like the MPIIGaze, ETHXGaze, and the FAZE. the model when the gaze point approached the screen’s In the first one, the screen was divided into 4 grid cells, center, leading to some misclassifications that slightly determining the overall behavior of the model, and ob- exceeded the bottom grid cell boundary and resulted in serving the general performances of eye-tracking all over errors. the screen. The second task has 2 grid cells that split Unfortunately, many misclassification cases were also vertically the screen on two sides, this allows to better linked to unfavorable user visibility or environmental focus on the architecture capability to recognize the hor- conditions. These factors made predictions more chalizontal movement of the gaze. In the last task, which lenging for the model, highlighting areas for improvedivided the screen into two grid cells horizontally, one ment and the potential to surpass existing architectures.

4.1. Comparison tasks 5. Conclusions

This work presented an end-to-end eye-tracking solution designed to be lightweight, utilizing only a standard webcam, while maintaining high accuracy and low resource requirements. The results indicate that the proposed system can be efectively applied in various real-world scenarios, achieving robust performance in both vertical and horizontal gaze detection. This versatility makes it a practical tool for studies in areas such as psychometrics and Human-Computer Interaction (HCI), especially those focused on gaze laterality and cognitive assessments for broad regions of the screen. Interestingly, the model demonstrated significant robustness in detecting vertical gaze movements, likely due to its high sensitivity to eye aperture ratio, making it particularly adept at distinguishing between upper and lower gaze positions. This capability was confirmed during task evaluations, where the system showed better precision in upper-screen positions compared to lower ones. Some imprecision was noted in central areas of the screen, particularly in distinguishing between center-up and center-low positions, likely due to the natural tendency for the eyes to be more open in upper gaze positions. Despite these challenges, the model maintained eficiency even on smaller laptop screens and at greater distances, contrasting with typical close-range setups required by mobile devices. Future work could focus on enhancing the system’s robustness under diverse lighting conditions and user poses by enriching the dataset with more varied samples and a wider range of user demographics. Increasing the number of ifxation points during data collection could also provide a more comprehensive understanding for the model, improving precision across all screen areas. Additionally, modifying the model to focus solely on the eye regions, rather than the entire face, could improve its performance in situations where face visibility is limited or when only one eye is visible. This refinement would not only make the model more eficient but also help it handle challenging conditions such as medical constraints or occlusions more efectively. In summary, the proposed system represents a significant advancement in making eye-tracking technology more accessible and practical for a wide range of everyday applications, reducing the need for expensive specialized hardware and ofering a versatile tool for research and clinical environments.

[1] M. M. Mariani , R.

Perez-Vega , J. Wirtz,

Ai in marketing, consumer research and psychology: A systematic literature review and research agenda , Psychology & Marketing 39 ( 2022 ) 755 - 776 .

[2]

Banerjee ,

Sindhu ,

Sindhu , et al., Exploring the intersections of ai (artificial intelligence) in psychology and astrology: a conceptual inquiry for human well-being , J Psychol Clin Psychiatry 15 ( 2024 ) 75 - 77 .

[3]

Napoli ,

Bonanno , G. Capizzi, An hybrid neurowavelet approach for long-term prediction of solar wind , in: Proceedings of the International Astronomical Union , volume 6 , 2010 , p. 153 - 155 . doi: 10 .1017/S174392131100679X.

[4]

Napoli ,

Pappalardo , E. Tramontana, An agent-driven semantical identifier using radial basis neural networks and reinforcement learning , in: CEUR Workshop Proceedings , volume 1260 , 2014 . URL: https://www.scopus.com/inward/ record.uri?eid= 2 - s2 . 0 - 84919742629 &partnerID= 40 &md5= c3ee8a3fa1716b39215326edfc67d955 .

[5]

Capizzi ,

G. Lo

Sciuto ,

Napoli , E. Tramontana, Advanced and adaptive dispatch for smart grids by means of predictive models , IEEE Transactions on Smart Grid 9 ( 2018 ) 6684 - 6691 . doi: 10 .1109/TSG. 2017 . 2718241 .

[6]

Clifton ,

Ferreira ,

J. M.

Henderson ,

A. W.

Inhof ,

S. P.

Liversedge ,

E. D.

Reichle ,

E. R.

Schotter , Eye movements in reading and information processing: Keith rayner's 40year legacy , Journal of Memory and Language 86 ( 2016 ) 1 - 19 . URL: https://www.sciencedirect.com/science/ article/pii/S0749596X15000960. doi:https://doi. org/10.1016/j.jml. 2015 . 07 .004.

[7]

Russo ,

Napoli , A comprehensive solution for formatics) , volume 14126 LNAI, 2023 , p. 3 - 16 . psychological treatment and therapeutic path plan- doi:10.1007/978- 3- 031 - 42508- 0_1. ning based on knowledge base and expertise shar- [18]

Jacob ,

Karn , Eye Tracking in Humaning, in: CEUR Workshop Proceedings , volume 2472 ,

Computer

Interaction and Usability Research: 2019 , p. 41 - 47 . Ready to Deliver the Promises , volume 2 , 2003 ,

[8]

Lo Sciuto ,

Russo ,

Napoli , A cloud-based pp. 573 - 605 . doi: 10 .1016/B978- 044451020 - 4/ lfexible solution for psychometric tests validation, 50031- 1. administration and evaluation , in: CEUR Workshop [19]

Brandizzi ,

Russo , G. Galati, C. Napoli, AddressProceedings, volume 2468 , 2019 , p. 16 - 21 . ing vehicle sharing through behavioral analysis: A

[9]

Falciglia ,

Betello ,

Russo ,

Napoli , Learn- solution to user clustering using recency-frequencying visual stimulus-evoked eeg manifold for neural monetary and vehicle relocation based on neighimage classification , Neurocomputing 588 ( 2024 ). borhood splits, Information (Switzerland) 13 ( 2022 ). doi: 10 .1016/j.neucom. 2024 . 127654 . doi: 10 .3390/info13110511.

[10]

Napoli ,

Pappalardo ,

Tramontana , A hy- [20]

Brociek ,

G. D.

Magistris ,

Cardia , F. Coppa, brid neuro-wavelet predictor for qos control and S. Russo, Contagion prevention of covid-19 by stability , in: Lecture Notes in Computer Sci- means of touch detection for retail stores , in: CEUR ence (including subseries Lecture Notes in Arti- Workshop Proceedings , volume 3092 , 2021 , p. 89 - ifcial Intelligence and Lecture Notes in Bioinfor- 94. matics) , volume 8249 LNAI, 2013 , p. 527 - 538 . [21]

Brandizzi ,

Russo ,

Brociek ,

Wajda , First doi: 10 .1007/978- 3- 319 - 03524- 6_ 45 . studies to apply the theory of mind theory to green

[11] G. De Magistris , S.

Russo , P. Roma, J. T.

Starczewski , and smart mobility by using gaussian area clusterC. Napoli, An explainable fake news detector based ing , in: CEUR Workshop Proceedings , volume 3118 , on named entity recognition and stance classifica- 2021 , p. 71 - 76 . tion applied to covid-19, Information (Switzerland) [22] V.

Ponzi , S.

Russo , V.

Bianco , C.

Napoli , A. Wa13 ( 2022 ). doi: 10 .3390/info13030137. jda, Psychoeducative social robots for an healthier

[12]

P. S.

Holzman ,

L. R.

Proctor ,

D. L.

Levy ,

N. J.

Yasillo , lifestyle using artificial intelligence: a case-study,

H. Y. Meltzer , S. W.

Hurt , Eye-tracking dysfunc- in : CEUR Workshop Proceedings , volume 3118 , tions in schizophrenic patients and their relatives , 2021 , p. 26 - 33 . Archives of general psychiatry 31 ( 1974 ) 143 - 151 . [23]

N. N.

Dat ,

Ponzi ,

Russo ,

Vincelli , Supporting

[13]

S. I.

Illari ,

Russo ,

Avanzato ,

Napoli , A cloud- impaired people with a following robotic assistant oriented architecture for the remote assessment by means of end-to-end visual target navigation and follow-up of hospitalized patients, in: CEUR and reinforcement learning approaches , in: CEUR Workshop Proceedings , volume 2694 , 2020 , p. 29 - Workshop

Proceedings

, volume 3118 , 2021 , p. 51 - 35 . 63 .

[14]

Napoli ,

Ponzi ,

Puglisi , S. Russo, [24]

Capizzi ,

Napoli ,

Russo , M. Woźniak, LessenI. E. Tibermacine, Exploiting robots as healthcare ing stress and anxiety-related behaviors by means resources for epidemics management and support of ai-driven drones for aromatherapy, in: CEUR caregivers , in: CEUR Workshop Proceedings , vol- Workshop Proceedings, volume 2594 , 2020 , p. 7 - ume 3686, 2024 , p. 1 - 10 . 12 .

[15]

Russo , S. I. Illari ,

Avanzato ,

Napoli , Reduc- [25]

Whitmire ,

Trutoiu ,

Cavin ,

Perek , B.

Scally, ing the psychological burden of isolated oncologi- J.

Phillips , S.

Patel , Eyecontact: scleral coil eye cal patients by means of decision trees, in: CEUR tracking for virtual reality , 2016 , pp. 184 - 191 . Workshop Proceedings, volume 2768 , 2020 , p. 46 - doi :10.1145/2971763.2971771. 53. [26]

Tian ,

Cao , Fatigue driving detection based on

[16]

Singh ,

J. S.

Bhatia ,

Kaur , Eye tracking based electrooculography: a review, EURASIP Journal on driver fatigue monitoring and warning system , in: Image and Video Processing 2021 ( 2021 ) 33 . India International Conference on Power Electron- [27]

Russo ,

I. E.

Tibermacine , A . Tibermacine, ics 2010 ( IICPE2010 ), 2011 , pp. 1 - 6 . doi: 10 .1109/ D. Chebana,

Nahili ,

Starczewscki ,

Napoli , IICPE. 2011 . 5728062 . Analyzing eeg patterns in young adults exposed to

[17]

Alfarano , G. De Magistris,

Mongelli , S. Russo, diferent acrophobia levels: a vr study, Frontiers J . Starczewski , C. Napoli , A novel convmixer in Human Neuroscience 18 ( 2024 ). doi: 10 . 3389/ transformer based architecture for violent behav- fnhum . 2024 . 1348154 . ior detection, in: Lecture Notes in Computer [28]

Boutarfaia ,

Russo ,

Tibermacine , I. E. TiberScience (including subseries Lecture Notes in Ar- macine, Deep learning for eeg-based motor imagery tificial Intelligence and Lecture Notes in Bioin- classification: Towards enhanced human-machine interaction and assistive robotics, in: CEUR Work- gaze tracking by combining eye-and facial-gaze shop Proceedings , volume 3695 , 2023 , p. 68 - 74 . vectors, The Journal of Supercomputing 73 ( 2017 )

[29]

M. W.

Johns ,

Tucker ,

Chapman , K. Crow- 3038 -3052. ley, N. Michael, Monitoring eye and eyelid move- [42]

Xiong ,

Liu ,

Cai , Z. Zhang, Eye gaze trackments by infrared reflectance oculography to mea- ing using an rgbd camera: a comparison with a sure drowsiness in drivers , Somnologie 11 ( 2007 ) rgb solution , in: Proceedings of the 2014 ACM 234-242 . International Joint Conference on Pervasive and

[30]

D. Y.

Lee ,

Shin ,

R. W.

Park , S. -M. Cho , S. Han, Ubiquitous Computing: Adjunct Publication, 2014 ,

Yoon ,

Choo ,

J. M.

Shim ,

Kim ,

S.-W.

Jeon , et al., pp. 1113 - 1121 . Use of eye tracking to improve the identification of [43]

Wang ,

Chai ,

Xia , Realtime and accurate 3d attention-deficit/hyperactivity disorder in children, eye gaze capture with dcnn-based iris and pupil Scientific Reports 13 ( 2023 ) 14469 . segmentation, IEEE transactions on visualization

[31]

Caldani ,

Acquaviva ,

Moscoso ,

Peyre , and computer graphics 27 ( 2019 ) 190 - 203 . R. Delorme,

M. P.

Bucci , Reading performance in [44]

I. E.

Tibermacine ,

Tibermacine , W. Guettala, children with adhd: An eye-tracking study , Annals

Napoli , S.

Russo , Enhancing sentiment analof Dyslexia 72 ( 2022 ) 552 - 565 . ysis on seed-iv dataset with vision transformers:

[32]

Ponzi ,

Russo ,

Wajda ,

Napoli , A com- A comparative study , in: ACM International parative study of machine learning approaches for Conference Proceeding Series , 2023 , p. 238 - 246 . autism detection in children from imaging data , in: doi:10.1145/3638985.3639024. CEUR Workshop Proceedings , volume 3398 , 2022 , [45]

Zhang ,

Sugano ,

Fritz ,

Bulling , Mpiigaze: p. 9 - 15 . Real-world dataset and deep appearance-based gaze

[33]

Nerušil ,

Polec ,

Škunda ,

Kačur , Eye tracking estimation, IEEE transactions on pattern analysis based dyslexia detection using a holistic approach , and machine intelligence 41 ( 2017 ) 162 - 175 . Scientific Reports 11 ( 2021 ) 15687 . [46]

Zhang , S. Park,

Beeler ,

Bradley , S. Tang,

[34]

Ji , D.

Hansen, In the eye of the beholder: A sur-

O. Hilliges , Eth-xgaze: A large scale dataset for vey of models for eyes and gaze, IEEE Transactions gaze estimation under extreme head pose and gaze on Pattern Analysis & Machine Intelligence 32 variation, in: Computer Vision-ECCV 2020 : 16th ( 2010 ) 478 - 500 . doi: 10 .1109/TPAMI. 2009 . 30 . European Conference, Glasgow, UK, August 23-28,

[35]

Fiani ,

Russo ,

Napoli , An advanced solu- 2020 , Proceedings, Part V 16, Springer, 2020 , pp. tion based on machine learning for remote emdr 365-381. therapy, Technologies 11 ( 2023 ). doi: 10 .3390/ [47]

Park ,

S. D.

Mello ,

Molchanov , U. Iqbal, technologies11060172 .

Hilliges ,

Kautz , Few-shot adaptive gaze es-

[36]

Iacobelli ,

Russo , C. Napoli, A machine learning timation, in: Proceedings of the IEEE/CVF interbased real-time application for engagement detec - national conference on computer vision , 2019 , pp. tion , in: CEUR Workshop Proceedings , volume 9368 - 9377 . 3695, 2023 , p. 75 - 84 . [48]

Krafka ,

Khosla ,

Kellnhofer , H. Kannan,

[37]

Fiani ,

Russo ,

Napoli , A fully automatic visual S. Bhandarkar ,

Matusik ,

Torralba , Eye trackattention estimation support system for a safer driv- ing for everyone , 2016 . arXiv: 1606 .05814. ing experience, in: CEUR Workshop Proceedings, [49]

Meng ,

Zhao , Webcam-based eye movevolume 3695 , 2023 , p. 40 - 50 . ment analysis using cnn , IEEE Access 5 ( 2017 )

[38]

Iacobelli ,

Ponzi ,

Russo ,

Napoli , Eye- 19581 -19587. tracking system with low-end hardware : Devel- [50] Y.-m. Cheung , Q. Peng , Eye gaze tracking with opment and evaluation, Information (Switzerland) a web camera in a desktop environment , IEEE 14 ( 2023 ). doi:10.3390/info14120644. Transactions on Human-Machine Systems 45 ( 2015 )

[39]

Pepe ,

Tedeschi ,

Brandizzi ,

Russo , L. Ioc- 419 -430. chi, C. Napoli, Human attention assessment us- [51]

Saxena ,

L. K.

Fink ,

E. B.

Lange , Deep ing a machine learning approach with gan-based learning models for webcam eye trackdata augmentation technique trained using a cus- ing in online experiments, Behavior Retom dataset , OBM Neurobiology 6 ( 2022 ). doi:10. search Methods 56 ( 2024 ) 3487 - 3503 . URL: 21926/obm.neurobiol.2204139. https://doi.org/10.3758/s13428-023-02190-6.

[40]

Fiani ,

Ponzi ,

Russo , Keeping eyes on the doi:10.3758/s13428-023-02190-6 . road: Understanding driver attention and its role in safe driving , in: CEUR Workshop Proceedings , volume 3695 , 2023 , p. 85 - 95 .

[41]

B. C.

Kim ,

Ko , U. Jang, H. Han,

E. C.

Lee , 3d