                                A Fully Automatic Visual Attention Estimation Support
                                System for A Safer Driving Experience
                                Francesca Fiani1 , Samuele Russo2 and Christian Napoli1,3,4
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
2 Department of Psychology, Sapienza University of Rome, 00185 Roma, Italy
3 Institute for Systems Analysis and Computer Science, Italian National Research Council, 00185 Roma, Italy
4 Department of Computational Intelligence, Czestochowa University of Technology, 42-201 Czestochowa, Poland


                                               Abstract
                                               Drivers’ attention is a key element in safe driving and in avoiding possible accidents. In this paper, we present a new approach
                                               to the task of Visual Attention Estimation in drivers. The model we introduce consists of two branches, one which performs
                                               Gaze Point Detection to determine the exact point of focus of the driver, and the other which executes Object Detection
                                               to recognize all relevant elements on the road (e.g. vehicles, pedestrians, and traffic signs). Combining the outputs
                                               of the two branches allows us to determine whether the driver is attentive and, if so, on which element of the road
                                               they are focusing. Two models are tested for the gaze detection task: the GazeCNN model and a CNN+Transformer
                                               model. The performance of both models is evaluated and compared with other state-of-the-art models to
                                               choose the best approach for the task. Finally, the results of the Visual Attention Estimation performed on 3761 pairs of
                                               images (driver view and corresponding road view) from the DGAZE dataset are reported and analyzed.

                                               Keywords
                                               Visual Attention Estimation, ADAS (Autonomous Driver Assistance Systems), GazeCNN, Visual Transformers, DGAZE



SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
fiani@diag.uniroma1.it (F. Fiani); samuele.russo@uniroma1.it (S. Russo); cnapoli@diag.uniroma1.it (C. Napoli)
0009-0005-0396-7019 (F. Fiani); 0000-0002-1846-9996 (S. Russo); 0000-0002-9421-8566 (C. Napoli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Attention while driving is a key element of road safety, keeping passengers, drivers and pedestrians safe. Distractions caused by secondary tasks have been shown to be the main factor in slowed responses to immediately dangerous situations [1], with 80% of reported crashes and 65% of near-crashes across 100 analyzed vehicles caused by unsafe driving behaviors such as inattention [2]. Moreover, the probability of collisions caused by driver distraction is significantly reduced when passengers warn the driver about unseen hazards [3, 4]. This shows the importance of developing increasingly efficient Advanced Driver Assistance Systems (ADAS), especially ones based on artificial intelligence algorithms capable of understanding whether a driver is distracted from the road and alerting them. The identification of drivers' points of focus can also be used to train autonomous driving algorithms to pay more attention to some elements rather than others, thus making them more capable of safe driving. Machine learning and distributed computing approaches, e.g. cloud computing, have become a cornerstone of modern data technology, playing a pivotal role in various sectors [5, 6]. In the green economy, machine learning algorithms help optimize energy consumption and reduce carbon footprint by predicting demand and managing supply efficiently. In the field of renewable energies, these algorithms aid in forecasting energy production from sources like wind and solar, thereby facilitating effective grid management. In the field of human-computer interaction, machine learning enhances user experience by enabling systems to understand and respond to human behavior in a more intuitive and personalized manner [7, 8, 9, 10, 11]. Lastly, in the automobile industry, machine learning is driving the revolution of autonomous vehicles and smart traffic management systems, contributing to safer and more efficient transportation [12].
   The goal of this paper is to introduce a new approach to visual attention estimation for safe driving. To the best of our knowledge, most studies on driver attention are based either on the evaluation of driver behavior, without considering the environment surrounding the car, or exclusively on the road, training models to identify the elements to focus on. Our approach, in contrast, entails a comprehensive consideration of both the driver and the road views. Specifically, we assess the point of focus of the driver, contextually understanding whether they are paying attention to the road and, if so, which element of the road has captured their focus. To do this, we divide our task into two parts:

   • Gaze point detection: we identify the point the driver is looking at to assess where the driver is paying attention;

   • Object identification: we identify the main objects on the road, namely pedestrians, motorbikes, traffic signs, traffic lights, other cars, and trucks.







Figure 1: Example of paired images from the DGAZE dataset. (a) Driver view of driver number 22. (b) Sample 15 road view of driver number 22.

   For the first task we employ the GazeCNN model, a variant of a ResNet [13] that takes various facial features as input, such as the nose and left pupil positions, the head pose and the eye corners. In addition, to perform a comparative analysis between two different methods, we also consider a ResNet+Transformer model [14] fine-tuned to output the exact position the driver is looking at. For the second task, instead, we use a fine-tuned YOLOv8 model, part of the YOLO family of object detection algorithms [15], configured to consider only the classes of interest. To accomplish our task, we used the DGAZE dataset [16], which to the best of our knowledge is one of the few datasets that provide both the driver's view and the road view. The data was collected in a controlled laboratory setting where 112 street videos were projected in front of 20 'drivers', who were told to focus on a designated point annotated in the projected video. The dataset contains over 180,000 pairs of images, where each pair includes a road view and the corresponding driver view, plus a label indicating the coordinates of the point the driver was instructed to focus on (specifically, the center of the bounding box of the object). An example from this dataset is reported in Figure 1.
   The paper is organized as follows. In Section 2, related works on gaze detection and driver gaze prediction are reviewed to frame our work within the current state of the art. Section 3 describes the data analysis, the pre-processing and feature extraction performed on the DGAZE dataset, and the proposed architecture. Section 4 reports the experiments and the corresponding results. Finally, Section 5 presents the study's conclusions.


2. Related Works

2.1. Gaze Detection

Gaze detection is a highly significant topic in the fields of Computer Vision and Human-Robot Interaction. Despite various advancements over time, it remains a challenging task due to aspects such as the uniqueness of faces and eyes, potential occlusions, differences in lighting, image quality, etc. Throughout the literature, various methods have been employed, ranging from simple classification methods, like Random Forest [17] and SVM [18], to deep neural network models. The use of deep CNNs has greatly enhanced the accuracy of this task, with strong results obtained even on in-the-wild datasets [19]. While the majority of works use only the eyes to perform gaze estimation, other works use facial features other than the eyes, such as facial grids [20] or a combination of the eye images and the head pose [21].
   Transformers are also a viable novel solution, with two types of transformers derived from the Vision Transformer (ViT) framework finding success [14]. The first one, denoted as GazeTR-Pure, processes the cropped face as input, divides it into patches and passes them to a transformer encoder that returns the gaze direction. In contrast, GazeTR-Hybrid adopts a hybrid approach, combining Convolutional Neural Networks (CNNs) with transformers: the CNN extracts local feature maps from the face, which are then passed to the transformer encoder to capture the global relationships between the maps and finally obtain the desired output. These models take advantage of the transformer's attention mechanism to improve performance, with GazeTR-Hybrid obtaining results comparable to the state of the art. As previously mentioned, GazeTR-Hybrid will be the base for one of our two approaches.

2.2. Driver Gaze Prediction

The driver gaze prediction task is approached in two ways in the literature. The first approach focuses only on the interior images of the car (the driver's view) [22, 23, 24, 25]. Generally, the car is divided into different zones, such as the windscreen, the speedometer, the two side-view mirrors, the rear-view mirror, and so on. The algorithms try to predict which of these areas the driver is looking at by analyzing the images of the driver.
   The other approach, instead, focuses only on the outside of the car. Many papers analyze images of the road recorded from inside the car through the windscreen to calculate an attention map, i.e. a heat map where brighter colors indicate the elements on which drivers focus most while driving [26, 27, 28, 29]. Attention maps are extremely significant for autonomous driving, since they may be useful in training models that can understand, in a given driving situation, which of the many important elements of the road to focus on the most.
   For what concerns the DGAZE dataset, already described in the introduction, a related model called I-DGAZE has also been developed [16]. The model consists of two branches.







Figure 2: (a) Cropped driver 22 view subjected to K-Means clustering. (b) Corresponding color distribution histogram after K-Means clustering. The x axis represents the bin number, while the y axis reports the number of pixel occurrences per bin.

The first is composed of a CNN with the addition of a final flattened layer, which takes the driver's left eye as input. The other is composed of only dense layers and takes as input various features of the face, namely its pose, location, and area. The features generated by the two branches are then merged and passed to a fully connected layer that uses them to determine the coordinates of the gaze point (x, y).
   Building on the literature just presented, our work is quite innovative in using an approach that is not widely adopted for the identification of drivers' attention while driving. It also compares two models for gaze detection, as mentioned above, combining their results with those of YOLO so as to output whether or not the driver is paying attention to the road and, in particular, to which element.


3. Materials and Methods

3.1. Data Analysis

Due to various challenges in gaze detection (e.g. eye-head interplay, illumination, eye registration errors, occlusions, difficulties in generalizing eye region appearance) [30], before proceeding with the implementation we conducted a thorough analysis of the color distribution of our data, examining it in both the RGB and HSV color spaces. Driver-view images were cropped to a 700 x 700 format starting from the pixel coordinates (x, y) = (25, 100). This pre-processing step, consistently applied throughout our work, was designed with the specific purpose of eliminating non-essential areas within the image, focusing only on the face region.
   We then computed histograms in the RGB and HSV color spaces for a randomly selected sample from each driver's image set. The K-Means algorithm was employed to cluster all the colors into 16 clusters, with the resulting histogram shown in Figure 2.
   The corresponding 1D RGB graphs are presented in Figure 3, while the flattened 3D RGB graph is shown in Figure 4. Both graphs have been normalized to facilitate comparison. Finally, we selected three distance metrics to conduct a dataset-wide comparison between the histograms and computed the corresponding matrices:

   • Wasserstein (Earth Mover's) Distance:
     $W(p, q) = \inf_{\gamma \in \Pi(p, q)} \int_{\mathbb{R} \times \mathbb{R}} \|x - y\| \, d\gamma(x, y) \qquad (1)$
     where $p$ and $q$ are two probability distributions and $\Pi(p, q)$ denotes the set of all joint probability distributions on $\mathbb{R} \times \mathbb{R}$ whose marginals are $p$ and $q$. This metric is symmetric.

   • Chi-Squared Distance:
     $\chi^2(p, q) = \sum_i \frac{(p(i) - q(i))^2}{p(i)} \qquad (2)$
     where $p$ and $q$ are two probability distributions.

   • Kullback-Leibler Divergence:
     $D_{KL}(p \,\|\, q) = \sum_i p(i) \log\left(\frac{p(i)}{q(i)}\right) \qquad (3)$
     where $p$ and $q$ are two probability distributions on the same sample space.

   The obtained matrices for the 3D RGB graphs are reported in Figures 5, 6 and 7. Our data analysis indicates that there are no significant differences in color distribution among the various driver images, with the exception of certain drivers, such as Drivers 13 and 5, which show consistently high values across all metrics. Conversely, Drivers 2, 22, and 23 occasionally exhibit increased differences, but not consistently across all plots. The three metrics have also been calculated for the 1D channels and averaged, producing similar results; they are therefore not reported. The same process has also been repeated for the HSV color space, so the 1D and flattened 3D graphs have been computed, together with an additional 2D heat map of the 3D graph, and the nine distance matrices have been computed. Given the use of the RGB space during the experiments and the absence of significant differences in the HSV analysis, we skip the presentation of these results.
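As a concrete reference for the analysis above, the following minimal Python sketch crops a driver-view frame to 700 x 700 pixels starting at (x, y) = (25, 100), builds a normalized 16-cluster K-Means color histogram, and compares two such histograms with the three metrics of Eqs. (1)-(3). The file names, the number of K-Means restarts and the use of OpenCV/scikit-learn/SciPy are assumptions of this sketch, not details taken from the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import wasserstein_distance
from scipy.special import rel_entr

def color_histogram(image_path, n_clusters=16):
    """Crop the driver view and build a normalized K-Means color histogram."""
    img = cv2.imread(image_path)                       # BGR image
    crop = img[100:800, 25:725]                        # 700x700 crop starting at (x=25, y=100)
    pixels = crop.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=n_clusters).astype(np.float64)
    return counts / counts.sum()                       # normalized histogram

def wasserstein(p, q):
    bins = np.arange(len(p))
    return wasserstein_distance(bins, bins, p, q)      # Eq. (1), 1D formulation over bins

def chi_squared(p, q, eps=1e-12):
    return float(np.sum((p - q) ** 2 / (p + eps)))     # Eq. (2)

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(rel_entr(p + eps, q + eps)))   # Eq. (3)

if __name__ == "__main__":
    # Hypothetical sample frames for two drivers.
    p = color_histogram("driver_05_sample.png")
    q = color_histogram("driver_22_sample.png")
    print(wasserstein(p, q), chi_squared(p, q), kl_divergence(p, q))
```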







Figure 3: Graphs of the red, green and blue channel bin frequency distribution. Each channel has 16 bins (represented on the x axis), with the frequency of each bin represented on the y axis. The frequency distribution has been normalized.

Figure 4: Graph of the flattened bin frequency distribution. 64 bins have been considered for the flattened 3D representation (represented on the x axis), with the frequency of each bin represented on the y axis. The frequency distribution has been normalized.

Figure 5: Wasserstein (Earth Mover's) distance matrix between all pairs of sample drivers' 3D color distributions. A high value indicates a large color space distance between images. Only the upper triangular matrix is reported given the symmetry of the matrix.


3.2. Architecture

As mentioned, our idea is to divide the model into two branches. The first branch predicts the exact coordinates (x, y) of the driver's focus point from the input driver view. The decision to predict the exact point of focus of the driver stems from the desire to achieve greater accuracy in estimating visual attention. This way, we are able to distinguish precisely which element of the road the driver is paying more attention to, even when elements overlap. To the best of our knowledge, this is a situation that is not very common in the literature and could be an important innovation towards increasingly accurate results in Visual Attention Estimation. The lengths of the road-view and driver videos were manually aligned, since some road-view videos (which are common to all drivers) were mismatched in the number of frames with respect to the driver videos. The input is then processed to extract key components of the face, i.e. the driver's face, the left eye, the pupil position, the nose position, the head pose and the eye corners. A combination of SOTA tools for analyzing facial features was used: a shape predictor, obtained from dlib [31], for the extraction of the eyes and the positions of the nose and pupils, a frontal face detector, also from dlib, for the extraction of the face, and SixDRepNet [32] for the extraction of the head pose.
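A minimal sketch of this feature-extraction step, assuming dlib's standard frontal face detector and 68-landmark shape predictor; the landmark indices used for the left eye, the pupil approximation and the head-pose placeholder (SixDRepNet in the paper) are assumptions of this sketch.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumes the publicly available 68-landmark model distributed with dlib examples.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_features(frame_bgr):
    """Return the face box, a left-eye crop and a few landmark-based features."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    face = faces[0]
    shape = predictor(gray, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])

    left_eye = pts[36:42]                    # eye landmarks (image-left eye)
    nose_tip = pts[30]
    eye_corners = pts[[36, 39, 42, 45]]      # outer/inner corners of both eyes
    pupil_proxy = left_eye.mean(axis=0)      # rough pupil position from eye landmarks

    # Eye crop resized to the 32x64 input expected by the gaze branch.
    x0, y0 = left_eye.min(axis=0) - 5
    x1, y1 = left_eye.max(axis=0) + 5
    eye_crop = cv2.resize(frame_bgr[y0:y1, x0:x1], (64, 32))

    # Head pose (yaw, pitch, roll) comes from SixDRepNet in the paper;
    # here it is left as a placeholder value.
    head_pose = np.zeros(3)

    return {"face": face, "eye_crop": eye_crop, "pupil": pupil_proxy,
            "nose": nose_tip, "eye_corners": eye_corners, "head_pose": head_pose}
```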








Figure 6: Chi-Squared distance matrix between all pairs of sample drivers' 3D color distributions. A high value indicates a large color space distance between images.

Figure 7: Kullback-Leibler divergence matrix between all pairs of sample drivers' 3D color distributions. A high value indicates a large color space distance between images.

Figure 8: Schematic model of the GazeCNN architecture.

Table 1
Structure of the GazeCNN architecture: layers, kernel sizes and output channels of the eye feature branch, the feature branch and the fused branch.

                    Eye Feature Branch
        Layer            Kernel      Output Channels
     Conv2D_1             3x3                8
     Conv2D_2             3x3                8
     MaxPool2d_1          4x4                8
     Dropout                                 8
     Conv2D_3             3x3                4
     MaxPool2d_2          4x4                4
     Flatten_1                              336
                      Feature Branch
        Layer            Kernel      Output Channels
     Dense_1                                16
                       Fused Branch
        Layer            Kernel      Output Channels
     Merge_1                                352
     Dense_2                                64
     Dense_3                                 2

   Two types of models are considered for this branch and compared to evaluate which performs best. The first model is GazeCNN, a variation of I-DGAZE in model and layer sizes. The model, shown in Figure 8, is composed of two branches which extract the features used as inputs for the final fully connected layer. The first branch takes the cropped 3 x 32 x 64 left eye image as input, which is passed through three convolutional layers. The second and third convolutional layers are each followed by a max-pooling layer, while the second convolutional block has an additional residual connection compared to the original architecture. The resulting output is then flattened into a 336-dimensional feature vector. The other branch, instead, takes a series of facial features as input. We examined two scenarios to assess the actual influence of these features on the final outcome: in one case we used a 7-dimensional face feature vector comprising the head pose and the positions of the two pupils, while in the other we also added the positions of the nose and eye corners. The performance in both scenarios will be discussed in the following section. This branch is composed of a single fully connected layer with output size 16. The two feature vectors output by the branches are then merged into a 352-dimensional vector, which is passed through two fully connected layers that output the final (x, y) coordinate vector of the driver's focus point. The whole structure is summarized in Table 1.
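A minimal PyTorch sketch of a GazeCNN-style network following the layer sizes of Table 1. The padding and pooling strides below are assumptions chosen so that a 3 x 32 x 64 eye crop yields the 336-dimensional vector reported in the table; the dropout rate, activations and the exact placement of the residual connection are also assumptions.

```python
import torch
import torch.nn as nn

class GazeCNN(nn.Module):
    """GazeCNN-style gaze-point regressor following the layer sizes of Table 1."""

    def __init__(self, n_face_features: int = 7):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())  # second conv block
        self.pool1 = nn.MaxPool2d(kernel_size=4, stride=2)
        self.drop = nn.Dropout(p=0.3)
        self.conv3 = nn.Sequential(nn.Conv2d(8, 4, 3, padding=1), nn.ReLU())
        self.pool2 = nn.MaxPool2d(kernel_size=4, stride=2)
        self.feat_fc = nn.Linear(n_face_features, 16)                  # feature branch (Dense_1)
        self.head = nn.Sequential(nn.Linear(336 + 16, 64), nn.ReLU(),  # fused branch
                                  nn.Linear(64, 2))

    def forward(self, eye: torch.Tensor, face_feats: torch.Tensor) -> torch.Tensor:
        x = self.conv1(eye)
        x = x + self.conv2(x)            # residual connection on the second conv block
        x = self.drop(self.pool1(x))
        x = self.pool2(self.conv3(x))
        x = torch.flatten(x, 1)          # -> (batch, 336) for a 3x32x64 input
        f = torch.relu(self.feat_fc(face_feats))
        return self.head(torch.cat([x, f], dim=1))  # (x, y) gaze point

# Example: batch of 4 eye crops (3x32x64) and 7-dimensional feature vectors.
if __name__ == "__main__":
    model = GazeCNN(n_face_features=7)
    out = model(torch.randn(4, 3, 32, 64), torch.randn(4, 7))
    print(out.shape)  # torch.Size([4, 2])
```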







Figure 9: Schematic model of the GazeTR-Hybrid architecture.

   The second model is GazeTR-Hybrid, composed of a ResNet which extracts local feature maps and a Visual Transformer which computes the global relationships between the feature maps and generates the gaze point. Our aim was to assess the performance of a transformer model in a domain where it is not commonly employed and to verify the applicability of GazeTR-Hybrid to a task different from the original one (i.e. computing a focus point instead of a gaze direction). The original model was used with its pre-trained weights, but we performed fine-tuning to adapt it for a direct comparison with GazeCNN. The structure of GazeTR-Hybrid, shown in Figure 9, is composed of various convolutional layers forming a ResNet-18 block, which generates 7 x 7 x 512 feature maps from face images. The block is followed by an additional 1 x 1 convolutional layer that adjusts the channel scale to obtain 7 x 7 x 32 feature maps. The transformer block, instead, consists of six Transformer Encoder Layers which perform an 8-head self-attention mechanism, followed by a two-layer MLP with hidden size 512 and dropout 0.1. The transformer is also equipped with a linear feedforward layer which produces the 2-dimensional output of the driver's gaze point.
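A minimal PyTorch sketch of a GazeTR-Hybrid-style regressor as described above (ResNet-18 features, a 1 x 1 convolution to 32 channels, six transformer encoder layers with 8 heads, MLP size 512 and dropout 0.1, and a linear layer producing the 2D gaze point). The torchvision backbone, the learnable pooling token and the positional embedding are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class GazeTRHybridSketch(nn.Module):
    """CNN + Transformer gaze-point regressor in the spirit of GazeTR-Hybrid."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to the last residual stage: 512 x 7 x 7 maps for 224x224 faces.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.reduce = nn.Conv2d(512, 32, kernel_size=1)           # 7x7x512 -> 7x7x32
        layer = nn.TransformerEncoderLayer(d_model=32, nhead=8, dim_feedforward=512,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.token = nn.Parameter(torch.zeros(1, 1, 32))           # learnable pooling token
        self.pos = nn.Parameter(torch.zeros(1, 50, 32))            # 49 patches + token
        self.fc = nn.Linear(32, 2)                                 # (x, y) gaze point

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        maps = self.reduce(self.cnn(face))                         # (B, 32, 7, 7)
        seq = maps.flatten(2).transpose(1, 2)                      # (B, 49, 32)
        tok = self.token.expand(seq.size(0), -1, -1)
        seq = torch.cat([tok, seq], dim=1) + self.pos
        out = self.encoder(seq)
        return self.fc(out[:, 0])                                  # regress from the token

if __name__ == "__main__":
    model = GazeTRHybridSketch()
    print(model(torch.randn(2, 3, 224, 224)).shape)                # torch.Size([2, 2])
```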
   The second branch performs object detection: the various 'road view' images are given as input to the model to recognize the most relevant elements in each of them. This is instrumental in identifying the most important objects on the road, those to which the driver should pay the most attention. For this purpose, we used a pre-trained YOLOv8 model, which was then fine-tuned on the elements we were most interested in. This way, our fine-tuned YOLO model is able to identify only the road elements of interest while excluding irrelevant ones. For our task, we combined a dataset of road signs from the RF100 initiative [33] with one created by us using images from the COCO dataset [34]. The images from COCO were carefully chosen to exclusively include pictures containing people, cars, motorcycles, and trucks. This was done to prevent our fine-tuned YOLO model from forgetting these classes, which are crucial for our task. The other dataset, instead, contains various classes of road signs that were helpful for training YOLO to identify these road elements, which are the ones every driver should pay attention to. In total, we used 3589 images, divided into 2480 for the training set and 1109 for the validation set.
   We fine-tuned the pre-trained YOLO model on this dataset for 40 epochs, resulting in a precision of 83.61%, a recall of 73.99%, and a mAP50 of 79.27%. We report the confusion matrix in Figure 10. We can see that our YOLO model performs quite well on almost all the new classes of road signs, while its performance is lower in identifying cars, people, trucks and motorcycles. This could be attributed to the fact that the images from the road signs dataset contain only one element of the considered class, leading to higher precision, whereas the photos from COCO contain various elements of different classes in each image. This might make it harder for our fine-tuned model to learn from images rich in different elements, resulting in poorer performance on those classes. We also see a particularly low precision for the traffic light class, probably influenced by the lower number of samples in our dataset. Despite this, for use in our Visual Attention Estimation model, the achieved results can be considered acceptable.

Figure 10: Confusion matrix of the YOLO fine-tuned model.

   The outputs of the two branches are finally combined to determine the final output of the model. If the driver's point of gaze falls within one of the bounding boxes of the road elements identified by YOLOv8, we can assert with confidence that the driver is attentive and identify which element they are looking at. In general, given as input a pair of images corresponding to the driver view and the road view at a specific moment during driving (i.e. capturing what happens inside and outside the vehicle), the model can determine whether the driver is paying attention to the road. Additionally, it can identify, and return as output, which specific element on the road is drawing more of the driver's interest at that moment. A schematic representation of the full model is shown in Figure 11.

Figure 11: General architecture presented in our paper. The network is divided into two branches: one computes the point the driver focuses on, the other identifies all the principal street objects. The model then assesses the driver's attention (whether they are looking at an element of the road) and which element they pay the most attention to.
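Assuming the standard Ultralytics YOLOv8 Python API, a minimal sketch of this combination step: the fine-tuned detector is run on the road view and the gaze point predicted by the first branch is tested against the returned bounding boxes. The weight path, data file and confidence threshold are illustrative, not taken from the paper.

```python
from ultralytics import YOLO

# Fine-tuning (done once): model = YOLO("yolov8n.pt"); model.train(data="road.yaml", epochs=40)
detector = YOLO("runs/detect/train/weights/best.pt")    # fine-tuned weights (illustrative path)

def estimate_attention(road_image, gaze_xy, conf=0.25):
    """Return (attentive, focused_class) by projecting the gaze point onto YOLO boxes."""
    gx, gy = gaze_xy
    result = detector(road_image, conf=conf, verbose=False)[0]
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        x1, y1, x2, y2 = box
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            return True, result.names[int(cls)]          # attentive, looking at this object
    return False, None                                   # gaze point outside every relevant box

# Example: gaze point predicted by the gaze branch for the paired driver-view frame.
attentive, focus = estimate_attention("road_view.jpg", gaze_xy=(812.0, 447.5))
print(attentive, focus)
```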






4. Results and Discussion

To perform the experiments, the DGAZE dataset has been split into training, validation and test sets according to the original division [16]. Of the 20 drivers, 16 were used for training (corresponding to 60% of the video sequences), 2 for validation (20%) and 2 for testing (20%). As mentioned earlier, various training experiments were conducted for both the GazeCNN and the GazeTR-Hybrid models. In addition, for the first model we also considered a scenario in which the input includes the positions of the nose and the eye corners as additional features (i.e. a 17-feature vector), to assess whether increasing the number of features has any effect on the model's performance.
   All the models were trained using the L1 loss function and the Adam optimizer with a learning rate of 1e-3, a weight decay of 1e-5, $\beta_1 = 0.9$ and $\beta_2 = 0.97$. Additionally, a StepLR scheduler with a step size of 15000 and a gamma of 0.1 was applied to improve training performance. The models were trained for 10 epochs with a batch size of 16. All the hyperparameters were tuned experimentally to avoid overfitting and to reach the best possible performance. The experiments were performed on an NVIDIA GeForce RTX 3060 Laptop GPU. The next subsection presents the results of these training experiments in more detail.
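For reference, a minimal sketch of the training configuration just described (L1 loss, Adam with learning rate 1e-3, weight decay 1e-5 and betas (0.9, 0.97), StepLR with step size 15000 and gamma 0.1, 10 epochs, batch size 16). The model and dataloader are placeholders, and stepping the scheduler per iteration rather than per epoch is an assumption.

```python
import torch
from torch import nn, optim

def train_gaze_model(model, train_loader, epochs=10, device="cuda"):
    """Training loop with the hyperparameters reported in Section 4."""
    model = model.to(device)
    criterion = nn.L1Loss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5,
                           betas=(0.9, 0.97))
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15000, gamma=0.1)

    for epoch in range(epochs):
        for eye, feats, target in train_loader:       # batch size 16 set in the DataLoader
            eye, feats, target = eye.to(device), feats.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(eye, feats), target)
            loss.backward()
            optimizer.step()
            scheduler.step()                          # assumed: StepLR counted per iteration
        print(f"epoch {epoch}: last batch L1 loss = {loss.item():.2f}")
```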
4.1. Driver Gaze Prediction

In this section, we present the results of the experiments conducted for the gaze detection task. We consider the GazeCNN, the GazeCNN + features and the GazeTR-Hybrid (CNN + Transformer) models. To validate the obtained results, we consider three different metrics:

   • Accuracy w.r.t. Threshold:
     $acc_{thresh} = \frac{1}{n} \sum_{i \in \mathcal{I}} x_i \qquad (4)$
     where $\mathcal{I}$ is the set of images in the dataset, $n = |\mathcal{I}|$ is its cardinality, and
     $x_i = \begin{cases} 1 & \text{if } d(g_i, \hat{g}_i) < \text{threshold} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$
     where $d(g_i, \hat{g}_i) = \sqrt{(g_i - \hat{g}_i)^2}$ is the Euclidean distance, $\hat{g}_i$ is the estimated gaze point and $g_i$ the true gaze point in road-view image coordinates. The threshold has been set to 250 pixels.

   • Accuracy w.r.t. Bounding Box:
     $acc_{bbox} = \frac{1}{n} \sum_{i \in \mathcal{I}} x_i \qquad (6)$
     where $\mathcal{I}$ is the set of images in the dataset, $n = |\mathcal{I}|$ is its cardinality, and
     $x_i = \begin{cases} 1 & \text{if } \hat{g}_i \in \text{bounding box} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$
     where $\hat{g}_i$ is the estimated gaze point and $g_i$ the true gaze point in road-view image coordinates. The bounding box considered is the one surrounding the road element that the driver was instructed to observe during the creation of the dataset.

   • Displacement via Euclidean Distance:
     $D(g, \hat{g}) = \frac{1}{n} \sum_{i=1}^{n} \sqrt{(g_i - \hat{g}_i)^2} \qquad (8)$
     where $\hat{g}_i$ is the estimated gaze point and $g_i$ the true gaze point in road-view image coordinates.

   Table 2 shows the evaluation of the three metrics for the three selected models at the best epoch during the testing phase. The CNN + Transformer model performs better than the GazeCNN model in all cases, which demonstrates its effectiveness in the considered task. We believe that, with an increase in epochs and input features, the CNN + Transformer model has the potential to achieve even better results by increasing the accuracy of the computed gaze point. Comparing GazeCNN + features with the CNN + Transformer, the latter proves superior in both bounding box accuracy and Euclidean error, while the former slightly outperforms it in threshold accuracy. We can also observe how the addition of input features (eye corners and nose position) leads to a remarkable improvement in performance for GazeCNN, proving to be a crucial factor in the learning process.
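A NumPy sketch of the three metrics above (Eqs. (4)-(8)), together with the MAE used for the comparison in Table 3; pred and true are arrays of predicted and ground-truth gaze points, and boxes are the ground-truth bounding boxes in (x1, y1, x2, y2) form. Treating the MAE as a per-coordinate average is an assumption of this sketch.

```python
import numpy as np

def gaze_metrics(pred, true, boxes, threshold=250.0):
    """pred, true: (N, 2) gaze points; boxes: (N, 4) ground-truth boxes as x1, y1, x2, y2."""
    pred, true, boxes = np.asarray(pred), np.asarray(true), np.asarray(boxes)
    dist = np.linalg.norm(pred - true, axis=1)                    # Euclidean distance d(g, g_hat)

    acc_thresh = float(np.mean(dist < threshold))                 # Eqs. (4)-(5)
    inside = ((pred[:, 0] >= boxes[:, 0]) & (pred[:, 0] <= boxes[:, 2]) &
              (pred[:, 1] >= boxes[:, 1]) & (pred[:, 1] <= boxes[:, 3]))
    acc_bbox = float(np.mean(inside))                             # Eqs. (6)-(7)
    displacement = float(np.mean(dist))                           # Eq. (8)
    mae = float(np.mean(np.abs(pred - true)))                     # Eq. (9), per-coordinate MAE
    return {"acc_thresh": acc_thresh, "acc_bbox": acc_bbox,
            "euclidean_error": displacement, "mae": mae}

# Example with two synthetic samples.
print(gaze_metrics(pred=[[100, 120], [400, 300]],
                   true=[[110, 118], [600, 340]],
                   boxes=[[80, 100, 140, 140], [580, 320, 620, 360]]))
```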






Table 2
Evaluation Metrics at best epoch in Test Dataset for the three selected models

                     Model             Threshold Accuracy [%]        Bbox Accuracy [%]      Euclidean Error [px]
                  GazeCNN                       37.57                       15.97                    371.93
              GazeCNN + features                46.33                       18.50                    320.54
              CNN + Transformer                 45.62                       19.72                    317.40


Table 3
Comparison table between our models and other SOTA eye gaze models on train, validation and test pixel accuracy (calculated
via Mean Absolute Error)

                               Model            Train Error [px]       Val Error [px]   Test Error [px]
                         Turker Gaze [35]            171.30                176.37           190.72
                           iTracker [20]             140.10                205.65           190.5
                          I-DGAZE [16]               133.34                204.77           186.89
                             GazeCNN                 163.00                154.41           228.46
                        GazeCNN + features           171.99                174.63           199.99
                        CNN + Transformer            200.85                197.88           196.53



   We would like to point out that, for all the models, the bounding box (bbox) accuracy is relatively low. This can be explained by the fact that, for many videos in the dataset, the fixation elements tend to be small, as they are far away, and therefore the corresponding bounding boxes are similarly small. Bounding box accuracy is very restrictive, since an error of even a single pixel can place the point outside the corresponding bounding box and therefore decrease the accuracy.
   Considering the analyzed results, the GazeTR-Hybrid (CNN + Transformer) model has been employed in the overall Driver Visual Attention Estimation model to perform point-gaze estimation. To confirm what has been discussed so far, we present a comparison in Table 3 between the models just considered and some SOTA eye gaze models. In particular, we consider the model proposed in TurkerGaze [35], which uses pixel-level face features as input and Ridge Regression to estimate the gaze point on the screen, the one proposed in Eye-tracking for Everyone [20], which predicts user gaze on phones and tablets, and finally I-DGAZE, the model presented in our reference paper [16].
   The error used as a metric for this comparison is the Mean Absolute Error (MAE), calculated by taking the mean of the absolute differences between model predictions and actual values. In mathematical terms, it is expressed as:
   $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |g_i - \hat{g}_i| \qquad (9)$
   where $n$ is the total number of samples, $g_i$ represents the actual values and $\hat{g}_i$ the model predictions. The smaller the Mean Absolute Error, the more accurate the model is in predicting the coordinates of the gaze point. We can see that, even in this case, the CNN + Transformer model is in line with the other SOTA models on the validation and test error, proving the efficacy of the method. In contrast, its train error is the highest. This phenomenon does not fit any classical training scheme and is therefore not correlated with underfitting or overfitting; a lower validation error compared to the train error may be caused by the samples selected for validation being particularly simple for the network to predict. Finally, it is important to note that the GazeCNN model has the lowest validation error. However, this is associated with a higher test error, possibly indicating overfitting during training.

4.2. Driver Attention Evaluation

In Table 4 we describe the results obtained from the analysis of drivers' attention using the general model described in Figure 11. To perform this analysis, we considered only the two drivers belonging to the test set, as specified above, out of the 20 included in the dataset. The DGAZE dataset provides bounding box coordinates as labels only for the object observed by the driver. Therefore, we consider these bounding boxes as indicative of the most important element in the scene, and we treat any detected object other than the selected one as an incorrect focus object. Based on this reasoning, we identified three attention score scenarios:







   • Correct bbox (Attention Score = 2): the driver is looking at the correct road element indicated by the dataset, so the point the driver is focusing on falls in the bounding box of the expected object;

   • Another bbox (Attention Score = 1): the driver is attentive, but focused on another element of the road, so the point the driver is focusing on falls in the bounding box of an object different from the expected one;

   • No bbox (Attention Score = 0): the driver is not paying attention to the road and is therefore not looking at any important road element, so the point the driver is focusing on does not fall in any bounding box.

Table 4
Results of Visual Attention Estimation in Drivers. An attention score of 2 indicates a correct object of focus, an attention score of 1 an incorrect object of focus but an attentive driver, and an attention score of 0 a distracted driver.

          Attention Score                Percentage [%]
   Correct bbox (Att. Score = 2)             16.06
   Another bbox (Att. Score = 1)             29.95
     No bbox (Att. Score = 0)                53.99

Table 5
Object focus distribution in the test set for drivers. The obtained data show that drivers tend to focus their attention on vehicles (car and truck) compared to other elements.

       Object Type        Percentage [%]
         person                8.33
         truck                15.70
          car                 16.40
      road signal              2.90
      motorcycle               2.66
      traffic light            0.01
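A small sketch of how the attention score defined above can be assigned, given the predicted gaze point, the bounding box of the designated object and the other boxes detected by YOLO (all in road-view coordinates); the (x1, y1, x2, y2) box format is an assumption of this sketch.

```python
def point_in_box(point, box):
    """box is (x1, y1, x2, y2); point is (x, y)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def attention_score(gaze_xy, expected_box, detected_boxes):
    """2: gaze in the expected object's box, 1: in another detected box, 0: in no box."""
    if point_in_box(gaze_xy, expected_box):
        return 2                      # correct bbox: attentive, on the designated element
    if any(point_in_box(gaze_xy, box) for box in detected_boxes):
        return 1                      # another bbox: attentive, but on a different element
    return 0                          # no bbox: distracted

# Example: gaze lands inside a detected car box but not the designated pedestrian box.
print(attention_score((640, 410), expected_box=(100, 300, 180, 420),
                      detected_boxes=[(600, 380, 720, 460), (100, 300, 180, 420)]))
```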
                                                                          sults with the GazeTR-Hybrid model. This second model
   We observe that the system identifies distracted drivers               was consequently used to implement driver visual at-
(Attention Score=0) 53.99% of the time, a percentage                      tention detection. For object detection, we employed a
which does not fall in line with expected results. Un-                    fine-tuned YOLOv8 model capable of recognizing cars,
fortunately, this result is attributed to the suboptimal                  people, trucks, motorcycles, traffic lights and various road
performance of our CNN + Transformer model, partic-                       signs. By combining the outputs of the two branches, i.e.
ularly in bbox accuracy which as shown in Table 4 is                      projecting the driver’s gaze point (whose coordinates are
particularly low (less than 20%). As mentioned earlier,                   obtained as output from the gaze detection branch) onto
this is a challenging task, as even small pixel errors in                 the corresponding ’road view’, where all relevant road
this context have significant relevance, and it therefore                 objects identified by YOLO are located, we evaluated the
highlights the need for greater precision in determining                  actual visual attention of drivers. This approach allowed
the gaze point, especially in such cases where a high                     us to obtain two valuable pieces of information: whether
accuracy is necessary due to safety reasons.                              the driver is attentive or not and, if so, to which element
   In the scenario where the system recognizes drivers as                 of the road.
attentive, instead (approximately 46.01% of the time), we                    Possible future improvements are evident, starting
notice that generally they are attentive but focused on                   with the gaze detection task, where increased precision
road elements that are not considered the most impor-                     in calculating the gaze point could lead to better results
tant (Attention Score=1). The data presented in Table 5                   in assessing drivers’ visual attention. We believe that
reveals that, most often, drivers concentrate their atten-                the addition of more features during the training phase
tion on the vehicles in front of them, especially on cars                 to the GazeTR-Hybrid model could lead to the desired
and trucks. This indicates a higher level of attention to                 improvement in performances, thus achieving increas-
other vehicles compared to road signs or other objects,                   ingly precise results. This, in turn, would contribute to
which justifiable due to other vehicles being the main                    an effective improvement in Visual Attention Estimation
’antagonistic’ driving element and the primary source                     in drivers. This is a consequence of the fact that, by in-
of potential impediment to road safety. Even though                       creasing precision, we can identify information about
in our dataset we have predetermined attention objec-                     the objects the driver is focusing on even in case of oc-
tives, which consequently limits the correctness of the                   clusions, i.e. if they are distant or partially hidden by
obtained results, a statistical analysis can be performed                 other elements. However, we find our approach to the
with our framework in different scenarios to gain insight                 Driver Vision Attention task promising for future works,
on drivers’ attention behaviour and on the objects that                   particularly in the aspect of obtaining more complete
they pay most attention to in different driving situations.               results on the drivers’ engagement with the road.
                                                                             Drivers’ attention and the object they focus on can
                                                                          be subsequently used in different contexts. For instance,
5. Conclusion                                                             the former could be applied in assessing attention in
                                                                          systems designed to alert the driver when not paying
In this paper we presented a new way to perform the
                                                                          adequate attention to the road, while the second can be
task of driver visual attention detection. As already men-






used to train autonomous driving models, helping them            [10] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-
understand what to prioritize in each driving scenario. A             tracking system with low-end hardware: Devel-
mixed model able to detect both data could lead to more               opment and evaluation, Information (Switzerland)
comprehensive autonomous or assisted driving systems                  14 (2023). doi:10.3390/info14120644.
by reducing training times due to faster data collection.        [11] F. Fiani, S. Russo, C. Napoli, An advanced solu-
References

[1] A. Eriksson, N. A. Stanton, Takeover time in highly automated vehicles: noncritical transitions to and from manual control, Human Factors 59 (2017) 689–705.
[2] T. A. Dingus, S. G. Klauer, V. L. Neale, A. Petersen, S. E. Lee, J. Sudweeks, M. A. Perez, J. Hankey, D. Ramsey, S. Gupta, C. Bucher, Z. R. Doerzaph, J. Jermeland, R. R. Knipling, The 100-car naturalistic driving study: Phase II – Results of the 100-car field experiment (2006).
[3] T. Rueda-Domingo, P. Lardelli-Claret, J. de Dios Luna-del Castillo, J. J. Jiménez-Moleón, M. García-Martín, A. Bueno-Cavanillas, The influence of passengers on the risk of the driver causing a car collision in Spain: Analysis of collisions from 1990 to 1999, Accident Analysis & Prevention 36 (2004) 481–489.
[4] K. A. Braitman, N. K. Chaudhary, A. T. McCartt, Effect of passenger presence on older drivers' risk of fatal crash involvement, Traffic Injury Prevention 15 (2014) 451–456.
[5] F. Bonanno, G. Capizzi, G. L. Sciuto, C. Napoli, G. Pappalardo, E. Tramontana, A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSS by using WRNN predictors and GPU parallel solutions, 2014, pp. 1077–1084. doi:10.1109/SPEEDAM.2014.6872127.
[6] I. E. Tibermacine, A. Tibermacine, W. Guettala, C. Napoli, S. Russo, Enhancing sentiment analysis on SEED-IV dataset with vision transformers: A comparative study, 2023, pp. 238–246. doi:10.1145/3638985.3639024.
[7] N. N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, volume 3118, 2021, pp. 51–63.
[8] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post COVID-19 pandemic Rorschach test data of using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[9] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel ConvMixer transformer based architecture for violent behavior detection, 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[10] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information (Switzerland) 14 (2023). doi:10.3390/info14120644.
[11] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote EMDR therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[12] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.
[13] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[14] Y. Cheng, F. Lu, Gaze estimation using transformer, in: 2022 26th International Conference on Pattern Recognition (ICPR), IEEE, 2022, pp. 3341–3347.
[15] J. Terven, D.-M. Córdova-Esparza, J.-A. Romero-González, A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS, Machine Learning and Knowledge Extraction 5 (2023) 1680–1716.
[16] I. Dua, T. A. John, R. Gupta, C. Jawahar, DGAZE: Driver gaze mapping on road, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 5946–5953.
[17] Y. Sugano, Y. Matsushita, Y. Sato, Learning-by-synthesis for appearance-based 3D gaze estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1821–1828.
[18] D. Melesse, M. Khalil, E. Kagabo, T. Ning, K. Huang, Appearance-based gaze tracking through supervised machine learning, in: 2020 15th IEEE International Conference on Signal Processing (ICSP), volume 1, IEEE, 2020, pp. 467–471.
[19] X. Zhang, Y. Sugano, M. Fritz, A. Bulling, Appearance-based gaze estimation in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
[20] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, A. Torralba, Eye tracking for everyone, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2176–2184.
[21] T. Fischer, H. J. Chang, Y. Demiris, RT-GENE: Real-time eye gaze estimation in natural environments, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 334–352.
[22] H. S. Yoon, N. R. Baek, N. Q. Truong, K. R. Park, Driver gaze detection based on deep residual networks using the combined single image of dual near-infrared cameras, IEEE Access 7 (2019) 93448–93461.
[23] N. Mizuno, A. Yoshizawa, A. Hayashi, T. Ishikawa, Detecting driver's visual attention area by using vehicle-mounted device, in: 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), IEEE, 2017, pp. 346–352.
[24] S. Vora, A. Rangesh, M. M. Trivedi, Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis, IEEE Transactions on Intelligent Vehicles 3 (2018) 254–265.
[25] S. M. Shah, Z. Sun, K. Zaman, A. Hussain, M. Shoaib, L. Pei, A driver gaze estimation method based on deep learning, Sensors 22 (2022) 3959.
[26] T. Deng, H. Yan, L. Qin, T. Ngo, B. Manjunath, How do drivers allocate their potential attention? Driving fixation prediction via convolutional neural networks, IEEE Transactions on Intelligent Transportation Systems 21 (2019) 2146–2154.
[27] Y. Xia, D. Zhang, A. Pozdnoukhov, K. Nakayama, K. Zipser, D. Whitney, Training a network to attend like human drivers saves it from common but misleading loss functions, arXiv preprint arXiv:1711.06406 (2017).
[28] C. Gou, Y. Zhou, D. Li, Driver attention prediction based on convolution and transformers, The Journal of Supercomputing 78 (2022) 8268–8284.
[29] A. Palazzi, D. Abati, F. Solera, R. Cucchiara, et al., Predicting the driver's focus of attention: the DR(eye)VE project, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 1720–1733.
[30] S. Ghosh, A. Dhall, M. Hayat, J. Knibbe, Q. Ji, Automatic gaze analysis: A survey of deep learning based approaches, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2023) 61–84.
[31] D. E. King, Dlib-ml: A machine learning toolkit, The Journal of Machine Learning Research 10 (2009) 1755–1758.
[32] T. Hempel, A. A. Abdelrahman, A. Al-Hamadi, 6D rotation representation for unconstrained head pose estimation, in: 2022 IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 2496–2500.
[33] Roboflow 100, Road signs dataset, 2023. URL: https://universe.roboflow.com/roboflow-100/road-signs-6ih4y.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[35] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.