                                A Fully Automatic Visual Attention Estimation Support
                                System for A Safer Driving Experience
                                Francesca Fiani1 , Samuele Russo2 and Christian Napoli1,3,4
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
2 Department of Psychology, Sapienza University of Rome, 00185 Roma, Italy
3 Institute for Systems Analysis and Computer Science, Italian National Research Council, 00185 Roma, Italy
4 Department of Computational Intelligence, Czestochowa University of Technology, 42-201 Czestochowa, Poland


                                               Abstract
                                               Drivers’ attention is a key element in safe driving and in avoiding possible accidents. In this paper, we present a new approach
                                               to the task of Visual Attention Estimation in drivers. The model we introduce consists of two branches, one which performs
                                               Gaze Point Detection to determine the exact point of focus of the driver, and the other which executes Object Detection
                                               to recognize all relevant elements on the road (e.g. vehicles, pedestrians, and traffic signs). Combining the outputs
                                               of the two branches allows us to determine whether the driver is attentive and, if so, on which element of the road
                                               they are focusing. Two models are tested for the gaze detection task: the GazeCNN model and a CNN+Transformer
                                               model. The performance of both models is evaluated and compared with other state-of-the-art models to
                                               choose the best approach for the task. Finally, the results of the Visual Attention Estimation performed on 3761 pairs of
                                               images (driver view and corresponding road view) from the DGAZE dataset are reported and analyzed.

                                               Keywords
                                               Visual Attention Estimation, ADAS (Autonomous Driver Assistance Systems), GazeCNN, Visual Transformers, DGAZE



SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
fiani@diag.uniroma1.it (F. Fiani); samuele.russo@uniroma1.it (S. Russo); cnapoli@diag.uniroma1.it (C. Napoli)
0009-0005-0396-7019 (F. Fiani); 0000-0002-1846-9996 (S. Russo); 0000-0002-9421-8566 (C. Napoli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Attention while driving is a key element of road safety, keeping passengers, drivers and pedestrians safe. Distractions caused by secondary tasks have been shown to be the main factor in slowed responses to immediately dangerous situations [1], with 80% of reported crashes and 65% of near-crashes across 100 analyzed vehicles caused by unsafe driving behaviors such as inattention [2]. Moreover, the probability of collisions caused by driver distraction is significantly reduced when passengers warn the driver about unseen hazards [3, 4]. This shows the importance of developing increasingly efficient Advanced Driver Assistance Systems (ADAS), especially ones based on artificial intelligence algorithms capable of understanding whether a driver is distracted from the road and alerting them. The identification of drivers' points of focus can also be used to train autonomous driving algorithms to pay more attention to some elements rather than others, thus making them more capable of safe driving. Machine learning and distributed computing approaches, e.g. cloud computing, have become a cornerstone of modern data technology, playing a pivotal role in various sectors [5, 6]. In the green economy, machine learning algorithms help optimize energy consumption and reduce carbon footprint by predicting demand and managing supply efficiently. In the field of renewable energies, these algorithms aid in forecasting energy production from sources like wind and solar, thereby facilitating effective grid management. In the field of human-computer interaction, machine learning enhances user experience by enabling systems to understand and respond to human behavior in a more intuitive and personalized manner [7, 8, 9, 10, 11]. Lastly, in the automobile industry, machine learning is driving the revolution of autonomous vehicles and smart traffic management systems, contributing to safer and more efficient transportation [12].
   The goal of this paper is to introduce a new approach to visual attention estimation for safe driving. To the best of our knowledge, most studies on driver attention are based either on the evaluation of driver behavior, without considering the environment surrounding the car, or exclusively on the road, training models to identify the elements to focus on. Our approach, in contrast, entails a comprehensive consideration of both the driver and the road views. Specifically, we assess the point of focus of the driver, contextually understanding whether they are paying attention to the road and, if so, which element of the road has captured their focus. To do this, we divide our task into two parts:

   • Gaze point detection: we identify the point the driver is looking at to assess where the driver is paying attention;

   • Object identification: we identify the main objects on the road, namely pedestrians, motorbikes, traffic signs, traffic lights, other cars, and trucks.







Figure 1: Example of paired images from the DGAZE dataset. (a) Driver view of driver number 22. (b) Sample 15 road view of driver number 22.

   For the first task we employ the GazeCNN model, a variant of a ResNet [13] that takes various facial features as input, such as the nose and left pupil positions, the head pose and the eye corners. In addition, to perform a comparative analysis between two different methods, we also consider a ResNet+Transformer model [14] fine-tuned to output the exact position the driver is looking at. For the second task, instead, we use a fine-tuned YOLOv8 model, part of the YOLO family of object detection algorithms [15], configured to consider only the classes of interest. To accomplish our task, we used the DGAZE dataset [16], which to the best of our knowledge is one of the few datasets that provide both the driver's view and the road view. The data was collected in a controlled laboratory setting where 112 street videos were projected in front of 20 'drivers', who were told to focus on a designated point annotated in the projected video. The dataset contains over 180,000 pairs of images, where each pair includes a road view and the corresponding driver view, plus a label indicating the coordinates of the point the driver was instructed to focus on (specifically, the center of the bounding box of the object). An example from this dataset is reported in Figure 1.
   The paper is organized as follows. In Section 2, related works on gaze detection and driver gaze prediction are reviewed to frame our work within the current state of the art. Section 3 describes the data analysis, the pre-processing and feature extraction performed on the DGAZE dataset, and the proposed architecture. Section 4 reports the experiments and the corresponding results. Finally, Section 5 presents the study's conclusions.


2. Related Works

2.1. Gaze Detection

Gaze detection is a highly significant topic in the fields of Computer Vision and Human-Robot Interaction. Despite various advancements over time, it remains a challenging task due to aspects such as the uniqueness of faces and eyes, potential occlusions, differences in lighting, image quality, etc. Throughout the literature, various methods have been employed, ranging from simple classification methods, like Random Forest [17] and SVM [18], to deep neural network models. The use of deep CNNs has greatly enhanced the accuracy of this task, with strong results obtained even on in-the-wild datasets [19]. While the majority of works use only the eyes to perform gaze estimation, other works use facial features other than the eyes, such as facial grids [20] or a combination of the eye images and the head pose [21].
   Transformers are also a viable novel solution, with two types of transformers derived from the Vision Transformer (ViT) framework finding success [14]. The first one, denoted as GazeTR-Pure, processes the cropped face as input, divides it into patches and passes them to a transformer encoder that returns the gaze direction. In contrast, GazeTR-Hybrid adopts a hybrid approach, combining Convolutional Neural Networks (CNNs) with transformers: the CNN extracts local feature maps from the face, which are then passed to the transformer encoder to capture the global relationships between the maps and finally obtain the desired output. These models take advantage of the transformer's attention mechanism to improve performance, with GazeTR-Hybrid obtaining results comparable to the state of the art. As previously mentioned, GazeTR-Hybrid will be the base for one of our two approaches.

2.2. Driver Gaze Prediction

The driver gaze prediction task is approached in two ways in the literature. The first approach focuses only on the interior images of the car (the driver's view) [22, 23, 24, 25]. Generally, the car is divided into different zones, such as the windscreen, the speedometer, the two side-view mirrors, the rear-view mirror, and so on. The algorithms try to predict which of these areas the driver is looking at by analyzing the images of the driver.
   The other approach, instead, focuses only on the outside of the car. Many papers analyze images of the road recorded from inside the car through the windscreen to calculate an attention map, i.e. a heat map where brighter colors indicate the elements on which drivers focus most while driving [26, 27, 28, 29]. Attention maps are extremely significant for autonomous driving, since they may be useful in training models that can understand, in a given driving situation, which of the many important elements of the road to focus on the most.
   For what concerns the DGAZE dataset, already described in the introduction, a related model called I-DGAZE has also been developed [16]. The model consists of two branches.







Figure 2: (a) Cropped driver 22 view subjected to K-Means clustering. (b) Corresponding color distribution histogram after K-Means clustering. The x axis represents the bin number, while the y axis reports the number of pixel occurrences per bin.

The first is composed of a CNN with the addition of a final flattened layer, which takes the driver's left eye as input. The other is composed of only dense layers and takes as input various features of the face, namely its pose, location, and area. The features generated by the two branches are then merged and passed to a fully connected layer that uses them to determine the coordinates of the gaze point (x, y).
   Building on the literature just presented, our work is quite innovative in using an approach that is not widely adopted for the identification of drivers' attention while driving. It also compares two models for gaze detection, as mentioned above, combining their results with those of YOLO so as to output whether or not the driver is paying attention to the road and, in particular, to which element.


3. Materials and Methods

3.1. Data Analysis

Due to various challenges in gaze detection (e.g. eye-head interplay, illumination, eye registration errors, occlusions, difficulties in generalizing eye region appearance) [30], before proceeding with the implementation we conducted a thorough analysis of the color distribution of our data, examining it in both the RGB and HSV color spaces. Driver-view images were cropped to a 700 x 700 format starting from the pixel coordinates (x, y) = (25, 100). This pre-processing step, consistently applied throughout our work, was designed with the specific purpose of eliminating non-essential areas within the image, focusing only on the face region.
   We then computed histograms in the RGB and HSV color spaces for a randomly selected sample from each driver's image set. The K-Means algorithm was employed to cluster all the colors into 16 clusters, with the resulting histogram shown in Figure 2.
   The corresponding 1D RGB graphs are presented in Figure 3, while the flattened 3D RGB graph is shown in Figure 4. Both graphs have been normalized to facilitate comparison. Finally, we selected three distance metrics to conduct a dataset-wide comparison between the histograms and computed the corresponding matrices:

   • Wasserstein (Earth Mover's) Distance:
     $W(p, q) = \inf_{\gamma \in \Pi(p, q)} \int_{\mathbb{R} \times \mathbb{R}} \|x - y\| \, d\gamma(x, y) \qquad (1)$
     where $p$ and $q$ are two probability distributions and $\Pi(p, q)$ denotes the set of all joint probability distributions on $\mathbb{R} \times \mathbb{R}$ whose marginals are $p$ and $q$. This metric is symmetric.

   • Chi-Squared Distance:
     $\chi^2(p, q) = \sum_i \frac{(p(i) - q(i))^2}{p(i)} \qquad (2)$
     where $p$ and $q$ are two probability distributions.

   • Kullback-Leibler Divergence:
     $D_{KL}(p \,\|\, q) = \sum_i p(i) \log\left(\frac{p(i)}{q(i)}\right) \qquad (3)$
     where $p$ and $q$ are two probability distributions on the same sample space.

   The obtained matrices for the 3D RGB graphs are reported in Figures 5, 6 and 7. Our data analysis indicates that there are no significant differences in color distribution among the various driver images, with the exception of certain drivers, such as Drivers 13 and 5, which show consistently high values across all metrics. Conversely, Drivers 2, 22, and 23 occasionally exhibit increased differences, but not consistently across all plots. The three metrics have also been calculated for the 1D channels and averaged, producing similar results; they are therefore not reported. The same process has also been repeated for the HSV color space, so the 1D and flattened 3D graphs have been computed, together with an additional 2D heat map of the 3D graph, and the nine distance matrices have been computed. Given the use of the RGB space during the experiments and the absence of significant differences in the HSV analysis, we skip the presentation of these results.
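As a concrete reference for the analysis above, the following minimal Python sketch crops a driver-view frame to 700 x 700 pixels starting at (x, y) = (25, 100), builds a normalized 16-cluster K-Means color histogram, and compares two such histograms with the three metrics of Eqs. (1)-(3). The file names, the number of K-Means restarts and the use of OpenCV/scikit-learn/SciPy are assumptions of this sketch, not details taken from the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import wasserstein_distance
from scipy.special import rel_entr

def color_histogram(image_path, n_clusters=16):
    """Crop the driver view and build a normalized K-Means color histogram."""
    img = cv2.imread(image_path)                       # BGR image
    crop = img[100:800, 25:725]                        # 700x700 crop starting at (x=25, y=100)
    pixels = crop.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=n_clusters).astype(np.float64)
    return counts / counts.sum()                       # normalized histogram

def wasserstein(p, q):
    bins = np.arange(len(p))
    return wasserstein_distance(bins, bins, p, q)      # Eq. (1), 1D formulation over bins

def chi_squared(p, q, eps=1e-12):
    return float(np.sum((p - q) ** 2 / (p + eps)))     # Eq. (2)

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(rel_entr(p + eps, q + eps)))   # Eq. (3)

if __name__ == "__main__":
    # Hypothetical sample frames for two drivers.
    p = color_histogram("driver_05_sample.png")
    q = color_histogram("driver_22_sample.png")
    print(wasserstein(p, q), chi_squared(p, q), kl_divergence(p, q))
```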







Figure 3: Graphs of the red, green and blue channel bin frequency distribution. Each channel has 16 bins (represented on the x axis), with the frequency of each bin represented on the y axis. The frequency distribution has been normalized.

Figure 4: Graph of the flattened bin frequency distribution. 64 bins have been considered for the flattened 3D representation (represented on the x axis), with the frequency of each bin represented on the y axis. The frequency distribution has been normalized.

Figure 5: Wasserstein (Earth Mover's) distance matrix between all pairs of sample drivers' 3D color distributions. A high value indicates a large color space distance between images. Only the upper triangular matrix is reported given the symmetry of the matrix.


3.2. Architecture

As mentioned, our idea is to divide the model into two branches. The first branch predicts the exact coordinates (x, y) of the driver's focus point from the input driver view. The decision to predict the exact point of focus of the driver stems from the desire to achieve greater accuracy in estimating visual attention. This way, we are able to distinguish precisely which element of the road the driver is paying more attention to, even when elements overlap. To the best of our knowledge, this is a situation that is not very common in the literature and could be an important innovation towards increasingly accurate results in Visual Attention Estimation. The lengths of the road-view and driver videos were manually aligned, since some road-view videos (which are common to all drivers) were mismatched in the number of frames with respect to the driver videos. The input is then processed to extract key components of the face, i.e. the driver's face, the left eye, the pupil position, the nose position, the head pose and the eye corners. A combination of SOTA tools for analyzing facial features was used: a shape predictor, obtained from dlib [31], for the extraction of the eyes and the positions of the nose and pupils, a frontal face detector, also from dlib, for the extraction of the face, and SixDRepNet [32] for the extraction of the head pose.
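A minimal sketch of this feature-extraction step, assuming dlib's standard frontal face detector and 68-landmark shape predictor; the landmark indices used for the left eye, the pupil approximation and the head-pose placeholder (SixDRepNet in the paper) are assumptions of this sketch.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumes the publicly available 68-landmark model distributed with dlib examples.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_features(frame_bgr):
    """Return the face box, a left-eye crop and a few landmark-based features."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    face = faces[0]
    shape = predictor(gray, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])

    left_eye = pts[36:42]                    # eye landmarks (image-left eye)
    nose_tip = pts[30]
    eye_corners = pts[[36, 39, 42, 45]]      # outer/inner corners of both eyes
    pupil_proxy = left_eye.mean(axis=0)      # rough pupil position from eye landmarks

    # Eye crop resized to the 32x64 input expected by the gaze branch.
    x0, y0 = left_eye.min(axis=0) - 5
    x1, y1 = left_eye.max(axis=0) + 5
    eye_crop = cv2.resize(frame_bgr[y0:y1, x0:x1], (64, 32))

    # Head pose (yaw, pitch, roll) comes from SixDRepNet in the paper;
    # here it is left as a placeholder value.
    head_pose = np.zeros(3)

    return {"face": face, "eye_crop": eye_crop, "pupil": pupil_proxy,
            "nose": nose_tip, "eye_corners": eye_corners, "head_pose": head_pose}
```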








Figure 6: Chi-Squared distance matrix between all pairs of sample drivers' 3D color distributions. A high value indicates a large color space distance between images.

Figure 7: Kullback-Leibler divergence matrix between all pairs of sample drivers' 3D color distributions. A high value indicates a large color space distance between images.

Figure 8: Schematic model of the GazeCNN architecture.

Table 1
Structure of the GazeCNN architecture: layers, kernel sizes and output channels of the eye feature branch, the feature branch and the fused branch.

                    Eye Feature Branch
        Layer            Kernel      Output Channels
     Conv2D_1             3x3                8
     Conv2D_2             3x3                8
     MaxPool2d_1          4x4                8
     Dropout                                 8
     Conv2D_3             3x3                4
     MaxPool2d_2          4x4                4
     Flatten_1                              336
                      Feature Branch
        Layer            Kernel      Output Channels
     Dense_1                                16
                       Fused Branch
        Layer            Kernel      Output Channels
     Merge_1                                352
     Dense_2                                64
     Dense_3                                 2

   Two types of models are considered for this branch and compared to evaluate which performs best. The first model is GazeCNN, a variation of I-DGAZE in model and layer sizes. The model, shown in Figure 8, is composed of two branches which extract the features used as inputs for the final fully connected layer. The first branch takes the cropped 3 x 32 x 64 left eye image as input, which is passed through three convolutional layers. The second and third convolutional layers are each followed by a max-pooling layer, while the second convolutional block has an additional residual connection compared to the original architecture. The resulting output is then flattened into a 336-dimensional feature vector. The other branch, instead, takes a series of facial features as input. We examined two scenarios to assess the actual influence of these features on the final outcome: in one case we used a 7-dimensional face feature vector comprising the head pose and the positions of the two pupils, while in the other we also added the positions of the nose and eye corners. The performance in both scenarios will be discussed in the following section. This branch is composed of a single fully connected layer with output size 16. The two feature vectors output by the branches are then merged into a 352-dimensional vector, which is passed through two fully connected layers that output the final (x, y) coordinate vector of the driver's focus point. The whole structure is summarized in Table 1.
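A minimal PyTorch sketch of a GazeCNN-style network following the layer sizes of Table 1. The padding and pooling strides below are assumptions chosen so that a 3 x 32 x 64 eye crop yields the 336-dimensional vector reported in the table; the dropout rate, activations and the exact placement of the residual connection are also assumptions.

```python
import torch
import torch.nn as nn

class GazeCNN(nn.Module):
    """GazeCNN-style gaze-point regressor following the layer sizes of Table 1."""

    def __init__(self, n_face_features: int = 7):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())  # second conv block
        self.pool1 = nn.MaxPool2d(kernel_size=4, stride=2)
        self.drop = nn.Dropout(p=0.3)
        self.conv3 = nn.Sequential(nn.Conv2d(8, 4, 3, padding=1), nn.ReLU())
        self.pool2 = nn.MaxPool2d(kernel_size=4, stride=2)
        self.feat_fc = nn.Linear(n_face_features, 16)                  # feature branch (Dense_1)
        self.head = nn.Sequential(nn.Linear(336 + 16, 64), nn.ReLU(),  # fused branch
                                  nn.Linear(64, 2))

    def forward(self, eye: torch.Tensor, face_feats: torch.Tensor) -> torch.Tensor:
        x = self.conv1(eye)
        x = x + self.conv2(x)            # residual connection on the second conv block
        x = self.drop(self.pool1(x))
        x = self.pool2(self.conv3(x))
        x = torch.flatten(x, 1)          # -> (batch, 336) for a 3x32x64 input
        f = torch.relu(self.feat_fc(face_feats))
        return self.head(torch.cat([x, f], dim=1))  # (x, y) gaze point

# Example: batch of 4 eye crops (3x32x64) and 7-dimensional feature vectors.
if __name__ == "__main__":
    model = GazeCNN(n_face_features=7)
    out = model(torch.randn(4, 3, 32, 64), torch.randn(4, 7))
    print(out.shape)  # torch.Size([4, 2])
```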







Figure 9: Schematic model of the GazeTR-Hybrid architecture.

   The second model is GazeTR-Hybrid, composed of a ResNet which extracts local feature maps and a Visual Transformer which computes the global relationships between the feature maps and generates the gaze point. Our aim was to assess the performance of a transformer model in a domain where it is not commonly employed and to verify the applicability of GazeTR-Hybrid to a task different from the original one (i.e. computing a focus point instead of a gaze direction). The original model was used with its pre-trained weights, but we performed fine-tuning to adapt it for a direct comparison with GazeCNN. The structure of GazeTR-Hybrid, shown in Figure 9, is composed of various convolutional layers forming a ResNet-18 block, which generates 7 x 7 x 512 feature maps from face images. The block is followed by an additional 1 x 1 convolutional layer that adjusts the channel scale to obtain 7 x 7 x 32 feature maps. The transformer block, instead, consists of six Transformer Encoder Layers which perform an 8-head self-attention mechanism, followed by a two-layer MLP with hidden size 512 and dropout 0.1. The transformer is also equipped with a linear feedforward layer which produces the 2-dimensional output of the driver's gaze point.
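A minimal PyTorch sketch of a GazeTR-Hybrid-style regressor as described above (ResNet-18 features, a 1 x 1 convolution to 32 channels, six transformer encoder layers with 8 heads, MLP size 512 and dropout 0.1, and a linear layer producing the 2D gaze point). The torchvision backbone, the learnable pooling token and the positional embedding are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class GazeTRHybridSketch(nn.Module):
    """CNN + Transformer gaze-point regressor in the spirit of GazeTR-Hybrid."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to the last residual stage: 512 x 7 x 7 maps for 224x224 faces.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.reduce = nn.Conv2d(512, 32, kernel_size=1)           # 7x7x512 -> 7x7x32
        layer = nn.TransformerEncoderLayer(d_model=32, nhead=8, dim_feedforward=512,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.token = nn.Parameter(torch.zeros(1, 1, 32))           # learnable pooling token
        self.pos = nn.Parameter(torch.zeros(1, 50, 32))            # 49 patches + token
        self.fc = nn.Linear(32, 2)                                 # (x, y) gaze point

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        maps = self.reduce(self.cnn(face))                         # (B, 32, 7, 7)
        seq = maps.flatten(2).transpose(1, 2)                      # (B, 49, 32)
        tok = self.token.expand(seq.size(0), -1, -1)
        seq = torch.cat([tok, seq], dim=1) + self.pos
        out = self.encoder(seq)
        return self.fc(out[:, 0])                                  # regress from the token

if __name__ == "__main__":
    model = GazeTRHybridSketch()
    print(model(torch.randn(2, 3, 224, 224)).shape)                # torch.Size([2, 2])
```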
   The second branch performs object detection: the various 'road view' images are given as input to the model to recognize the most relevant elements in each of them. This is instrumental in identifying the most important objects on the road, those to which the driver should pay the most attention. For this purpose, we used a pre-trained YOLOv8 model, which was then fine-tuned on the elements we were most interested in. This way, our fine-tuned YOLO model is able to identify only the road elements of interest while excluding irrelevant ones. For our task, we combined a dataset of road signs from the RF100 initiative [33] with one created by us using images from the COCO dataset [34]. The images from COCO were carefully chosen to exclusively include pictures containing people, cars, motorcycles, and trucks. This was done to prevent our fine-tuned YOLO model from forgetting these classes, which are crucial for our task. The other dataset, instead, contains various classes of road signs that were helpful for training YOLO to identify these road elements, which are the ones every driver should pay attention to. In total, we used 3589 images, divided into 2480 for the training set and 1109 for the validation set.
   We fine-tuned the pre-trained YOLO model on this dataset for 40 epochs, resulting in a precision of 83.61%, a recall of 73.99%, and a mAP50 of 79.27%. We report the confusion matrix in Figure 10. We can see that our YOLO model performs quite well on almost all the new classes of road signs, while its performance is lower in identifying cars, people, trucks and motorcycles. This could be attributed to the fact that the images from the road signs dataset contain only one element of the considered class, leading to higher precision, whereas the photos from COCO contain various elements of different classes in each image. This might make it harder for our fine-tuned model to learn from images rich in different elements, resulting in poorer performance on those classes. We also see a particularly low precision for the traffic light class, probably influenced by the lower number of samples in our dataset. Despite this, for use in our Visual Attention Estimation model, the achieved results can be considered acceptable.

Figure 10: Confusion matrix of the YOLO fine-tuned model.

   The outputs of the two branches are finally combined to determine the final output of the model. If the driver's point of gaze falls within one of the bounding boxes of the road elements identified by YOLOv8, we can assert with confidence that the driver is attentive and identify which element they are looking at. In general, given as input a pair of images corresponding to the driver view and the road view at a specific moment during driving (i.e. capturing what happens inside and outside the vehicle), the model can determine whether the driver is paying attention to the road. Additionally, it can identify, and return as output, which specific element on the road is drawing more of the driver's interest at that moment. A schematic representation of the full model is shown in Figure 11.

Figure 11: General architecture presented in our paper. The network is divided into two branches: one computes the point the driver focuses on, the other identifies all the principal street objects. The model then assesses the driver's attention (whether they are looking at an element of the road) and which element they pay the most attention to.
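Assuming the standard Ultralytics YOLOv8 Python API, a minimal sketch of this combination step: the fine-tuned detector is run on the road view and the gaze point predicted by the first branch is tested against the returned bounding boxes. The weight path, data file and confidence threshold are illustrative, not taken from the paper.

```python
from ultralytics import YOLO

# Fine-tuning (done once): model = YOLO("yolov8n.pt"); model.train(data="road.yaml", epochs=40)
detector = YOLO("runs/detect/train/weights/best.pt")    # fine-tuned weights (illustrative path)

def estimate_attention(road_image, gaze_xy, conf=0.25):
    """Return (attentive, focused_class) by projecting the gaze point onto YOLO boxes."""
    gx, gy = gaze_xy
    result = detector(road_image, conf=conf, verbose=False)[0]
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        x1, y1, x2, y2 = box
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            return True, result.names[int(cls)]          # attentive, looking at this object
    return False, None                                   # gaze point outside every relevant box

# Example: gaze point predicted by the gaze branch for the paired driver-view frame.
attentive, focus = estimate_attention("road_view.jpg", gaze_xy=(812.0, 447.5))
print(attentive, focus)
```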






4. Results and Discussion

To perform the experiments, the DGAZE dataset has been split into training, validation and test sets according to the original division [16]. Of the 20 drivers, 16 were used for training (corresponding to 60% of the video sequences), 2 for validation (20%) and 2 for testing (20%). As mentioned earlier, various training experiments were conducted for both the GazeCNN and the GazeTR-Hybrid models. In addition, for the first model we also considered a scenario in which the input includes the positions of the nose and the eye corners as additional features (i.e. a 17-feature vector), to assess whether increasing the number of features has any effect on the model's performance.
   All the models were trained using the L1 loss function and the Adam optimizer with a learning rate of 1e-3, a weight decay of 1e-5, $\beta_1 = 0.9$ and $\beta_2 = 0.97$. Additionally, a StepLR scheduler with a step size of 15000 and a gamma of 0.1 was applied to improve training performance. The models were trained for 10 epochs with a batch size of 16. All the hyperparameters were tuned experimentally to avoid overfitting and to reach the best possible performance. The experiments were performed on an NVIDIA GeForce RTX 3060 Laptop GPU. The next subsection presents the results of these training experiments in more detail.
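For reference, a minimal sketch of the training configuration just described (L1 loss, Adam with learning rate 1e-3, weight decay 1e-5 and betas (0.9, 0.97), StepLR with step size 15000 and gamma 0.1, 10 epochs, batch size 16). The model and dataloader are placeholders, and stepping the scheduler per iteration rather than per epoch is an assumption.

```python
import torch
from torch import nn, optim

def train_gaze_model(model, train_loader, epochs=10, device="cuda"):
    """Training loop with the hyperparameters reported in Section 4."""
    model = model.to(device)
    criterion = nn.L1Loss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5,
                           betas=(0.9, 0.97))
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15000, gamma=0.1)

    for epoch in range(epochs):
        for eye, feats, target in train_loader:       # batch size 16 set in the DataLoader
            eye, feats, target = eye.to(device), feats.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(eye, feats), target)
            loss.backward()
            optimizer.step()
            scheduler.step()                          # assumed: StepLR counted per iteration
        print(f"epoch {epoch}: last batch L1 loss = {loss.item():.2f}")
```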
4.1. Driver Gaze Prediction

In this section, we present the results of the experiments conducted for the gaze detection task. We consider the GazeCNN, the GazeCNN + features and the GazeTR-Hybrid (CNN + Transformer) models. To validate the obtained results, we consider three different metrics:

   • Accuracy w.r.t. Threshold:
     $acc_{thresh} = \frac{1}{n} \sum_{i \in \mathcal{I}} x_i \qquad (4)$
     where $\mathcal{I}$ is the set of images in the dataset, $n = |\mathcal{I}|$ is its cardinality, and
     $x_i = \begin{cases} 1 & \text{if } d(g_i, \hat{g}_i) < \text{threshold} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$
     where $d(g_i, \hat{g}_i) = \sqrt{(g_i - \hat{g}_i)^2}$ is the Euclidean distance, $\hat{g}_i$ is the estimated gaze point and $g_i$ the true gaze point in road-view image coordinates. The threshold has been set to 250 pixels.

   • Accuracy w.r.t. Bounding Box:
     $acc_{bbox} = \frac{1}{n} \sum_{i \in \mathcal{I}} x_i \qquad (6)$
     where $\mathcal{I}$ is the set of images in the dataset, $n = |\mathcal{I}|$ is its cardinality, and
     $x_i = \begin{cases} 1 & \text{if } \hat{g}_i \in \text{bounding box} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$
     where $\hat{g}_i$ is the estimated gaze point and $g_i$ the true gaze point in road-view image coordinates. The bounding box considered is the one surrounding the road element that the driver was instructed to observe during the creation of the dataset.

   • Displacement via Euclidean Distance:
     $D(g, \hat{g}) = \frac{1}{n} \sum_{i=1}^{n} \sqrt{(g_i - \hat{g}_i)^2} \qquad (8)$
     where $\hat{g}_i$ is the estimated gaze point and $g_i$ the true gaze point in road-view image coordinates.

   Table 2 shows the evaluation of the three metrics for the three selected models at the best epoch during the testing phase. The CNN + Transformer model performs better than the GazeCNN model in all cases, which demonstrates its effectiveness in the considered task. We believe that, with an increase in epochs and input features, the CNN + Transformer model has the potential to achieve even better results by increasing the accuracy of the computed gaze point. Comparing GazeCNN + features with the CNN + Transformer, the latter proves superior in both bounding box accuracy and Euclidean error, while the former slightly outperforms it in threshold accuracy. We can also observe how the addition of input features (eye corners and nose position) leads to a remarkable improvement in performance for GazeCNN, proving to be a crucial factor in the learning process.
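A NumPy sketch of the three metrics above (Eqs. (4)-(8)), together with the MAE used for the comparison in Table 3; pred and true are arrays of predicted and ground-truth gaze points, and boxes are the ground-truth bounding boxes in (x1, y1, x2, y2) form. Treating the MAE as a per-coordinate average is an assumption of this sketch.

```python
import numpy as np

def gaze_metrics(pred, true, boxes, threshold=250.0):
    """pred, true: (N, 2) gaze points; boxes: (N, 4) ground-truth boxes as x1, y1, x2, y2."""
    pred, true, boxes = np.asarray(pred), np.asarray(true), np.asarray(boxes)
    dist = np.linalg.norm(pred - true, axis=1)                    # Euclidean distance d(g, g_hat)

    acc_thresh = float(np.mean(dist < threshold))                 # Eqs. (4)-(5)
    inside = ((pred[:, 0] >= boxes[:, 0]) & (pred[:, 0] <= boxes[:, 2]) &
              (pred[:, 1] >= boxes[:, 1]) & (pred[:, 1] <= boxes[:, 3]))
    acc_bbox = float(np.mean(inside))                             # Eqs. (6)-(7)
    displacement = float(np.mean(dist))                           # Eq. (8)
    mae = float(np.mean(np.abs(pred - true)))                     # Eq. (9), per-coordinate MAE
    return {"acc_thresh": acc_thresh, "acc_bbox": acc_bbox,
            "euclidean_error": displacement, "mae": mae}

# Example with two synthetic samples.
print(gaze_metrics(pred=[[100, 120], [400, 300]],
                   true=[[110, 118], [600, 340]],
                   boxes=[[80, 100, 140, 140], [580, 320, 620, 360]]))
```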






Table 2
Evaluation Metrics at best epoch in Test Dataset for the three selected models

                     Model             Threshold Accuracy [%]        Bbox Accuracy [%]      Euclidean Error [px]
                  GazeCNN                       37.57                       15.97                    371.93
              GazeCNN + features                46.33                       18.50                    320.54
              CNN + Transformer                 45.62                       19.72                    317.40


Table 3
Comparison table between our models and other SOTA eye gaze models on train, validation and test pixel accuracy (calculated
via Mean Absolute Error)

                               Model            Train Error [px]       Val Error [px]   Test Error [px]
                         Turker Gaze [35]            171.30                176.37           190.72
                           iTracker [20]             140.10                205.65           190.5
                          I-DGAZE [16]               133.34                204.77           186.89
                             GazeCNN                 163.00                154.41           228.46
                        GazeCNN + features           171.99                174.63           199.99
                        CNN + Transformer            200.85                197.88           196.53



   We would like to point out that, for all the models, the bounding box (bbox) accuracy is relatively low. This can be explained by the fact that, for many videos in the dataset, the fixation elements tend to be small, as they are far away, and therefore the corresponding bounding boxes are similarly small. Bounding box accuracy is very restrictive, since an error of even a single pixel can place the point outside the corresponding bounding box and therefore decrease the accuracy.
   Considering the analyzed results, the GazeTR-Hybrid (CNN + Transformer) model has been employed in the overall Driver Visual Attention Estimation model to perform point-gaze estimation. To confirm what has been discussed so far, we present a comparison in Table 3 between the models just considered and some SOTA eye gaze models. In particular, we consider the model proposed in TurkerGaze [35], which uses pixel-level face features as input and Ridge Regression to estimate the gaze point on the screen, the one proposed in Eye-tracking for Everyone [20], which predicts user gaze on phones and tablets, and finally I-DGAZE, the model presented in our reference paper [16].
   The error used as a metric for this comparison is the Mean Absolute Error (MAE), calculated by taking the mean of the absolute differences between model predictions and actual values. In mathematical terms, it is expressed as:
   $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |g_i - \hat{g}_i| \qquad (9)$
   where $n$ is the total number of samples, $g_i$ represents the actual values and $\hat{g}_i$ the model predictions. The smaller the Mean Absolute Error, the more accurate the model is in predicting the coordinates of the gaze point. We can see that, even in this case, the CNN + Transformer model is in line with the other SOTA models on the validation and test error, proving the efficacy of the method. In contrast, its train error is the highest. This phenomenon does not fit any classical training scheme and is therefore not correlated with underfitting or overfitting; a lower validation error compared to the train error may be caused by the samples selected for validation being particularly simple for the network to predict. Finally, it is important to note that the GazeCNN model has the lowest validation error. However, this is associated with a higher test error, possibly indicating overfitting during training.

4.2. Driver Attention Evaluation

In Table 4 we describe the results obtained from the analysis of drivers' attention using the general model described in Figure 11. To perform this analysis, we considered only the two drivers belonging to the test set, as specified above, out of the 20 included in the dataset. The DGAZE dataset provides bounding box coordinates as labels only for the object observed by the driver. Therefore, we consider these bounding boxes as indicative of the most important element in the scene, and we treat any detected object other than the selected one as an incorrect focus object. Based on this reasoning, we identified three attention score scenarios:







   • Correct bbox (Attention Score = 2): the driver is looking at the correct road element indicated by the dataset, so the point the driver is focusing on falls in the bounding box of the expected object;

   • Another bbox (Attention Score = 1): the driver is attentive, but focused on another element of the road, so the point the driver is focusing on falls in the bounding box of an object different from the expected one;

   • No bbox (Attention Score = 0): the driver is not paying attention to the road and is therefore not looking at any important road element, so the point the driver is focusing on does not fall in any bounding box.

Table 4
Results of Visual Attention Estimation in Drivers. An attention score of 2 indicates a correct object of focus, an attention score of 1 an incorrect object of focus but an attentive driver, and an attention score of 0 a distracted driver.

          Attention Score                Percentage [%]
   Correct bbox (Att. Score = 2)             16.06
   Another bbox (Att. Score = 1)             29.95
     No bbox (Att. Score = 0)                53.99

Table 5
Object focus distribution in the test set for drivers. The obtained data show that drivers tend to focus their attention on vehicles (car and truck) compared to other elements.

       Object Type        Percentage [%]
         person                8.33
         truck                15.70
          car                 16.40
      road signal              2.90
      motorcycle               2.66
      traffic light            0.01
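A small sketch of how the attention score defined above can be assigned, given the predicted gaze point, the bounding box of the designated object and the other boxes detected by YOLO (all in road-view coordinates); the (x1, y1, x2, y2) box format is an assumption of this sketch.

```python
def point_in_box(point, box):
    """box is (x1, y1, x2, y2); point is (x, y)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def attention_score(gaze_xy, expected_box, detected_boxes):
    """2: gaze in the expected object's box, 1: in another detected box, 0: in no box."""
    if point_in_box(gaze_xy, expected_box):
        return 2                      # correct bbox: attentive, on the designated element
    if any(point_in_box(gaze_xy, box) for box in detected_boxes):
        return 1                      # another bbox: attentive, but on a different element
    return 0                          # no bbox: distracted

# Example: gaze lands inside a detected car box but not the designated pedestrian box.
print(attention_score((640, 410), expected_box=(100, 300, 180, 420),
                      detected_boxes=[(600, 380, 720, 460), (100, 300, 180, 420)]))
```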
                                                                          sults with the GazeTR-Hybrid model. This second model
   We observe that the system identifies distracted drivers               was consequently used to implement driver visual at-
(Attention Score=0) 53.99% of the time, a percentage                      tention detection. For object detection, we employed a
which does not fall in line with expected results. Un-                    fine-tuned YOLOv8 model capable of recognizing cars,
fortunately, this result is attributed to the suboptimal                  people, trucks, motorcycles, traffic lights and various road
performance of our CNN + Transformer model, partic-                       signs. By combining the outputs of the two branches, i.e.
ularly in bbox accuracy which as shown in Table 4 is                      projecting the driver’s gaze point (whose coordinates are
particularly low (less than 20%). As mentioned earlier,                   obtained as output from the gaze detection branch) onto
this is a challenging task, as even small pixel errors in                 the corresponding ’road view’, where all relevant road
this context have significant relevance, and it therefore                 objects identified by YOLO are located, we evaluated the
highlights the need for greater precision in determining                  actual visual attention of drivers. This approach allowed
the gaze point, especially in such cases where a high                     us to obtain two valuable pieces of information: whether
accuracy is necessary due to safety reasons.                              the driver is attentive or not and, if so, to which element
   In the scenario where the system recognizes drivers as                 of the road.
attentive, instead (approximately 46.01% of the time), we                    Possible future improvements are evident, starting
notice that generally they are attentive but focused on                   with the gaze detection task, where increased precision
road elements that are not considered the most impor-                     in calculating the gaze point could lead to better results
tant (Attention Score=1). The data presented in Table 5                   in assessing drivers’ visual attention. We believe that
reveals that, most often, drivers concentrate their atten-                the addition of more features during the training phase
tion on the vehicles in front of them, especially on cars                 to the GazeTR-Hybrid model could lead to the desired
and trucks. This indicates a higher level of attention to                 improvement in performances, thus achieving increas-
other vehicles compared to road signs or other objects,                   ingly precise results. This, in turn, would contribute to
which justifiable due to other vehicles being the main                    an effective improvement in Visual Attention Estimation
’antagonistic’ driving element and the primary source                     in drivers. This is a consequence of the fact that, by in-
of potential impediment to road safety. Even though                       creasing precision, we can identify information about
in our dataset we have predetermined attention objec-                     the objects the driver is focusing on even in case of oc-
tives, which consequently limits the correctness of the                   clusions, i.e. if they are distant or partially hidden by
obtained results, a statistical analysis can be performed                 other elements. However, we find our approach to the
with our framework in different scenarios to gain insight                 Driver Vision Attention task promising for future works,
on drivers’ attention behaviour and on the objects that                   particularly in the aspect of obtaining more complete
they pay most attention to in different driving situations.               results on the drivers’ engagement with the road.
                                                                             Drivers’ attention and the object they focus on can
                                                                          be subsequently used in different contexts. For instance,
5. Conclusion                                                             the former could be applied in assessing attention in
                                                                          systems designed to alert the driver when not paying
In this paper we presented a new way to perform the
                                                                          adequate attention to the road, while the second can be
task of driver visual attention detection. As already men-






used to train autonomous driving models, helping them            [10] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-
understand what to prioritize in each driving scenario. A             tracking system with low-end hardware: Devel-
mixed model able to detect both data could lead to more               opment and evaluation, Information (Switzerland)
comprehensive autonomous or assisted driving systems                  14 (2023). doi:10.3390/info14120644.
by reducing training times due to faster data collection.        [11] F. Fiani, S. Russo, C. Napoli, An advanced solu-
References

[1] A. Eriksson, N. A. Stanton, Takeover time in highly automated vehicles: noncritical transitions to and from manual control, Human Factors 59 (2017) 689–705.
[2] T. A. Dingus, S. G. Klauer, V. L. Neale, A. Petersen, S. E. Lee, J. Sudweeks, M. A. Perez, J. Hankey, D. Ramsey, S. Gupta, C. Bucher, Z. R. Doerzaph, J. Jermeland, R. R. Knipling, The 100-car naturalistic driving study: Phase II – Results of the 100-car field experiment (2006).
[3] T. Rueda-Domingo, P. Lardelli-Claret, J. de Dios Luna-del Castillo, J. J. Jiménez-Moleón, M. García-Martín, A. Bueno-Cavanillas, The influence of passengers on the risk of the driver causing a car collision in Spain: Analysis of collisions from 1990 to 1999, Accident Analysis & Prevention 36 (2004) 481–489.
[4] K. A. Braitman, N. K. Chaudhary, A. T. McCartt, Effect of passenger presence on older drivers' risk of fatal crash involvement, Traffic Injury Prevention 15 (2014) 451–456.
[5] F. Bonanno, G. Capizzi, G. L. Sciuto, C. Napoli, G. Pappalardo, E. Tramontana, A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSS by using WRNN predictors and GPU parallel solutions, 2014, pp. 1077–1084. doi:10.1109/SPEEDAM.2014.6872127.
[6] I. E. Tibermacine, A. Tibermacine, W. Guettala, C. Napoli, S. Russo, Enhancing sentiment analysis on SEED-IV dataset with vision transformers: A comparative study, 2023, pp. 238–246. doi:10.1145/3638985.3639024.
[7] N. N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, volume 3118, 2021, pp. 51–63.
[8] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post COVID-19 pandemic Rorschach test data of using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[9] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel ConvMixer transformer based architecture for violent behavior detection, 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[10] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information (Switzerland) 14 (2023). doi:10.3390/info14120644.
[11] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote EMDR therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[12] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.
[13] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[14] Y. Cheng, F. Lu, Gaze estimation using transformer, in: 2022 26th International Conference on Pattern Recognition (ICPR), IEEE, 2022, pp. 3341–3347.
[15] J. Terven, D.-M. Córdova-Esparza, J.-A. Romero-González, A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS, Machine Learning and Knowledge Extraction 5 (2023) 1680–1716.
[16] I. Dua, T. A. John, R. Gupta, C. Jawahar, DGAZE: Driver gaze mapping on road, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 5946–5953.
[17] Y. Sugano, Y. Matsushita, Y. Sato, Learning-by-synthesis for appearance-based 3D gaze estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1821–1828.
[18] D. Melesse, M. Khalil, E. Kagabo, T. Ning, K. Huang, Appearance-based gaze tracking through supervised machine learning, in: 2020 15th IEEE International Conference on Signal Processing (ICSP), volume 1, IEEE, 2020, pp. 467–471.
[19] X. Zhang, Y. Sugano, M. Fritz, A. Bulling, Appearance-based gaze estimation in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
[20] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, A. Torralba, Eye tracking for everyone, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2176–2184.
[21] T. Fischer, H. J. Chang, Y. Demiris, RT-GENE: Real-time eye gaze estimation in natural environments, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 334–352.
[22] H. S. Yoon, N. R. Baek, N. Q. Truong, K. R. Park, Driver gaze detection based on deep residual networks using the combined single image of dual near-infrared cameras, IEEE Access 7 (2019) 93448–93461.
[23] N. Mizuno, A. Yoshizawa, A. Hayashi, T. Ishikawa, Detecting driver's visual attention area by using vehicle-mounted device, in: 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), IEEE, 2017, pp. 346–352.
[24] S. Vora, A. Rangesh, M. M. Trivedi, Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis, IEEE Transactions on Intelligent Vehicles 3 (2018) 254–265.
[25] S. M. Shah, Z. Sun, K. Zaman, A. Hussain, M. Shoaib, L. Pei, A driver gaze estimation method based on deep learning, Sensors 22 (2022) 3959.
[26] T. Deng, H. Yan, L. Qin, T. Ngo, B. Manjunath, How do drivers allocate their potential attention? Driving fixation prediction via convolutional neural networks, IEEE Transactions on Intelligent Transportation Systems 21 (2019) 2146–2154.
[27] Y. Xia, D. Zhang, A. Pozdnoukhov, K. Nakayama, K. Zipser, D. Whitney, Training a network to attend like human drivers saves it from common but misleading loss functions, arXiv preprint arXiv:1711.06406 (2017).
[28] C. Gou, Y. Zhou, D. Li, Driver attention prediction based on convolution and transformers, The Journal of Supercomputing 78 (2022) 8268–8284.
[29] A. Palazzi, D. Abati, F. Solera, R. Cucchiara, et al., Predicting the driver's focus of attention: the DR(eye)VE project, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 1720–1733.
[30] S. Ghosh, A. Dhall, M. Hayat, J. Knibbe, Q. Ji, Automatic gaze analysis: A survey of deep learning based approaches, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2023) 61–84.
[31] D. E. King, Dlib-ml: A machine learning toolkit, The Journal of Machine Learning Research 10 (2009) 1755–1758.
[32] T. Hempel, A. A. Abdelrahman, A. Al-Hamadi, 6D rotation representation for unconstrained head pose estimation, in: 2022 IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 2496–2500.
[33] Roboflow 100, Road signs dataset, 2023. URL: https://universe.roboflow.com/roboflow-100/road-signs-6ih4y.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[35] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.