=Paper=
{{Paper
|id=Vol-3695/p11
|storemode=property
|title=Keeping Eyes on the Road: Understanding Driver Attention and Its Role in Safe Driving
|pdfUrl=https://ceur-ws.org/Vol-3695/p11.pdf
|volume=Vol-3695
|authors=Francesca Fiani,Valerio Ponzi,Samuele Russo
|dblpUrl=https://dblp.org/rec/conf/system/FianiPR23
}}
==Keeping Eyes on the Road: Understanding Driver Attention and Its Role in Safe Driving==
Francesca Fiani¹, Valerio Ponzi¹,² and Samuele Russo³

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
² Institute for Systems Analysis and Computer Science, Italian National Research Council, 00185 Roma, Italy
³ Department of Psychology, Sapienza University of Rome, 00185 Roma, Italy

fiani@diag.uniroma1.it (F. Fiani); ponzi@diag.uniroma1.it (V. Ponzi); samuele.russo@uniroma1.it (S. Russo)
ORCID: 0009-0005-0396-7019 (F. Fiani); 0009-0000-2910-0273 (V. Ponzi); 0000-0002-9421-8566 (S. Russo)

SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
Abstract
Monitoring the driver's attention is an important task for maintaining driving safety. Estimating the driver's gaze direction can help us evaluate whether the driver is focusing their attention on the road. For an evaluation of this type, comparing the inside view and the outside scenery of the vehicle is essential; therefore, we decided to create a specific dataset for this task. In this work, we present a machine-learning-oriented approach to driver attention evaluation using a coupled visual perception system. By analyzing the road and the driver's gaze simultaneously, it is possible to determine whether the driver is looking at the detected traffic signs. We evaluate whether a given Region Of Interest (ROI) contains a road sign through YOLOv8.
Keywords
Visual Attention Estimation, Machine Learning, Artificial Intelligence, ADAS (Advanced Driver Assistance Systems), YOLO
1. Introduction

Artificial Intelligence (AI) employed in assessing driver attention within assisted driving scenarios is swiftly advancing, propelled by the evolution of autonomous vehicles and the integration of hybrid systems designed to assist drivers. These systems encompass a range of functionalities, including cruise control, lane-keeping assistance, automatic parking, and various other features integrated into modern vehicles. It is well known that driver inattention is a major cause of road accidents [1, 2, 3], with violations of the expected driver behavior being a fundamental factor [4]. Due to its significant contribution to accidents, monitoring driver attention has become a critical necessity for automotive safety systems, aiming to detect potential risks and proactively prevent accidents. To achieve comprehensive attention monitoring, it is imperative to conduct precise analyses of various factors, including the driver's posture, head position, rotation angles, and gaze direction. These insights into driver behavior enable the identification of factors influencing reactions to different conditions and scenarios, thereby mitigating distractions and drowsiness-related incidents in the future [5].

Literature primarily addresses driver attention by dividing the internal and external components. Typically, the analysis of the vehicle cabin and the driver's gaze is conducted independently, without considering the evaluation of the surrounding environment, road conditions, and the driver's reaction to specific events.

Several studies focus either on observing the driver's behavior through internal vehicle cameras or analyzing external road conditions using external cameras and sensors [6, 7, 8, 9, 10]. However, a gap exists in comprehensive research that integrates both internal and external perspectives without relying on complex and inaccessible equipment. To address this gap, our research adopts a novel approach. We simultaneously analyze internal driver information, such as posture and gaze, and external data about road conditions and points of interest, like signs and pedestrians, during driving. This integrated approach allows for a more holistic understanding of driver attention and behavior.

Machine learning is playing a pivotal role in creating a safer society. In the realm of energy [11], machine learning algorithms are optimizing data systems [12, 13], improving supply-demand forecasting, and enhancing the efficiency of renewable energy sources. This not only ensures a stable energy supply but also reduces the risk of blackouts. When it comes to fostering a green environment, machine learning is at the forefront of monitoring and predicting environmental changes, enabling us to take timely action against potential threats [14, 15]. Social benefits are manifold, including improved healthcare through predictive diagnostics, personalized education, and effective public services, all contributing to an improved quality of life [16, 17, 18]. In the context of urban driving, machine learning is the driving force behind autonomous vehicles [19].
These vehicles promise to significantly reduce traffic accidents, improve traffic flow, and reduce carbon emissions, making our cities safer and more sustainable. Thus, machine learning is a key enabler in our pursuit of a safer society.

In this research, we merge various internal and external techniques for gaze recognition and correlate them with external Regions Of Interest (ROIs) to develop an easily applicable solution that comprehensively tackles the issue of driver attention. This approach holds significant practical implications for everyday scenarios, including:

• Autonomous vehicle development: Understanding the driver's focus during critical driving situations, including the duration of their attention to specific elements and their perception of irrelevant factors, plays a pivotal role in the advancement of Advanced Driver Assistance System (ADAS) solutions.
• Car crashes: Having information about driver attention during a road accident could facilitate the execution of investigations, checks, and insurance procedures. By utilizing an affordable camera system, video data on the driver involved in the accident could be collected and provided to an application.
• Emergency services: Emergency response vehicles, including ambulances and fire trucks, often need to navigate through traffic quickly and safely. Driver attention monitoring systems can help emergency service providers ensure their drivers remain vigilant while responding to emergencies, minimizing the likelihood of accidents and delays.
• Public transportation infrastructure: Driver attention monitoring systems can also be integrated into public transportation infrastructure, such as traffic lights and pedestrian crossings. By detecting instances of driver distraction or inattention, these systems can improve traffic flow and pedestrian safety, reducing the risk of accidents and congestion in urban areas.

To advance driver attention monitoring, we have directed our efforts towards computer vision-based methodologies, which are gaining traction over physiology-based approaches. Unlike physiological methods, in fact, vision-based techniques rely solely on cameras to observe and analyze driver behaviors, eliminating the need for intrusive devices such as eye-tracking glasses or brain-wave recognition gadgets and consequently reducing the cost associated with experiments.

The most in-depth analysis in our work focused on finding the best method and features to extract from images to accurately determine the driver's gaze direction and point of focus. Our novel approach involves the use of a grid of nine cells to predict the Regions Of Interest (ROIs) of the driver's gaze, as illustrated in Figure 1. To achieve this, we employ a VGG16 network to extract features from facial video frames, augmenting this information with head-pose data (i.e. roll, pitch, and yaw angles) to enhance gaze-position prediction [20, 21, 22]. The difference between tracking gaze position when a person is looking at a monitor and while they are driving is, in fact, substantial [21, 22]. When looking at a monitor, head movements are imperceptible, so the only discriminant is the position of the pupil. During driving, however, the driver tends to rotate their head to look at vehicles and pedestrians or tilt it to see street names, signs, or higher traffic lights. They also shift their gaze to look at mirrors or to initiate a reverse maneuver. For these reasons, analyzing only pupil movement was insufficient for our task and it was necessary to have additional information about head pose (rotation angles) and characteristics of eyes or facial images, see Figure 2.

In addition to methodological research, another significant challenge we faced was sourcing an appropriate dataset for driver attention monitoring. We encountered existing datasets with comprehensive documentation of driver behavior, but they lacked corresponding real-world external observations. Furthermore, datasets focused solely on gaze analysis typically consisted of images of individuals looking at points on a computer screen, which did not align with our real-world driving scenario. To address this gap, we decided to create our own dataset, encompassing both internal and external videos captured during driving sessions. This approach enabled our final application to process and correlate information from multiple perspectives simultaneously.

Figure 1: Example of an external image after road signs detection, with the ROI grid in green. Several regions of interest which contain one or more road signs have been identified in the image (specifically, cells 4, 5 and 6). The red rectangles represent the traffic sign bounding boxes.
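To make the nine-cell idea concrete, the short sketch below shows one possible way to index such a grid: the external frame is split into a 3x3 layout, cells are numbered 1-9 row by row, and 0 is reserved for positions outside the monitored area. The numbering is an assumption consistent with Figures 1 and 3 (middle row 4-6 on the road, cells 7-8 on the dashboard side); it illustrates the geometry only and is not the authors' implementation.

<pre>
# Hedged sketch of the 3x3 ROI grid used throughout the paper: cells are numbered
# 1..9 row by row, and 0 denotes a position outside the monitored area.
def roi_cell(x, y, frame_width, frame_height):
    """Return the grid cell (1-9) containing pixel (x, y), or 0 if outside the frame."""
    if not (0 <= x < frame_width and 0 <= y < frame_height):
        return 0
    col = int(3 * x / frame_width)   # 0, 1 or 2
    row = int(3 * y / frame_height)  # 0, 1 or 2
    return row * 3 + col + 1

# Example: the center of a sign detected at (1200, 400) in a 1440x1080 frame
# falls in cell 6 (right-hand side of the middle row).
print(roi_cell(1200, 400, 1440, 1080))
</pre>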
To train the two components of our application, we utilized two additional datasets. For the internal component, which involves predicting the driver's gaze position, we curated the HEAD-POSE dataset, featuring data from different subjects. Unlike many existing datasets that often focus on a single subject, our dataset offers a broader and more diverse range of observations. For the external component, which entails predicting the position of road signs, we leveraged a customized dataset of road signs sourced from the internet. We carefully selected images from datasets such as MAPILLARY and GTSDB, ensuring that they adhered to European traffic regulations governed by the Vienna Convention of 1968. This meticulous curation process ensured the relevance and accuracy of the data used in our research and development efforts.

2. Related Works

As previously discussed, in recent years a growing interest in analyzing driver attention during driving has emerged. This includes understanding whether a person is observing the road, being distracted, remaining vigilant, or experiencing drowsiness. Most state-of-the-art approaches are based solely on the observation of the driver's interior cabin to understand their behaviors [23, 24, 2, 20]. One or more internal cameras are used to observe the driver and determine if they are looking at the infotainment system, the road, the mirrors, or, for example, other passengers. Several methods can be used to determine the attention level of the driver, with the most classical metric being the gaze direction, generally assessed by analyzing facial features such as the face mesh. Other approaches are however available, for instance, the position of the hands and arms, which can be used to assess whether the driver keeps their hands on the steering wheel or in other positions, such as holding a phone [2].

Figure 2: Example of internal image with facial features extraction. The red rectangle is the measured face bounding box and the blue dots represent the found facial landmarks. The face of the subject has been blurred according to privacy regulations.

On the other hand, other approaches solely focus on external factors by studying the surrounding environment and collecting information about the vehicle's movement (speed, position) to study the driver's reactivity in specific circumstances. For example, various sensors such as cameras and lidar, applied to the external part of the vehicle, can allow the observation of the driver's reaction in certain situations [10]. Another classical study when analyzing the external environment surrounding the car is the analysis of road elements present in the scene via neural networks such as YOLO [25, 26].

While few in number compared to decoupled approaches, some studies simultaneously analyze both internal and external images of the vehicle while assessing driver attention to the road from the driver's perspective. In cases where interior cabin images are associated with external frames, the driver's viewpoint is often recorded using glasses or equipment that track eye movements, which directly indicates what is being observed [27]. There are also some recent datasets created in a controlled setting that simulate common driving situations, such as the DGAZE dataset with its corresponding algorithm I-DGAZE [28].

Regarding specifically the gaze detection task, various approaches are used in simulated or real environments, both indoors and outdoors [29]. In most literature works and datasets, recordings are made using a personal computer's webcam while the subject looks at specific points on the screen for certain moments. With this regression problem, the aim is to recognize the precise gaze position on the monitor by studying the direction of the pupils and gaze triangulation [30, 22, 31, 32, 33, 34]. These types of problems can, however, also be approached through classification. For example, images can be taken of a stationary subject in front of a personal computer screen, ideally divided into a 9-cell grid, and the gaze position can then be returned not as precise point coordinates, but instead as the ID number of the observed cell (classification) [21]. In such cases, pupil characteristics are generally extracted and then classified using classic machine learning methods such as Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), or Deep Neural Networks (DNNs). This last approach in particular has inspired our choice to implement a classification algorithm, given the problems and requirements already described and specific to the field of driving.
Other existing algorithms for studying gaze position start, as previously mentioned, from datasets of tens of thousands of photos collected using a personal computer's webcam, and then extract facial information from the given images to crop eye images and pass them to networks such as VGG-16 [22]. In addition, information related to head position (rotation angles: roll, pitch, yaw) can also be considered [20]. It is particularly of note that to improve regression on the viewpoint position it is fundamental to collect images from multiple subjects, in multiple vehicles, and under different weather conditions.

Finally, additional approaches make use of recordings in simulated environments using various technologies, from simulators to simple computer-played videos. For example, the user's gaze position can be recorded while watching driving videos shortly before certain incidents, in order to understand which objects the driver (simulated in this case) would have focused on [29].

3. Methods

This research explores an innovative method for recognizing gaze patterns while driving to evaluate driver attention. Subsequently, we focused on the internal aspect of the vehicle, where we trained and tested neural networks for gaze classification. Our experimentation involved various models, including SVM, ClassNet, VGG16-based Net, and HEGClass Net. Additionally, we conducted a training phase for the external aspect using a custom dataset comprising traffic sign objects. Once we obtained results for both components, we merged the two modules to conduct a comprehensive analysis of video recordings obtained during real-world driving scenarios. This integrated approach facilitated a more holistic comprehension of gaze behavior and its correlation with driver attention in typical driving situations.

3.1. Dataset

A variety of images and videos were gathered and utilized at different stages of development to construct custom datasets tailored to our research objectives. These datasets can be categorized into four distinct collections:

• Gaze Directions and Head Posture Dataset (GDHPD): This dataset comprises images captured by us, featuring individuals in a driving environment. The images are utilized to categorize the gaze position of individuals within a grid consisting of nine cells, including the exterior of the grid.
• People Driving Dataset (PDD): This dataset consists of both external and internal videos, recorded by our team, showcasing the driving activities.
• Traffic Objects Dataset: This dataset is a modified version of the Mapillary Dataset, containing images depicting various traffic signs.
• Traffic Signs Dataset in YOLO format (TSDY): This dataset comprises images sourced from the German Traffic Sign Detection Benchmark (GTSDB), available for download from Kaggle.

To create our dataset, we maintained a consistent equipment setup as depicted in Figure 3. For both the Gaze Directions and Head Posture Dataset and the People Driving Dataset we utilized a city car, while an iPhone 15 camera was employed to capture internal images and record internal videos. The iPhone was strategically positioned behind the steering wheel to ensure clear visibility of the driver while minimizing extraneous details. Furthermore, we positioned a GoPro Hero10 camera at the center of the car's dashboard to capture external footage throughout the drive.

Figure 3: The setup of the city-car environment used during the collection of images and videos. In red, the virtual grid represents how the gaze position area is divided into ROIs, each associated with a region number from 1 to 9 (the area outside the virtual grid is labeled 0). In orange, the GoPro Hero10 used to record the external street. In blue, the iPhone 15 used to record the driver.

For the GDHPD, we compiled images from distinct subjects, consisting of two males and two females. In some cases, subjects wore glasses, while in the others they did not. The image collection process encompassed various times of the day and diverse lighting conditions, resulting in a total of 1012 images. Table 1 provides a breakdown of the distribution of these images.

Subjects were positioned inside a car, with their seating adjusted to achieve a standard driving posture. Subsequently, images were captured while subjects varied their gaze and head positions. To facilitate classification, we devised a virtual grid dividing the external view and the driver's gaze into a 9-cell configuration. This grid facilitated the association of head and eye positions with specific regions of the external images, enabling the identification of Regions Of Interest (ROIs) during experimentation.

The dataset comprises ten classes: the nine cells of the grid, alongside an additional class representing situations where the subject's attention is not directed towards the road (e.g., face turned sideways, gaze directed upwards or downwards, etc.).
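For concreteness, the sketch below shows one way a GDHPD sample could be packaged for the classifiers described in Section 3.2: a cropped face image, the seven head/pupil values (roll, pitch, yaw and the two pupil centers), and the cell label 0-9. The CSV layout, column names and image transform are hypothetical; only the feature set and the ten-class labeling come from the paper.

<pre>
# Minimal sketch of a GDHPD-style sample loader (hypothetical file layout).
# Assumes an annotation CSV with columns: image_path, roll, pitch, yaw,
# rx, ry, lx, ly, label  (label in 0..9, as described in the paper).
import csv
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class GDHPDDataset(Dataset):
    def __init__(self, csv_path, image_size=224):
        with open(csv_path, newline="") as f:
            self.rows = list(csv.DictReader(f))
        self.tf = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        face = self.tf(Image.open(row["image_path"]).convert("RGB"))
        # Seven auxiliary features: head angles plus right/left pupil centers.
        aux = torch.tensor([float(row[k]) for k in
                            ("roll", "pitch", "yaw", "rx", "ry", "lx", "ly")])
        label = int(row["label"])  # 0 = outside the grid, 1..9 = grid cells
        return face, aux, label
</pre>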
GDHPD dataset specifications
Class   Male   Female   Total
0        50      64      114
1        50      64      114
2        51      71      122
3        51      71      122
4        48      60      108
5        48      60      108
6        50      50      100
7        51      50      101
8        51      61      112
9        50      61      111
Total   500     612     1012

Table 1: The Gaze Directions and Head Posture Dataset (GDHPD) collects the gaze positions of different drivers. The collection contains a total of 1012 images, 500 for male subjects and 612 for female subjects.

The PDD dataset comprises videos captured during driving sessions, utilizing the same recording setup as the previous dataset. Subjects were filmed while driving under various conditions, capturing both the driver and the road view simultaneously. To ensure synchronization, the videos underwent pre-processing using a third-party software, DaVinci Resolve. Synchronization was achieved through voice cues, guaranteeing precise alignment between internal and external footage.

Subsequently, the videos were segmented into sub-clips of 30 seconds each to streamline subsequent processing steps. Each of the extracted sub-clips was annotated with labels indicating whether the driver exhibited a "CAREFUL" or "NOT CAREFUL" driving style, along with information regarding the driver's use of glasses. After pre-processing, the external images displayed a resolution of 1440x1080, while the internal images were resized to 1080x1920.

For traffic sign detection and recognition, the primary dataset utilized was the Traffic Object dataset from the Mapillary Traffic Sign Dataset, encompassing tens of thousands of images sourced from roads worldwide. Focusing solely on Italian/European traffic signs, around 3,000 images were selected from the dataset after filtering out images with significantly different sign shapes or contents. The chosen images offer a varied range of brightness, positioning within frames, and contextual variations.

The corresponding JSON files were then converted into TXT files and formatted to suit YOLO's training model requirements. This conversion process involved extracting relevant information such as bounding box coordinates and class labels of traffic signs, facilitating model training for sign recognition and localization. To simplify the classification task, the dataset's labels were modified to include three super categories: 'PROHIBITORY', 'DANGER', and 'MANDATORY'. These categories encapsulate the majority of relevant traffic signs essential for driving safety, thus streamlining the training process. Similarly, the Traffic Signs Dataset in YOLO format (TSDY) was used as a refined version of the larger GTSDB dataset. Comprising 750 images with labels already expressed in YOLO format, each image had a resolution of 1360x800. The number of classes was reduced to four: 'PROHIBITORY', 'DANGER', 'MANDATORY', and 'OTHER', simplifying the classification task and enhancing the focus on critical sign types relevant to driver safety.
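The conversion step described above can be illustrated with a short sketch. It assumes a Mapillary-style JSON annotation with pixel-coordinate bounding boxes and writes one YOLO-format line (class cx cy w h, normalized to the image size) per sign; the JSON field names and the category-mapping callback are assumptions, not the authors' exact code.

<pre>
# Hedged sketch: convert one Mapillary-style JSON annotation to a YOLO .txt label file.
# Field names ("objects", "bbox", "label") and the super-category mapping are assumptions.
import json

SUPER_CATEGORIES = {"PROHIBITORY": 0, "DANGER": 1, "MANDATORY": 2, "OTHER": 3}

def to_yolo_line(bbox, class_id, img_w, img_h):
    """bbox = (xmin, ymin, xmax, ymax) in pixels -> 'class cx cy w h' normalized."""
    xmin, ymin, xmax, ymax = bbox
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

def convert_annotation(json_path, txt_path, img_w, img_h, category_of):
    """category_of maps an original sign label to one of the super categories."""
    with open(json_path) as f:
        ann = json.load(f)
    lines = []
    for obj in ann.get("objects", []):
        super_cat = category_of(obj["label"])
        if super_cat not in SUPER_CATEGORIES:
            continue  # signs outside the selected categories are skipped
        b = obj["bbox"]
        lines.append(to_yolo_line((b["xmin"], b["ymin"], b["xmax"], b["ymax"]),
                                  SUPER_CATEGORIES[super_cat], img_w, img_h))
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
</pre>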
3.2. Gaze Classification

Several algorithms were analyzed to identify the approach with the best trade-off between accuracy in gaze direction prediction and generalization capabilities, allowing it to efficiently recognize images with varied contrast and/or brightness, or different drivers. The structure takes as input an image (either a single image or a frame extracted from the recording) and generates a label prediction through two different but subsequent sub-models. The head-pose estimation algorithm is common to all tested approaches, while the classification algorithm was varied considerably in model type, structure, and input during the search for the optimal solution.

3.2.1. Head-Pose Estimation Part

Face detection is performed through a pre-trained Multi-task Cascaded Convolutional Network (MTCNN) model, used for both face detection and alignment in the literature [35]. MTCNN consists of a cascade of convolutional networks (P-Net, R-Net, and O-Net) for face landmark identification. The model first identifies the bounding box of the face region through candidate generation with P-Net and refinement with R-Net, and then extracts the 5 main landmarks of the face (left and right eye, nose tip, and mouth corners) with O-Net. Among similar methods, such as Haar Cascade Classifiers [36], MTCNN has shown the best results even in the presence of glasses, partially occluded eyes and beards, and has therefore been selected as our chosen method.

The identified landmarks are then used to analytically calculate the roll, pitch and yaw angles of the driver's head, while the extracted pupil positions will be used as input features for the final classifier to determine the observed ROI. This feature is fundamental for our classification task compared to other facial features and is therefore particularly important to determine accurately.
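As a concrete illustration of this stage, the sketch below uses the MTCNN implementation from the facenet-pytorch package to obtain the face box and the five landmarks, and then derives rough roll, pitch and yaw estimates from them. The angle formulas are simplified geometric approximations included for illustration only; the paper computes the angles analytically but does not report its exact formulas.

<pre>
# Hedged sketch of the head-pose stage: MTCNN face/landmark detection followed by
# rough, purely geometric roll/pitch/yaw estimates. Not the authors' exact computation.
import numpy as np
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)  # keep a single face per frame

def detect_head_features(image_path):
    img = Image.open(image_path).convert("RGB")
    boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
    if boxes is None:
        return None  # no face found in this frame
    # landmarks[0]: left eye, right eye, nose, mouth-left, mouth-right (x, y)
    left_eye, right_eye, nose, mouth_l, mouth_r = landmarks[0]

    # Roll: inclination of the eye line with respect to the horizontal.
    dx, dy = right_eye - left_eye
    roll = np.degrees(np.arctan2(dy, dx))

    # Yaw (rough): horizontal offset of the nose from the eye midpoint,
    # scaled by the inter-ocular distance.
    eye_mid = (left_eye + right_eye) / 2.0
    yaw = np.degrees(np.arctan2(nose[0] - eye_mid[0], np.hypot(dx, dy)))

    # Pitch (rough): vertical position of the nose between the eye and mouth lines.
    mouth_mid = (mouth_l + mouth_r) / 2.0
    pitch = np.degrees(np.arctan2(nose[1] - eye_mid[1], mouth_mid[1] - eye_mid[1]))

    return {
        "box": boxes[0],                                  # face bounding box (x1, y1, x2, y2)
        "pupils": np.concatenate([right_eye, left_eye]),  # (rx, ry, lx, ly)
        "angles": (roll, pitch, yaw),
    }
</pre>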
3.2.2. Classification Part

Our approach involves classifying the observed ROI and road signs through a classification method. The field of view is divided into nine sections, with an additional label for identifying any gaze position outside these sections (e.g., distracted driving or maneuvering). We explored various methods for analysis, ranging from traditional SVM to CNNs, all carefully adapted to our application. We will first introduce our novel method, followed by an overview of the other models considered in the analysis.

The standout classification model is HEGClass (Head-Eyes-Gaze Classifier), a hybrid approach outlined in this paper. It takes cropped face images from head-pose estimation, along with head rotation angles and pupil center coordinates, as inputs. This combined approach has yielded high precision in classifying the Region of Interest toward which the gaze is directed. In the HEGClass network, as depicted in Figure 4, initial features are extracted from cropped face RGB images using a pre-trained VGG-16 network. The features are then flattened and concatenated with a normalized array containing the head roll-pitch-yaw and pupil center coordinates. This combined feature vector of dimension 4096+7 passes through two fully connected linear layers, followed by ReLU activation functions, and finally through a last fully connected linear layer with Softmax activation to determine class membership among the 10 possibilities (9 ROIs for frontal regions and 1 for others). Model training utilized our GDHPD dataset, with 10 epochs to mitigate overfitting, 32 samples per batch, the Cross-Entropy loss function, and the Adam optimizer.

Figure 4: Model of our novel approach HEGClass. The base of the model is a standard pre-trained VGG-16 network, which receives as input the cropped image of the subject's face. In the first fully connected layer, 7 additional features [Roll, Pitch, Yaw, rx, ry, lx, ly] are added to arrive at the final ROI prediction (10 classes with values in [0, 9]).
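A minimal PyTorch sketch of this architecture is given below. It follows the description above (pre-trained VGG-16 image features, concatenation with the 7 normalized head/pupil values, two fully connected layers with ReLU and a final 10-way layer); the hidden widths are assumptions, the first VGG-16 classifier layer is used to obtain the 4096 image features, and the Softmax is folded into the cross-entropy loss as is standard.

<pre>
# Hedged sketch of the HEGClass classifier described in the paper.
# Hidden sizes (1024, 256) are assumptions; the 4096+7 input and the
# 10 output classes follow the text. Softmax is implicit in CrossEntropyLoss.
import torch
import torch.nn as nn
from torchvision import models

class HEGClass(nn.Module):
    def __init__(self, n_classes=10, n_aux=7):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.backbone = vgg.features          # convolutional feature extractor
        self.avgpool = vgg.avgpool
        self.face_fc = vgg.classifier[:1]     # first FC layer -> 4096 image features
        self.head = nn.Sequential(
            nn.Linear(4096 + n_aux, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, n_classes),        # logits for the 10 ROI classes
        )

    def forward(self, face, aux):
        # face: (B, 3, 224, 224) cropped face image; aux: (B, 7) normalized
        # [roll, pitch, yaw, rx, ry, lx, ly] values from the head-pose stage.
        x = self.avgpool(self.backbone(face))
        x = self.face_fc(torch.flatten(x, 1))
        x = torch.cat([x, aux], dim=1)
        return self.head(x)

model = HEGClass()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # 10 epochs, batch size 32 in the paper
</pre>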
The first classical model used for comparison is Support Vector Machine (SVM). In our scenario, where we're classifying 10 distinct classes, we employed an SVM with a polynomial kernel of degree 4, regularization parameter set at 100, and coefficient set to 10. Training exclusively used images from the GDHPD dataset. We extracted roll, pitch, yaw, and pupil centers from the images with MTCNN. Then, using the Haar Cascade Classifier [36], we isolated eye patches from each grayscale image and passed them through a pre-trained ResNet to obtain 2048 features for each eye. These two sets of features were then averaged to create a unified array of 2048 elements containing information from both eyes. The resulting samples underwent L2 normalization before being fed into the SVM for both training and testing phases.
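Under the stated hyperparameters, this baseline can be sketched with scikit-learn as follows. The feature matrix is assumed to already contain one row per image (the averaged 2048-dimensional ResNet eye descriptor); whether the head-pose and pupil values are appended to it is left open here, since the paper does not state it explicitly.

<pre>
# Hedged sketch of the SVM baseline with the hyperparameters reported in the paper
# (polynomial kernel, degree 4, C=100, coef0=10) and per-sample L2 normalization.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def train_svm_baseline(X, y):
    """X: (n_samples, n_features) eye-descriptor matrix; y: cell labels in 0..9."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = make_pipeline(
        Normalizer(norm="l2"),                       # per-sample L2 normalization
        SVC(kernel="poly", degree=4, C=100, coef0=10),
    )
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro")
</pre>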
Using the same 2048 features extracted from ResNet, together with the roll, pitch, yaw and pupil centers, we also trained the ClassNet network. This Classifier Network architecture consists of two convolutional layers with ReLU activation functions, followed by a max pooling layer, and culminates in two fully connected layers with ReLU and Softmax activation functions. Training of ClassNet spanned 300 epochs, employing the MSE (Mean Squared Error) loss function and the Adam optimizer.

Finally, the last experimental iteration involved employing the VGG-16 architecture. Training data consisted of 1012 samples from the GDHPD dataset, each composed of the face image tensor, alongside head rotation angles (roll, pitch, yaw) and pupil centers. In this setup, features from each image were directly extracted within the classification network from the RGB image of the entire face, rather than solely from the eyes. Additionally, the other features (roll, pitch, yaw, and pupil centers), obtained previously through MTCNN, were incorporated alongside the 512 features of the image in the first fully connected layer. Training was executed over 100 epochs, using, similarly to the HEGClass model, the Cross-Entropy loss function and the Adam optimizer.

3.3. YOLO Training for Traffic Signs

For the detection and recognition of traffic signs, we start from the pre-trained YOLOv8 model, with experimentation also conducted using the YOLOv5 model prior to transitioning to the v8 version. Fine-tuning of YOLOv8 was carried out using two distinct datasets: Traffic Objects and TSDY. The final set of weights chosen for the application was derived from the dual fine-tuning of YOLOv8 with both datasets.

The initial fine-tuning with the Traffic Objects dataset involved 1802 images for training and 919 for validation. Despite starting with 3000 images, adjustments were made to the training and validation sets due to imbalance issues within the original dataset, which persisted even after categorizing labels based on sign categories as described in the Dataset section. Subsequently, proceeding from the fine-tuned weights, the model underwent retraining with images from the TSDY dataset, utilizing 600 images for training and 141 for validation.

The dual fine-tuning approach resulted in enhanced performance, as evidenced by improved final accuracy and heightened generalization capabilities in detecting road signs, even in images sourced from the PDD dataset and exhibiting varied lighting conditions. To further enrich the diversity and generalization capabilities of the fine-tuned YOLO model, diverse image augmentation techniques from the Albumentations library were employed during training to simulate real-world conditions, including Blur, MedianBlur, and CLAHE (Contrast Limited Adaptive Histogram Equalization). To streamline the training process, the Stochastic Gradient Descent (SGD) optimizer was utilized, with an initial learning rate of 0.01.
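The two-stage fine-tuning just described maps naturally onto the Ultralytics training API; a hedged sketch is shown below. The dataset YAML file names and the epoch budget are placeholders (the paper does not report them), while the SGD optimizer and the 0.01 initial learning rate follow the text. When the Albumentations library is installed, Ultralytics can apply augmentations such as Blur, MedianBlur and CLAHE during training.

<pre>
# Hedged sketch of the dual fine-tuning of YOLOv8 with the Ultralytics API.
# Dataset YAML paths and epoch counts are assumptions; optimizer and lr0 follow the paper.
from ultralytics import YOLO

# Stage 1: fine-tune the pre-trained model on the Traffic Objects dataset.
model = YOLO("yolov8n.pt")
model.train(data="traffic_objects.yaml", epochs=50, imgsz=640,
            optimizer="SGD", lr0=0.01)

# Stage 2: continue from the stage-1 weights on the TSDY dataset.
# (The weights path depends on the run directory reported by stage 1.)
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="tsdy.yaml", epochs=50, imgsz=640,
            optimizer="SGD", lr0=0.01)

# Inference on an external frame returns one box per detected sign.
results = model("external_frame.jpg")
for box in results[0].boxes:
    print(int(box.cls), box.conf.item(), box.xyxy[0].tolist())
</pre>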
Post-training, the model returns a text file for each image processed, containing one line per detected sign alongside its position. By processing the pixel coordinates from this file, sign position information was extracted to reconstruct the sign's center and edges within the grid cells. In instances where signs spanned multiple cells, multiple coordinates were necessary for accurate identification.

3.4. Application: Merging the Methods

The final application, depicted in Figure 5, comprises two primary components and generates a CSV report detailing the overall behavior of a driver. It requires as input an internal video capturing the driver and an external video recording the street view. For ease of analysis, synchronization of a 30-second video between the two components is necessary.

Figure 5: Pipeline of the application presented in the paper. The internal and external videos are simultaneously processed to extract relevant features (facial landmarks and road sign bounding boxes, respectively). The features are then used to determine the active ROIs, which are then compared to generate the attention level of the driver.

After initialization, external frames undergo analysis using the YOLOv8 model trained on road signs, producing a text file detailing the detected signals along with their specifications, including the Regions of Interest (ROIs) where these signals are located. For frames with at least one detected sign, the corresponding internal image is used to extract information via the GDHPD module. The image and information are then passed through the network to classify the driver's gaze position. Upon obtaining the prediction of the observed cell in a given frame, it is compared with the position of the corresponding sign. In frames with multiple detected road signs, each corresponding ROI is considered active, thus rendering the driver alert when focusing on any of them without favoring any type of signal.

Given that a single road sign can span multiple cells, 5 points characterize the object's position: the four corners and the center. If the driver looks at a cell containing a partial view of the sign, accounting for the peripheral vision of human eyes, we consider them attentive. Moreover, in the absence of signals or when the driver's gaze is directed to cells 4/5/6 (representing the entire road surface), they are still deemed attentive to the street. If the gaze is directed to cells 7 or 8, indicating focus on the car's dashboard or infotainment system, the driver's engagement is noted accordingly.

Finally, a CSV file is generated to store the analysis results from the 30-second videos. Each row contains data including the frame count (consistent across internal and external videos), the number of road signs detected in that frame, the cell number(s) housing the detected signals, the predicted cell value from the driver's gaze network, the number of observed signals following ROI matching, and an indication of the driver's attentiveness.
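The matching rules above can be condensed into a small helper, shown below as a hedged sketch: each detected sign is expanded into the grid cells touched by its four corners and center, and the cell-based rules are then applied (gaze on a signed cell or on the road cells 4-6, or frames without signs, count as attentive). Glances at the dashboard cells 7-8 are recorded in their own field, since the paper only says they are "noted accordingly"; the exact rule encoding in the authors' application may differ.

<pre>
# Hedged sketch of the frame-level attention check used when merging the two modules.
# Grid numbering (1..9 row by row, 0 = outside) mirrors the earlier grid sketch;
# the default frame size matches the 1440x1080 external resolution reported above.
ROAD_CELLS = {4, 5, 6}        # middle row of the grid: the road surface
DASHBOARD_CELLS = {7, 8}      # lower cells: dashboard / infotainment area

def cell_of(x, y, width, height):
    """3x3 grid cell (1..9, row by row) containing pixel (x, y) of the external frame."""
    col = min(int(3 * x / width), 2)
    row = min(int(3 * y / height), 2)
    return row * 3 + col + 1

def sign_cells(bbox, width, height):
    """Cells touched by a detected sign: its four corners plus its center."""
    x1, y1, x2, y2 = bbox
    pts = [(x1, y1), (x2, y1), (x1, y2), (x2, y2), ((x1 + x2) / 2, (y1 + y2) / 2)]
    return {cell_of(x, y, width, height) for x, y in pts}

def frame_report(frame_idx, sign_boxes, gaze_cell, width=1440, height=1080):
    """One report row per frame: detected-sign cells, gaze cell, and attentiveness."""
    active = [sign_cells(b, width, height) for b in sign_boxes]
    observed = sum(1 for cells in active if gaze_cell in cells)
    attentive = observed > 0 or gaze_cell in ROAD_CELLS or not sign_boxes
    return {
        "frame": frame_idx,
        "signs_detected": len(sign_boxes),
        "sign_cells": sorted(set().union(*active)) if active else [],
        "gaze_cell": gaze_cell,
        "observed_signs": observed,
        "on_dashboard": gaze_cell in DASHBOARD_CELLS,
        "attentive": attentive,
    }

# Example: two signs in the middle-right area while the driver looks at cell 6.
print(frame_report(0, [(1100, 350, 1250, 500), (900, 380, 1000, 480)], gaze_cell=6))
</pre>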
4. Results

The objective of this research is to develop a comprehensive system capable of analyzing an individual's attention while driving using only two synchronized videos as input. Given the scarcity of references on the simultaneous analysis of internal and external perspectives, all subsequent evaluations and comparisons will focus on the individual components constituting the final system. Nonetheless, through extensive testing conducted with the PDD dataset, comprising approximately 194 videos each lasting 30 seconds, the final application demonstrates commendable performance.

4.1. Gaze Classification

Concerning face detection and landmark extraction for facial rotation angle calculation, the MTCNN model outperformed the Haar Cascade Classifier. This superiority stems from MTCNN's ability to handle various facial orientations, which is crucial for our GDHPD dataset as it contains images with rotated or profiled faces. Additionally, MTCNN's prediction of landmarks, including the center of the pupil, proved vital for training the final classification network. However, occasional failures in face detection or slight misplacements of landmarks introduce minor errors in this initial phase.

For predicting gaze direction, a zone-based classification approach was chosen over regression due to the difficulty in precisely determining the exact point on the road the individual is looking at, coupled with the human eye's ability to perceive a broad area. Despite testing various methods, the SVM-based approach struggled to exceed a 70% accuracy level, likely due to the similarity in feature values across the 1012 samples, particularly those derived from ResNet for eye images.

Transitioning to neural network-based methods, the ClassNet network yielded lower accuracy than SVM, even after experimenting with different feature combinations. Training a network based on the VGG-16 architecture from scratch yielded better results, with an accuracy level of 81%. However, the limited size of our dataset and computational constraints hindered achieving satisfactory performance through this approach. Hence, we adopted the hybrid HEGClass approach, achieving an impressive 96% accuracy and 94.3% F1-score without additional data. Comprehensive accuracy and F1-score results are shown in Table 2.

Gaze Classifiers Results
Method            Accuracy   F1-Score
HEGClass            96         94.3
SVM                 74.5       74
ClassNet            56         45.6
VGG16-based Net     81         79

Table 2: Accuracy and F1-Score of all the tested and compared methods for Gaze Classification. Our novel approach shows the overall best results among all the analyzed methods.

4.2. YOLOv8 Classification and Detection

Through dual fine-tuning of the YOLOv8 network using the Traffic Object Dataset and the TSDY dataset, an impressive final F1-score of around 95% was achieved, with an example of prediction on the PDD dataset shown in Figure 6. We observed an interesting phenomenon when training the network solely with the Traffic Objects dataset, where the F1-score is significantly lower. Specifically, the training of YOLOv8 with the Traffic Objects dataset yielded an overall accuracy of approximately 65%-70%. Performing the same process with YOLOv5, instead, unexpectedly showed a higher accuracy (around 80%), albeit with occasional misclassifications of elements such as empty spaces between tree branches. In any case, for the scope of this project this comparison is not particularly relevant, given the much higher accuracy obtained with dual fine-tuning.

Figure 6: Frame extracted from the PDD dataset predicted by the fine-tuned YOLOv8 model. The predicted traffic signs and the corresponding labels and confidences are highlighted in orange.
Despite the substantial improvement in generalization capabilities achieved through dual training, errors in sign recognition persist. Certain objects along the road may be mistaken for road signs, such as advertisements containing elements that, at low resolution, could be confused with signs. While this issue is present, its impact on the overall results remains manageable and could potentially be mitigated with a wider variety of images. Another challenge arises from grouping signs of different shapes and colors into the same class, creating a bias in their classification. Additionally, signs containing other signs within them may only become relevant in specific situations, such as parking signs reserved for disabled individuals. For this reason, some signs were excluded from the training phase.

Given the high accuracy values in detection, adjusting the confidence threshold can help alleviate misclassification issues. Signs may be recognized even when rotated, facing the opposite direction of the lane, or located in irrelevant areas. In such cases, they are counted as points of inattention. Due to class imbalance, accurately classifying the type of road sign remains a challenge. Consequently, for our purposes, only the information related to the bounding box defining the sign's position is extracted, without specifying the type of sign. Despite attempts to simplify the dataset to recognize only one class, "TRAF_SIGN", challenges persisted in distinguishing signs from unrelated environmental areas. Therefore, the decision was made to revert to using the original labels.

4.3. Overall Analysis

The final results of the application, pertaining to the prediction of the driver's average attention while viewing a video, exhibit high performance across most cases, with a few notable exceptions. Tests were conducted on 191 videos, each lasting 30 seconds, sourced from the PDD dataset. Following initial evaluations, videos compromised by low-light conditions (such as those filmed in almost night-time environments) or excessive blurring of frames, rendering accurate prediction unfeasible, were eliminated.

In instances where images lack clarity, are blurry, or exhibit excessive shaking, the predominant predicted class is 0, indicating the model's failure to accurately identify the correct gaze position. Moreover, during nighttime or low-light scenarios, accurate gaze evaluation is significantly impeded by diminished brightness. Additionally, YOLO struggles with precise detection of relevant signs, often leading to confusion. Classes 5 and 6 are frequently identified as the gaze position during driving, aligning with the fact that these areas correspond to central regions of the windscreen. In some situations, such as when the vehicle is stationary at a traffic light or in traffic congestion, the system may recognize the same signs across multiple frames. However, drivers may not consistently attend to them throughout, as they may have already observed them and they may not be of immediate significance at that moment.

5. Conclusions

This work aimed to develop and assess a comprehensive system for evaluating attention to traffic signs in driving environments. We accomplished this by creating two new datasets (GDHPD, PDD) and modifying two existing ones (Traffic Objects, TSDY) to better suit our task requirements. The final application was divided into two parts, utilizing YOLOv8 for sign prediction and MTCNN + HEGClass for gaze position classification.

Despite encountering challenges during various training and testing phases, as described in the Results section, the overall accuracy of the final system remains very high, notwithstanding the partial errors accumulated by its constituent parts.

These challenges serve as valuable insights for future research endeavors. Opportunities for improvement include implementing mechanisms to track seen and unseen signals, enhancing prediction accuracy in diverse lighting and atmospheric conditions through dataset augmentation or pre-processing techniques, and expanding datasets to ensure greater completeness.

Overall, this work presents significant potential for further refinement and advancement, promising avenues for enhancing the performance and robustness of attention evaluation systems in driving contexts.

References
[1] G. Fitch, S. Soccolich, F. Guo, J. McClafferty, Y. Fang, R. Olson, M. Pérez-Toledano, R. Hanowski, J. Hankey, T. Dingus, The impact of hand-held and hands-free cell phone use on driving performance and safety-critical event risk, 2013.
[2] W. Wang, X. Lu, P. Zhang, H. Xie, W. Zeng, Driver action recognition based on attention mechanism, in: 2019 6th International Conference on Systems and Informatics (ICSAI), 2019, pp. 1255–1259. doi:10.1109/ICSAI48974.2019.9010589.
[3] A. J. McKnight, A. S. McKnight, The effect of cellular phone use upon driver attention, Accident Analysis & Prevention 25 (1993) 259–265.
[4] J. C. de Winter, D. Dodou, The driver behaviour questionnaire as a predictor of accidents: A meta-analysis, Journal of Safety Research 41 (2010) 463–470.
[5] K. J. Anstey, J. Wood, S. Lord, J. G. Walker, Cognitive, sensory and physical factors enabling driving safety in older adults, Clinical Psychology Review 25 (2005) 45–65.
[6] E. Yurtsever, J. Lambert, A. Carballo, K. Takeda, A survey of autonomous driving: Common practices and emerging technologies, IEEE Access 8 (2020) 58443–58469.
[7] Y. Hou, C. Wang, J. Wang, X. Xue, X. L. Zhang, J. Zhu, D. Wang, S. Chen, Visual evaluation for autonomous driving, IEEE Transactions on Visualization and Computer Graphics 28 (2021) 1030–1039.
[8] Y. Xia, D. Zhang, J. Kim, K. Nakayama, K. Zipser, D. Whitney, Predicting driver attention in critical situations, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, Springer, 2019, pp. 658–674.
[9] D. Yang, Y. Wang, R. Wei, J. Guan, X. Huang, W. Cai, Z. Jiang, An efficient multi-task learning cnn for driver attention monitoring, Journal of Systems Architecture (2024) 103085.
[10] E. Yüksel, T. Acarman, Experimental study on driver's authority and attention monitoring, in: Proceedings of 2011 IEEE International Conference on Vehicular Electronics and Safety, 2011, pp. 252–257. doi:10.1109/ICVES.2011.5983824.
[11] G. Capizzi, G. L. Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271–279. doi:10.1016/j.neunet.2020.06.001.
[12] B. A. Nowak, R. K. Nowicki, M. Woźniak, C. Napoli, Multi-class nearest neighbour classifier for incomplete data handling, volume 9119, 2015, pp. 469–480. doi:10.1007/978-3-319-19324-3_42.
[13] C. Ciancarelli, G. De Magistris, S. Cognetta, D. Appetito, C. Napoli, D. Nardi, A gan approach for anomaly detection in spacecraft telemetries 531 LNNS (2023) 393–402. doi:10.1007/978-3-031-18050-7_38.
[14] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[15] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[16] S. Russo, S. I. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[17] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information (Switzerland) 14 (2023). doi:10.3390/info14120644.
[18] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote emdr therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[19] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.
[20] D. Yang, X. Li, X. Dai, R. Zhang, L. Qi, W. Zhang, Z. Jiang, All in one network for driver attention monitoring, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2258–2262. doi:10.1109/ICASSP40776.2020.9053659.
[21] D. Melesse, M. Khalil, E. Kagabo, T. Ning, K. Huang, Appearance-based gaze tracking through supervised machine learning, in: 2020 15th IEEE International Conference on Signal Processing (ICSP), volume 1, 2020, pp. 467–471. doi:10.1109/ICSP48669.2020.9321075.
[22] X. Zhang, Y. Sugano, M. Fritz, A. Bulling, MPIIGaze: Real-world dataset and deep appearance-based gaze estimation, 2017. URL: https://arxiv.org/abs/1711.09017.
[23] S. Vora, A. Rangesh, M. M. Trivedi, Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis, 2018. URL: https://arxiv.org/abs/1802.02690.
[24] N. Mizuno, A. Yoshizawa, A. Hayashi, T. Ishikawa, Detecting driver's visual attention area by using vehicle-mounted device, in: 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2017, pp. 346–352. doi:10.1109/ICCI-CC.2017.8109772.
[25] J. Terven, D.-M. Córdova-Esparza, J.-A. Romero-González, A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas, Machine Learning and Knowledge Extraction 5 (2023) 1680–1716.
[26] A. A. Lima, M. M. Kabir, S. C. Das, M. N. Hasan, M. Mridha, Road sign detection using variants of yolo and r-cnn: An analysis from the perspective of bangladesh, in: Proceedings of the International Conference on Big Data, IoT, and Machine Learning: BIM 2021, Springer, 2022, pp. 555–565.
[27] A. Palazzi, D. Abati, S. Calderara, F. Solera, R. Cucchiara, Predicting the driver's focus of attention: The dr(eye)ve project abs/1807.02588 (2018). URL: https://arxiv.org/abs/1705.03854. arXiv:1705.03854.
[28] I. Dua, T. A. John, R. Gupta, C. Jawahar, Dgaze: Driver gaze mapping on road, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 5946–5953.
[29] A. Yoshizawa, H. Iwasaki, Analysis of driver's visual attention using near-miss incidents, in: 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2017, pp. 353–360. doi:10.1109/ICCI-CC.2017.8109773.
[30] H. M. Peixoto, A. M. G. Guerreiro, A. D. D. Neto, Image processing for eye detection and classification of the gaze direction, in: 2009 International Joint Conference on Neural Networks, 2009, pp. 2475–2480. doi:10.1109/IJCNN.2009.5178924.
[31] K. Guo, G. Yu, Z. Li, An new algorithm for analyzing driver's attention state, in: 2009 IEEE Intelligent Vehicles Symposium, 2009, pp. 21–23. doi:10.1109/IVS.2009.5164246.
[32] H. Lee, J. Seo, H. Jo, Gaze tracking system using structure sensor & zoom camera, in: 2015 International Conference on Information and Communication Technology Convergence (ICTC), 2015, pp. 830–832. doi:10.1109/ICTC.2015.7354677.
[33] A. G. Mavely, J. E. Judith, P. A. Sahal, S. A. Kuruvilla, Eye gaze tracking based driver monitoring system, in: 2017 IEEE International Conference on Circuits and Systems (ICCS), 2017, pp. 364–367. doi:10.1109/ICCS1.2017.8326022.
[34] H. Mohsin, S. H. Abdullah, Pupil detection algorithm based on feature extraction for eye gaze, in: 2017 6th International Conference on Information and Communication Technology and Accessibility (ICTA), 2017, pp. 1–4. doi:10.1109/ICTA.2017.8336048.
[35] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multi-task cascaded convolutional networks, 2022. URL: https://arxiv.org/abs/1604.02878. doi:10.48550/ARXIV.2210.07548.
[36] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, 2001, pp. I–I. doi:10.1109/CVPR.2001.990517.