=Paper=
{{Paper
|id=Vol-3695/p11
|storemode=property
|title=Keeping Eyes on the Road: Understanding Driver Attention and Its Role in Safe Driving
|pdfUrl=https://ceur-ws.org/Vol-3695/p11.pdf
|volume=Vol-3695
|authors=Francesca Fiani,Valerio Ponzi,Samuele Russo
|dblpUrl=https://dblp.org/rec/conf/system/FianiPR23
}}
==Keeping Eyes on the Road: Understanding Driver Attention and Its Role in Safe Driving==
Francesca Fiani¹, Valerio Ponzi¹,² and Samuele Russo³

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
² Institute for Systems Analysis and Computer Science, Italian National Research Council, 00185 Roma, Italy
³ Department of Psychology, Sapienza University of Rome, 00185 Roma, Italy

fiani@diag.uniroma1.it (F. Fiani); ponzi@diag.uniroma1.it (V. Ponzi); samuele.russo@uniroma1.it (S. Russo)
ORCID: 0009-0005-0396-7019 (F. Fiani); 0009-0000-2910-0273 (V. Ponzi); 0000-0002-9421-8566 (S. Russo)

SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
Abstract
Monitoring the driver's attention is an important task for maintaining driving safety. Estimating the driver's gaze direction can help us evaluate whether the driver is focusing their attention on the road. For an evaluation of this type, comparing the inside view and the outside scenery of the vehicle is essential; therefore, we decided to create a specific dataset for this task. In this work, we present a machine-learning-oriented approach to driver attention evaluation using a coupled visual perception system. By analyzing the road and the driver's gaze simultaneously, it is possible to determine whether the driver is looking at the detected traffic signs. We evaluate whether a given Region Of Interest (ROI) contains a road sign through YOLOv8.
Keywords
Visual Attention Estimation, Machine Learning, Artificial Intelligence, ADAS (Advanced Driver Assistance Systems), YOLO
1. Introduction

Artificial Intelligence (AI) employed in assessing driver attention within assisted driving scenarios is swiftly advancing, propelled by the evolution of autonomous vehicles and the integration of hybrid systems designed to assist drivers. These systems encompass a range of functionalities, including cruise control, lane-keeping assistance, automatic parking, and various other features integrated into modern vehicles. It is well known that driver inattention is a major cause of road accidents [1, 2, 3], with violations of the expected driver behavior being a fundamental factor [4]. Due to its significant contribution to accidents, monitoring driver attention has become a critical necessity for automotive safety systems, aiming to detect potential risks and proactively prevent accidents. To achieve comprehensive attention monitoring, it is imperative to conduct precise analyses of various factors, including the driver's posture, head position, rotation angles, and gaze direction. These insights into driver behavior enable the identification of factors influencing reactions to different conditions and scenarios, thereby mitigating distractions and drowsiness-related incidents in the future [5].

Literature primarily addresses driver attention by dividing the internal and external components. Typically, the analysis of the vehicle cabin and the driver's gaze is conducted independently, without considering the evaluation of the surrounding environment, road conditions, and the driver's reaction to specific events.

Several studies focus either on observing the driver's behavior through internal vehicle cameras or analyzing external road conditions using external cameras and sensors [6, 7, 8, 9, 10]. However, a gap exists in comprehensive research that integrates both internal and external perspectives without relying on complex and inaccessible equipment. To address this gap, our research adopts a novel approach. We simultaneously analyze internal driver information, such as posture and gaze, and external data about road conditions and points of interest, like signs and pedestrians, during driving. This integrated approach allows for a more holistic understanding of driver attention and behavior.

Machine learning is playing a pivotal role in creating a safer society. In the realm of energy [11], machine learning algorithms are optimizing data systems [12, 13], improving supply-demand forecasting, and enhancing the efficiency of renewable energy sources. This not only ensures a stable energy supply but also reduces the risk of blackouts. When it comes to fostering a green environment, machine learning is at the forefront of monitoring and predicting environmental changes, enabling us to take timely action against potential threats [14, 15]. Social benefits are manifold, including improved healthcare through predictive diagnostics, personalized education, and effective public services, all contributing to an improved quality of life [16, 17, 18]. In the context of urban driving, machine learning is the driving force behind autonomous vehicles [19].
These vehicles promise to significantly reduce traffic accidents, improve traffic flow, and reduce carbon emissions, making our cities safer and more sustainable. Thus, machine learning is a key enabler in our pursuit of a safer society.

In this research, we merge various internal and external techniques for gaze recognition and correlate them with external Regions Of Interest (ROIs) to develop an easily applicable solution that comprehensively tackles the issue of driver attention. This approach holds significant practical implications for everyday scenarios, including:

• Autonomous vehicle development: Understanding the driver's focus during critical driving situations, including the duration of their attention to specific elements and their perception of irrelevant factors, plays a pivotal role in the advancement of Advanced Driver Assistance System (ADAS) solutions.
• Car crashes: Having information about driver attention during a road accident could facilitate the execution of investigations, checks, and insurance procedures. By utilizing an affordable camera system, video data on the driver involved in the accident could be collected and provided to an application.
• Emergency services: Emergency response vehicles, including ambulances and fire trucks, often need to navigate through traffic quickly and safely. Driver attention monitoring systems can help emergency service providers ensure their drivers remain vigilant while responding to emergencies, minimizing the likelihood of accidents and delays.
• Public transportation infrastructure: Driver attention monitoring systems can also be integrated into public transportation infrastructure, such as traffic lights and pedestrian crossings. By detecting instances of driver distraction or inattention, these systems can improve traffic flow and pedestrian safety, reducing the risk of accidents and congestion in urban areas.

To advance driver attention monitoring, we have directed our efforts towards computer vision-based methodologies, which are gaining traction over physiology-based approaches. Unlike physiological methods, in fact, vision-based techniques rely solely on cameras to observe and analyze driver behaviors, eliminating the need for intrusive devices such as eye-tracking glasses or brain-wave recognition gadgets and consequently reducing the cost associated with experiments.

The most in-depth analysis in our work focused on finding the best method and features to extract from images to accurately determine the driver's gaze direction and point of focus. Our novel approach involves the use of a grid of nine cells to predict the Regions Of Interest (ROIs) of the driver's gaze, as illustrated in Figure 1. To achieve this, we employ a VGG16 network to extract features from facial video frames, augmenting this information with head-pose data (i.e. roll, pitch, and yaw angles) to enhance gaze-position prediction [20, 21, 22]. The difference between tracking gaze position when a person is looking at a monitor and while they are driving is, in fact, substantial [21, 22]. When looking at a monitor, head movements are imperceptible, so the only discriminant is the position of the pupil. During driving, however, the driver tends to rotate their head to look at vehicles and pedestrians or tilt it to see street names, signs, or higher traffic lights. They also shift their gaze to look at mirrors or to initiate a reverse maneuver. For these reasons, analyzing only pupil movement was insufficient for our task and it was necessary to have additional information about head pose (rotation angles) and characteristics of eyes or facial images, see Figure 2.

In addition to methodological research, another significant challenge we faced was sourcing an appropriate dataset for driver attention monitoring. We encountered existing datasets with comprehensive documentation of driver behavior, but they lacked corresponding real-world external observations. Furthermore, datasets focused solely on gaze analysis typically consisted of images of individuals looking at points on a computer screen, which did not align with our real-world driving scenario. To address this gap, we decided to create our own dataset, encompassing both internal and external videos captured during driving sessions. This approach enabled our final application to process and correlate information from multiple perspectives simultaneously.

Figure 1: Example of an external image after road signs detection, with the ROI grid in green. Several regions of interest which contain one or more road signs have been identified in the image (specifically, cells 4, 5 and 6). The red rectangles represent the traffic sign bounding boxes.
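To make the nine-cell idea concrete, the short sketch below shows one possible way to index such a grid: the external frame is split into a 3x3 layout, cells are numbered 1-9 row by row, and 0 is reserved for positions outside the monitored area. The numbering is an assumption consistent with Figures 1 and 3 (middle row 4-6 on the road, cells 7-8 on the dashboard side); it illustrates the geometry only and is not the authors' implementation.

<pre>
# Hedged sketch of the 3x3 ROI grid used throughout the paper: cells are numbered
# 1..9 row by row, and 0 denotes a position outside the monitored area.
def roi_cell(x, y, frame_width, frame_height):
    """Return the grid cell (1-9) containing pixel (x, y), or 0 if outside the frame."""
    if not (0 <= x < frame_width and 0 <= y < frame_height):
        return 0
    col = int(3 * x / frame_width)   # 0, 1 or 2
    row = int(3 * y / frame_height)  # 0, 1 or 2
    return row * 3 + col + 1

# Example: the center of a sign detected at (1200, 400) in a 1440x1080 frame
# falls in cell 6 (right-hand side of the middle row).
print(roi_cell(1200, 400, 1440, 1080))
</pre>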
To train the two components of our application, we utilized two additional datasets. For the internal component, which involves predicting the driver's gaze position, we curated the HEAD-POSE dataset, featuring data from different subjects. Unlike many existing datasets that often focus on a single subject, our dataset offers a broader and more diverse range of observations. For the external component, which entails predicting the position of road signs, we leveraged a customized dataset of road signs sourced from the internet. We carefully selected images from datasets such as MAPILLARY and GTSDB, ensuring that they adhered to European traffic regulations governed by the Vienna Convention of 1968. This meticulous curation process ensured the relevance and accuracy of the data used in our research and development efforts.

2. Related Works

As previously discussed, in recent years a growing interest in analyzing driver attention during driving has emerged. This includes understanding whether a person is observing the road, being distracted, remaining vigilant, or experiencing drowsiness. Most state-of-the-art approaches are based solely on the observation of the driver's interior cabin to understand their behaviors [23, 24, 2, 20]. One or more internal cameras are used to observe the driver and determine if they are looking at the infotainment system, the road, the mirrors, or, for example, other passengers. Several methods can be used to determine the attention level of the driver, with the most classical metric being the gaze direction, generally assessed by analyzing facial features such as the face mesh. Other approaches are however available, for instance, the position of the hands and arms, which can be used to assess whether the driver keeps their hands on the steering wheel or in other positions, such as holding a phone [2].

Figure 2: Example of internal image with facial features extraction. The red rectangle is the measured face bounding box and the blue dots represent the found facial landmarks. The face of the subject has been blurred according to privacy regulations.

On the other hand, other approaches solely focus on external factors by studying the surrounding environment and collecting information about the vehicle's movement (speed, position) to study the driver's reactivity in specific circumstances. For example, various sensors such as cameras and lidar, applied to the external part of the vehicle, can allow the observation of the driver's reaction in certain situations [10]. Another classical study when analyzing the external environment surrounding the car is the analysis of road elements present in the scene via neural networks such as YOLO [25, 26].

While few in number compared to decoupled approaches, some studies simultaneously analyze both internal and external images of the vehicle while assessing driver attention to the road from the driver's perspective. In cases where interior cabin images are associated with external frames, the driver's viewpoint is often recorded using glasses or equipment that track eye movements, which directly indicates what is being observed [27]. There are also some recent datasets created in a controlled setting that simulate common driving situations, such as the DGAZE dataset with its corresponding algorithm I-DGAZE [28].

Regarding specifically the gaze detection task, various approaches are used in simulated or real environments, both indoors and outdoors [29]. In most literature works and datasets, recordings are made using a personal computer's webcam while the subject looks at specific points on the screen for certain moments. With this regression problem, the aim is to recognize the precise gaze position on the monitor by studying the direction of the pupils and gaze triangulation [30, 22, 31, 32, 33, 34]. These types of problems can, however, also be approached through classification. For example, images can be taken of a stationary subject in front of a personal computer screen, ideally divided into a 9-cell grid, and the gaze position can then be returned not as precise point coordinates, but instead as the ID number of the observed cell (classification) [21]. In such cases, pupil characteristics are generally extracted and then classified using classic machine learning methods such as Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), or Deep Neural Networks (DNNs). This last approach in particular has inspired our choice to implement a classification algorithm, given the problems and requirements already described and specific to the field of driving.
Other existing algorithms for studying gaze position start, as previously mentioned, from datasets of tens of thousands of photos collected using a personal computer's webcam, and then extract facial information from the given images to crop eye images and pass them to networks such as VGG-16 [22]. In addition, information related to head position (rotation angles: roll, pitch, yaw) can also be considered [20]. It is particularly of note that to improve regression on the viewpoint position it is fundamental to collect images from multiple subjects, in multiple vehicles, and under different weather conditions.

Finally, additional approaches make use of recordings in simulated environments using various technologies, from simulators to simple computer-played videos. For example, the user's gaze position can be recorded while watching driving videos shortly before certain incidents, in order to understand which objects the driver (simulated in this case) would have focused on [29].

3. Methods

This research explores an innovative method for recognizing gaze patterns while driving to evaluate driver attention. Subsequently, we focused on the internal aspect of the vehicle, where we trained and tested neural networks for gaze classification. Our experimentation involved various models, including SVM, ClassNet, VGG16-based Net, and HEGClass Net. Additionally, we conducted a training phase for the external aspect using a custom dataset comprising traffic sign objects. Once we obtained results for both components, we merged the two modules to conduct a comprehensive analysis of video recordings obtained during real-world driving scenarios. This integrated approach facilitated a more holistic comprehension of gaze behavior and its correlation with driver attention in typical driving situations.

3.1. Dataset

A variety of images and videos were gathered and utilized at different stages of development to construct custom datasets tailored to our research objectives. These datasets can be categorized into four distinct collections:

• Gaze Directions and Head Posture Dataset (GDHPD): This dataset comprises images captured by us, featuring individuals in a driving environment. The images are utilized to categorize the gaze position of individuals within a grid consisting of nine cells, including the exterior of the grid.
• People Driving Dataset (PDD): This dataset consists of both external and internal videos, recorded by our team, showcasing the driving activities.
• Traffic Objects Dataset: This dataset is a modified version of the Mapillary Dataset, containing images depicting various traffic signs.
• Traffic Signs Dataset in YOLO format (TSDY): This dataset comprises images sourced from the German Traffic Sign Detection Benchmark (GTSDB), available for download from Kaggle.

To create our dataset, we maintained a consistent equipment setup as depicted in Figure 3. For both the Gaze Directions and Head Posture Dataset and the People Driving Dataset we utilized a city car, while an iPhone 15 camera was employed to capture internal images and record internal videos. The iPhone was strategically positioned behind the steering wheel to ensure clear visibility of the driver while minimizing extraneous details. Furthermore, we positioned a GoPro Hero10 camera at the center of the car's dashboard to capture external footage throughout the drive.

Figure 3: The setup of the city-car environment used during the collection of images and videos. In red, the virtual grid represents how the gaze position area is divided into ROIs, each associated with a region number from 1 to 9 (the area outside the virtual grid is labeled 0). In orange, the GoPro Hero10 used to record the external street. In blue, the iPhone 15 used to record the driver.

For the GDHPD, we compiled images from distinct subjects, consisting of two males and two females. In some cases, subjects wore glasses, while in the others they did not. The image collection process encompassed various times of the day and diverse lighting conditions, resulting in a total of 1012 images. Table 1 provides a breakdown of the distribution of these images.

Subjects were positioned inside a car, with their seating adjusted to achieve a standard driving posture. Subsequently, images were captured while subjects varied their gaze and head positions. To facilitate classification, we devised a virtual grid dividing the external view and the driver's gaze into a 9-cell configuration. This grid facilitated the association of head and eye positions with specific regions of the external images, enabling the identification of Regions Of Interest (ROIs) during experimentation.

The dataset comprises ten classes: the nine cells of the grid, alongside an additional class representing situations where the subject's attention is not directed towards the road (e.g., face turned sideways, gaze directed upwards or downwards, etc.).
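For concreteness, the sketch below shows one way a GDHPD sample could be packaged for the classifiers described in Section 3.2: a cropped face image, the seven head/pupil values (roll, pitch, yaw and the two pupil centers), and the cell label 0-9. The CSV layout, column names and image transform are hypothetical; only the feature set and the ten-class labeling come from the paper.

<pre>
# Minimal sketch of a GDHPD-style sample loader (hypothetical file layout).
# Assumes an annotation CSV with columns: image_path, roll, pitch, yaw,
# rx, ry, lx, ly, label  (label in 0..9, as described in the paper).
import csv
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class GDHPDDataset(Dataset):
    def __init__(self, csv_path, image_size=224):
        with open(csv_path, newline="") as f:
            self.rows = list(csv.DictReader(f))
        self.tf = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        face = self.tf(Image.open(row["image_path"]).convert("RGB"))
        # Seven auxiliary features: head angles plus right/left pupil centers.
        aux = torch.tensor([float(row[k]) for k in
                            ("roll", "pitch", "yaw", "rx", "ry", "lx", "ly")])
        label = int(row["label"])  # 0 = outside the grid, 1..9 = grid cells
        return face, aux, label
</pre>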
GDHPD dataset specifications
Class   Male   Female   Total
0        50      64      114
1        50      64      114
2        51      71      122
3        51      71      122
4        48      60      108
5        48      60      108
6        50      50      100
7        51      50      101
8        51      61      112
9        50      61      111
Total   500     612     1012

Table 1: The Gaze Directions and Head Posture Dataset (GDHPD) collects the gaze positions of different drivers. The collection contains a total of 1012 images, 500 for male subjects and 612 for female subjects.

The PDD dataset comprises videos captured during driving sessions, utilizing the same recording setup as the previous dataset. Subjects were filmed while driving under various conditions, capturing both the driver and the road view simultaneously. To ensure synchronization, the videos underwent pre-processing using a third-party software, DaVinci Resolve. Synchronization was achieved through voice cues, guaranteeing precise alignment between internal and external footage.

Subsequently, the videos were segmented into sub-clips of 30 seconds each to streamline subsequent processing steps. Each of the extracted sub-clips was annotated with labels indicating whether the driver exhibited a "CAREFUL" or "NOT CAREFUL" driving style, along with information regarding the driver's use of glasses. After pre-processing, the external images displayed a resolution of 1440x1080, while the internal images were resized to 1080x1920.

For traffic sign detection and recognition, the primary dataset utilized was the Traffic Object dataset from the Mapillary Traffic Sign Dataset, encompassing tens of thousands of images sourced from roads worldwide. Focusing solely on Italian/European traffic signs, around 3,000 images were selected from the dataset after filtering out images with significantly different sign shapes or contents. The chosen images offer a varied range of brightness, positioning within frames, and contextual variations.

The corresponding JSON files were then converted into TXT files and formatted to suit YOLO's training model requirements. This conversion process involved extracting relevant information such as bounding box coordinates and class labels of traffic signs, facilitating model training for sign recognition and localization. To simplify the classification task, the dataset's labels were modified to include three super categories: 'PROHIBITORY', 'DANGER', and 'MANDATORY'. These categories encapsulate the majority of relevant traffic signs essential for driving safety, thus streamlining the training process. Similarly, the Traffic Signs Dataset in YOLO format (TSDY) was used as a refined version of the larger GTSDB dataset. Comprising 750 images with labels already expressed in YOLO format, each image had a resolution of 1360x800. The number of classes was reduced to four: 'PROHIBITORY', 'DANGER', 'MANDATORY', and 'OTHER', simplifying the classification task and enhancing the focus on critical sign types relevant to driver safety.
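The conversion step described above can be illustrated with a short sketch. It assumes a Mapillary-style JSON annotation with pixel-coordinate bounding boxes and writes one YOLO-format line (class cx cy w h, normalized to the image size) per sign; the JSON field names and the category-mapping callback are assumptions, not the authors' exact code.

<pre>
# Hedged sketch: convert one Mapillary-style JSON annotation to a YOLO .txt label file.
# Field names ("objects", "bbox", "label") and the super-category mapping are assumptions.
import json

SUPER_CATEGORIES = {"PROHIBITORY": 0, "DANGER": 1, "MANDATORY": 2, "OTHER": 3}

def to_yolo_line(bbox, class_id, img_w, img_h):
    """bbox = (xmin, ymin, xmax, ymax) in pixels -> 'class cx cy w h' normalized."""
    xmin, ymin, xmax, ymax = bbox
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

def convert_annotation(json_path, txt_path, img_w, img_h, category_of):
    """category_of maps an original sign label to one of the super categories."""
    with open(json_path) as f:
        ann = json.load(f)
    lines = []
    for obj in ann.get("objects", []):
        super_cat = category_of(obj["label"])
        if super_cat not in SUPER_CATEGORIES:
            continue  # signs outside the selected categories are skipped
        b = obj["bbox"]
        lines.append(to_yolo_line((b["xmin"], b["ymin"], b["xmax"], b["ymax"]),
                                  SUPER_CATEGORIES[super_cat], img_w, img_h))
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
</pre>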
3.2. Gaze Classification

Several algorithms were analyzed to identify the approach with the best trade-off between accuracy in gaze direction prediction and generalization capabilities, allowing it to efficiently recognize images with varied contrast and/or brightness, or different drivers. The structure takes as input an image (either a single image or a frame extracted from the recording) and generates a label prediction through two different but subsequent sub-models. The head-pose estimation algorithm is common to all tested approaches, while the classification algorithm was varied considerably in model type, structure, and input during the search for the optimal solution.

3.2.1. Head-Pose Estimation Part

Face detection is performed through a pre-trained Multi-task Cascaded Convolutional Network (MTCNN) model, used for both face detection and alignment in the literature [35]. MTCNN consists of a cascade of convolutional networks (P-Net, R-Net, and O-Net) for face landmark identification. The model first identifies the bounding box of the face region through candidate generation with P-Net and refinement with R-Net, and then extracts the 5 main landmarks of the face (left and right eye, nose tip, and mouth corners) with O-Net. Among similar methods, such as Haar Cascade Classifiers [36], MTCNN has shown the best results even in the presence of glasses, partially occluded eyes and beards, and has therefore been selected as our chosen method.

The identified landmarks are then used to analytically calculate the roll, pitch and yaw angles of the driver's head, while the extracted pupil positions will be used as input features for the final classifier to determine the observed ROI. This feature is fundamental for our classification task compared to other facial features and is therefore particularly important to determine accurately.
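As a concrete illustration of this stage, the sketch below uses the MTCNN implementation from the facenet-pytorch package to obtain the face box and the five landmarks, and then derives rough roll, pitch and yaw estimates from them. The angle formulas are simplified geometric approximations included for illustration only; the paper computes the angles analytically but does not report its exact formulas.

<pre>
# Hedged sketch of the head-pose stage: MTCNN face/landmark detection followed by
# rough, purely geometric roll/pitch/yaw estimates. Not the authors' exact computation.
import numpy as np
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)  # keep a single face per frame

def detect_head_features(image_path):
    img = Image.open(image_path).convert("RGB")
    boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
    if boxes is None:
        return None  # no face found in this frame
    # landmarks[0]: left eye, right eye, nose, mouth-left, mouth-right (x, y)
    left_eye, right_eye, nose, mouth_l, mouth_r = landmarks[0]

    # Roll: inclination of the eye line with respect to the horizontal.
    dx, dy = right_eye - left_eye
    roll = np.degrees(np.arctan2(dy, dx))

    # Yaw (rough): horizontal offset of the nose from the eye midpoint,
    # scaled by the inter-ocular distance.
    eye_mid = (left_eye + right_eye) / 2.0
    yaw = np.degrees(np.arctan2(nose[0] - eye_mid[0], np.hypot(dx, dy)))

    # Pitch (rough): vertical position of the nose between the eye and mouth lines.
    mouth_mid = (mouth_l + mouth_r) / 2.0
    pitch = np.degrees(np.arctan2(nose[1] - eye_mid[1], mouth_mid[1] - eye_mid[1]))

    return {
        "box": boxes[0],                                  # face bounding box (x1, y1, x2, y2)
        "pupils": np.concatenate([right_eye, left_eye]),  # (rx, ry, lx, ly)
        "angles": (roll, pitch, yaw),
    }
</pre>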
3.2.2. Classification Part

Our approach involves classifying the observed ROI and road signs through a classification method. The field of view is divided into nine sections, with an additional label for identifying any gaze position outside these sections (e.g., distracted driving or maneuvering). We explored various methods for analysis, ranging from traditional SVM to CNNs, all carefully adapted to our application. We will first introduce our novel method, followed by an overview of the other models considered in the analysis.

The standout classification model is HEGClass (Head-Eyes-Gaze Classifier), a hybrid approach outlined in this paper. It takes cropped face images from head-pose estimation, along with head rotation angles and pupil center coordinates, as inputs. This combined approach has yielded high precision in classifying the Region of Interest toward which the gaze is directed. In the HEGClass network, as depicted in Figure 4, initial features are extracted from cropped face RGB images using a pre-trained VGG-16 network. The features are then flattened and concatenated with a normalized array containing the head roll-pitch-yaw and pupil center coordinates. This combined feature vector of dimension 4096+7 passes through two fully connected linear layers, followed by ReLU activation functions, and finally through a last fully connected linear layer with Softmax activation to determine class membership among the 10 possibilities (9 ROIs for frontal regions and 1 for others). Model training utilized our GDHPD dataset, with 10 epochs to mitigate overfitting, 32 samples per batch, the Cross-Entropy loss function, and the Adam optimizer.

Figure 4: Model of our novel approach HEGClass. The base of the model is a standard pre-trained VGG-16 network, which receives as input the cropped image of the subject's face. In the first fully connected layer, 7 additional features [Roll, Pitch, Yaw, rx, ry, lx, ly] are added to arrive at the final ROI prediction (10 classes with values in [0, 9]).
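A minimal PyTorch sketch of this architecture is given below. It follows the description above (pre-trained VGG-16 image features, concatenation with the 7 normalized head/pupil values, two fully connected layers with ReLU and a final 10-way layer); the hidden widths are assumptions, the first VGG-16 classifier layer is used to obtain the 4096 image features, and the Softmax is folded into the cross-entropy loss as is standard.

<pre>
# Hedged sketch of the HEGClass classifier described in the paper.
# Hidden sizes (1024, 256) are assumptions; the 4096+7 input and the
# 10 output classes follow the text. Softmax is implicit in CrossEntropyLoss.
import torch
import torch.nn as nn
from torchvision import models

class HEGClass(nn.Module):
    def __init__(self, n_classes=10, n_aux=7):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.backbone = vgg.features          # convolutional feature extractor
        self.avgpool = vgg.avgpool
        self.face_fc = vgg.classifier[:1]     # first FC layer -> 4096 image features
        self.head = nn.Sequential(
            nn.Linear(4096 + n_aux, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, n_classes),        # logits for the 10 ROI classes
        )

    def forward(self, face, aux):
        # face: (B, 3, 224, 224) cropped face image; aux: (B, 7) normalized
        # [roll, pitch, yaw, rx, ry, lx, ly] values from the head-pose stage.
        x = self.avgpool(self.backbone(face))
        x = self.face_fc(torch.flatten(x, 1))
        x = torch.cat([x, aux], dim=1)
        return self.head(x)

model = HEGClass()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # 10 epochs, batch size 32 in the paper
</pre>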
The first classical model used for comparison is Support Vector Machine (SVM). In our scenario, where we're classifying 10 distinct classes, we employed an SVM with a polynomial kernel of degree 4, regularization parameter set at 100, and coefficient set to 10. Training exclusively used images from the GDHPD dataset. We extracted roll, pitch, yaw, and pupil centers from the images with MTCNN. Then, using the Haar Cascade Classifier [36], we isolated eye patches from each grayscale image and passed them through a pre-trained ResNet to obtain 2048 features for each eye. These two sets of features were then averaged to create a unified array of 2048 elements containing information from both eyes. The resulting samples underwent L2 normalization before being fed into the SVM for both training and testing phases.
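Under the stated hyperparameters, this baseline can be sketched with scikit-learn as follows. The feature matrix is assumed to already contain one row per image (the averaged 2048-dimensional ResNet eye descriptor); whether the head-pose and pupil values are appended to it is left open here, since the paper does not state it explicitly.

<pre>
# Hedged sketch of the SVM baseline with the hyperparameters reported in the paper
# (polynomial kernel, degree 4, C=100, coef0=10) and per-sample L2 normalization.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def train_svm_baseline(X, y):
    """X: (n_samples, n_features) eye-descriptor matrix; y: cell labels in 0..9."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = make_pipeline(
        Normalizer(norm="l2"),                       # per-sample L2 normalization
        SVC(kernel="poly", degree=4, C=100, coef0=10),
    )
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro")
</pre>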
Using the same 2048 features extracted from ResNet, together with the roll, pitch, yaw and pupil centers, we also trained the ClassNet network. This Classifier Network architecture consists of two convolutional layers with ReLU activation functions, followed by a max pooling layer, and culminates in two fully connected layers with ReLU and Softmax activation functions. Training of ClassNet spanned 300 epochs, employing the MSE (Mean Squared Error) loss function and the Adam optimizer.

Finally, the last experimental iteration involved employing the VGG-16 architecture. Training data consisted of 1012 samples from the GDHPD dataset, each composed of the face image tensor, alongside head rotation angles (roll, pitch, yaw) and pupil centers. In this setup, features from each image were directly extracted within the classification network from the RGB image of the entire face, rather than solely from the eyes. Additionally, the other features (roll, pitch, yaw, and pupil centers), obtained previously through MTCNN, were incorporated alongside the 512 features of the image in the first fully connected layer. Training was executed over 100 epochs, using, similarly to the HEGClass model, the Cross-Entropy loss function and the Adam optimizer.

3.3. YOLO Training for Traffic Signs

For the detection and recognition of traffic signs, we start from the pre-trained YOLOv8 model, with experimentation also conducted using the YOLOv5 model prior to transitioning to the v8 version. Fine-tuning of YOLOv8 was carried out using two distinct datasets: Traffic Objects and TSDY. The final set of weights chosen for the application was derived from the dual fine-tuning of YOLOv8 with both datasets.

The initial fine-tuning with the Traffic Objects dataset involved 1802 images for training and 919 for validation. Despite starting with 3000 images, adjustments were made to the training and validation sets due to imbalance issues within the original dataset, which persisted even after categorizing labels based on sign categories as described in the Dataset section. Subsequently, proceeding from the fine-tuned weights, the model underwent retraining with images from the TSDY dataset, utilizing 600 images for training and 141 for validation.

The dual fine-tuning approach resulted in enhanced performance, as evidenced by improved final accuracy and heightened generalization capabilities in detecting road signs, even in images sourced from the PDD dataset and exhibiting varied lighting conditions. To further enrich the diversity and generalization capabilities of the fine-tuned YOLO model, diverse image augmentation techniques from the Albumentations library were employed during training to simulate real-world conditions, including Blur, MedianBlur, and CLAHE (Contrast Limited Adaptive Histogram Equalization). To streamline the training process, the Stochastic Gradient Descent (SGD) optimizer was utilized, with an initial learning rate of 0.01.
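The two-stage fine-tuning just described maps naturally onto the Ultralytics training API; a hedged sketch is shown below. The dataset YAML file names and the epoch budget are placeholders (the paper does not report them), while the SGD optimizer and the 0.01 initial learning rate follow the text. When the Albumentations library is installed, Ultralytics can apply augmentations such as Blur, MedianBlur and CLAHE during training.

<pre>
# Hedged sketch of the dual fine-tuning of YOLOv8 with the Ultralytics API.
# Dataset YAML paths and epoch counts are assumptions; optimizer and lr0 follow the paper.
from ultralytics import YOLO

# Stage 1: fine-tune the pre-trained model on the Traffic Objects dataset.
model = YOLO("yolov8n.pt")
model.train(data="traffic_objects.yaml", epochs=50, imgsz=640,
            optimizer="SGD", lr0=0.01)

# Stage 2: continue from the stage-1 weights on the TSDY dataset.
# (The weights path depends on the run directory reported by stage 1.)
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="tsdy.yaml", epochs=50, imgsz=640,
            optimizer="SGD", lr0=0.01)

# Inference on an external frame returns one box per detected sign.
results = model("external_frame.jpg")
for box in results[0].boxes:
    print(int(box.cls), box.conf.item(), box.xyxy[0].tolist())
</pre>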
Post-training, the model returns a text file for each image processed, containing one line per detected sign alongside its position. By processing the pixel coordinates from this file, sign position information was extracted to reconstruct the sign's center and edges within the grid cells. In instances where signs spanned multiple cells, multiple coordinates were necessary for accurate identification.

3.4. Application: Merging the Methods

The final application, depicted in Figure 5, comprises two primary components and generates a CSV report detailing the overall behavior of a driver. It requires as input an internal video capturing the driver and an external video recording the street view. For ease of analysis, synchronization of a 30-second video between the two components is necessary.

Figure 5: Pipeline of the application presented in the paper. The internal and external videos are simultaneously processed to extract relevant features (facial landmarks and road sign bounding boxes, respectively). The features are then used to determine the active ROIs, which are then compared to generate the attention level of the driver.

After initialization, external frames undergo analysis using the YOLOv8 model trained on road signs, producing a text file detailing the detected signals along with their specifications, including the Regions of Interest (ROIs) where these signals are located. For frames with at least one detected sign, the corresponding internal image is used to extract information via the GDHPD module. The image and information are then passed through the network to classify the driver's gaze position. Upon obtaining the prediction of the observed cell in a given frame, it is compared with the position of the corresponding sign. In frames with multiple detected road signs, each corresponding ROI is considered active, thus rendering the driver alert when focusing on any of them without favoring any type of signal.

Given that a single road sign can span multiple cells, 5 points characterize the object's position: the four corners and the center. If the driver looks at a cell containing a partial view of the sign, accounting for the peripheral vision of human eyes, we consider them attentive. Moreover, in the absence of signals or when the driver's gaze is directed to cells 4/5/6 (representing the entire road surface), they are still deemed attentive to the street. If the gaze is directed to cells 7 or 8, indicating focus on the car's dashboard or infotainment system, the driver's engagement is noted accordingly.

Finally, a CSV file is generated to store the analysis results from the 30-second videos. Each row contains data including the frame count (consistent across internal and external videos), the number of road signs detected in that frame, the cell number(s) housing the detected signals, the predicted cell value from the driver's gaze network, the number of observed signals following ROI matching, and an indication of the driver's attentiveness.
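The matching rules above can be condensed into a small helper, shown below as a hedged sketch: each detected sign is expanded into the grid cells touched by its four corners and center, and the cell-based rules are then applied (gaze on a signed cell or on the road cells 4-6, or frames without signs, count as attentive). Glances at the dashboard cells 7-8 are recorded in their own field, since the paper only says they are "noted accordingly"; the exact rule encoding in the authors' application may differ.

<pre>
# Hedged sketch of the frame-level attention check used when merging the two modules.
# Grid numbering (1..9 row by row, 0 = outside) mirrors the earlier grid sketch;
# the default frame size matches the 1440x1080 external resolution reported above.
ROAD_CELLS = {4, 5, 6}        # middle row of the grid: the road surface
DASHBOARD_CELLS = {7, 8}      # lower cells: dashboard / infotainment area

def cell_of(x, y, width, height):
    """3x3 grid cell (1..9, row by row) containing pixel (x, y) of the external frame."""
    col = min(int(3 * x / width), 2)
    row = min(int(3 * y / height), 2)
    return row * 3 + col + 1

def sign_cells(bbox, width, height):
    """Cells touched by a detected sign: its four corners plus its center."""
    x1, y1, x2, y2 = bbox
    pts = [(x1, y1), (x2, y1), (x1, y2), (x2, y2), ((x1 + x2) / 2, (y1 + y2) / 2)]
    return {cell_of(x, y, width, height) for x, y in pts}

def frame_report(frame_idx, sign_boxes, gaze_cell, width=1440, height=1080):
    """One report row per frame: detected-sign cells, gaze cell, and attentiveness."""
    active = [sign_cells(b, width, height) for b in sign_boxes]
    observed = sum(1 for cells in active if gaze_cell in cells)
    attentive = observed > 0 or gaze_cell in ROAD_CELLS or not sign_boxes
    return {
        "frame": frame_idx,
        "signs_detected": len(sign_boxes),
        "sign_cells": sorted(set().union(*active)) if active else [],
        "gaze_cell": gaze_cell,
        "observed_signs": observed,
        "on_dashboard": gaze_cell in DASHBOARD_CELLS,
        "attentive": attentive,
    }

# Example: two signs in the middle-right area while the driver looks at cell 6.
print(frame_report(0, [(1100, 350, 1250, 500), (900, 380, 1000, 480)], gaze_cell=6))
</pre>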
4. Results

The objective of this research is to develop a comprehensive system capable of analyzing an individual's attention while driving using only two synchronized videos as input. Given the scarcity of references on the simultaneous analysis of internal and external perspectives, all subsequent evaluations and comparisons will focus on the individual components constituting the final system. Nonetheless, through extensive testing conducted with the PDD dataset, comprising approximately 194 videos each lasting 30 seconds, the final application demonstrates commendable performance.

4.1. Gaze Classification

Concerning face detection and landmark extraction for facial rotation angle calculation, the MTCNN model outperformed the Haar Cascade Classifier. This superiority stems from MTCNN's ability to handle various facial orientations, which is crucial for our GDHPD dataset as it contains images with rotated or profiled faces. Additionally, MTCNN's prediction of landmarks, including the center of the pupil, proved vital for training the final classification network. However, occasional failures in face detection or slight misplacements of landmarks introduce minor errors in this initial phase.

For predicting gaze direction, a zone-based classification approach was chosen over regression due to the difficulty in precisely determining the exact point on the road the individual is looking at, coupled with the human eye's ability to perceive a broad area. Despite testing various methods, the SVM-based approach struggled to exceed a 70% accuracy level, likely due to the similarity in feature values across the 1012 samples, particularly those derived from ResNet for eye images.

Transitioning to neural network-based methods, the ClassNet network yielded lower accuracy than SVM, even after experimenting with different feature combinations. Training a network based on the VGG-16 architecture from scratch yielded better results, with an accuracy level of 81%. However, the limited size of our dataset and computational constraints hindered achieving satisfactory performance through this approach. Hence, we adopted the hybrid HEGClass approach, achieving an impressive 96% accuracy and 94.3% F1-score without additional data. Comprehensive accuracy and F1-score results are shown in Table 2.

Gaze Classifiers Results
Method            Accuracy   F1-Score
HEGClass            96         94.3
SVM                 74.5       74
ClassNet            56         45.6
VGG16-based Net     81         79

Table 2: Accuracy and F1-Score of all the tested and compared methods for Gaze Classification. Our novel approach shows the overall best results among all the analyzed methods.

4.2. YOLOv8 Classification and Detection

Through dual fine-tuning of the YOLOv8 network using the Traffic Object Dataset and the TSDY dataset, an impressive final F1-score of around 95% was achieved, with an example of prediction on the PDD dataset shown in Figure 6. We observed an interesting phenomenon when training the network solely with the Traffic Objects dataset, where the F1-score is significantly lower. Specifically, the training of YOLOv8 with the Traffic Objects dataset yielded an overall accuracy of approximately 65%-70%. Performing the same process with YOLOv5, instead, unexpectedly showed a higher accuracy (around 80%), albeit with occasional misclassifications of elements such as empty spaces between tree branches. In any case, for the scope of this project this comparison is not particularly relevant, given the much higher accuracy obtained with dual fine-tuning.

Figure 6: Frame extracted from the PDD dataset predicted by the fine-tuned YOLOv8 model. The predicted traffic signs and the corresponding labels and confidences are highlighted in orange.
Despite the substantial improvement in generalization capabilities achieved through dual training, errors in sign recognition persist. Certain objects along the road may be mistaken for road signs, such as advertisements containing elements that, at low resolution, could be confused with signs. While this issue is present, its impact on the overall results remains manageable and could potentially be mitigated with a wider variety of images. Another challenge arises from grouping signs of different shapes and colors into the same class, creating a bias in their classification. Additionally, signs containing other signs within them may only become relevant in specific situations, such as parking signs reserved for disabled individuals. For this reason, some signs were excluded from the training phase.

Given the high accuracy values in detection, adjusting the confidence threshold can help alleviate misclassification issues. Signs may be recognized even when rotated, facing the opposite direction of the lane, or located in irrelevant areas. In such cases, they are counted as points of inattention. Due to class imbalance, accurately classifying the type of road sign remains a challenge. Consequently, for our purposes, only the information related to the bounding box defining the sign's position is extracted, without specifying the type of sign. Despite attempts to simplify the dataset to recognize only one class, "TRAF_SIGN", challenges persisted in distinguishing signs from unrelated environmental areas. Therefore, the decision was made to revert to using the original labels.

4.3. Overall Analysis

The final results of the application, pertaining to the prediction of the driver's average attention while viewing a video, exhibit high performance across most cases, with a few notable exceptions. Tests were conducted on 191 videos, each lasting 30 seconds, sourced from the PDD dataset. Following initial evaluations, videos compromised by low-light conditions (such as those filmed in almost night-time environments) or excessive blurring of frames, rendering accurate prediction unfeasible, were eliminated.

In instances where images lack clarity, are blurry, or exhibit excessive shaking, the predominant predicted class is 0, indicating the model's failure to accurately identify the correct gaze position. Moreover, during nighttime or low-light scenarios, accurate gaze evaluation is significantly impeded by diminished brightness. Additionally, YOLO struggles with precise detection of relevant signs, often leading to confusion. Classes 5 and 6 are frequently identified as the gaze position during driving, aligning with the fact that these areas correspond to central regions of the windscreen. In some situations, such as when the vehicle is stationary at a traffic light or in traffic congestion, the system may recognize the same signs across multiple frames. However, drivers may not consistently attend to them throughout, as they may have already observed them and they may not be of immediate significance at that moment.

5. Conclusions

This work aimed to develop and assess a comprehensive system for evaluating attention to traffic signs in driving environments. We accomplished this by creating two new datasets (GDHPD, PDD) and modifying two existing ones (Traffic Objects, TSDY) to better suit our task requirements. The final application was divided into two parts, utilizing YOLOv8 for sign prediction and MTCNN + HEGClass for gaze position classification.

Despite encountering challenges during various training and testing phases, as described in the Results section, the overall accuracy of the final system remains very high, notwithstanding the partial errors accumulated by its constituent parts.

These challenges serve as valuable insights for future research endeavors. Opportunities for improvement include implementing mechanisms to track seen and unseen signals, enhancing prediction accuracy in diverse lighting and atmospheric conditions through dataset augmentation or pre-processing techniques, and expanding datasets to ensure greater completeness.

Overall, this work presents significant potential for further refinement and advancement, promising avenues for enhancing the performance and robustness of attention evaluation systems in driving contexts.

References
[1] G. Fitch, S. Soccolich, F. Guo, J. McClafferty, Y. Fang, R. Olson, M. Pérez-Toledano, R. Hanowski, J. Hankey, T. Dingus, The impact of hand-held and hands-free cell phone use on driving performance and safety-critical event risk, 2013.
[2] W. Wang, X. Lu, P. Zhang, H. Xie, W. Zeng, Driver action recognition based on attention mechanism, in: 2019 6th International Conference on Systems and Informatics (ICSAI), 2019, pp. 1255–1259. doi:10.1109/ICSAI48974.2019.9010589.
[3] A. J. McKnight, A. S. McKnight, The effect of cellular phone use upon driver attention, Accident Analysis & Prevention 25 (1993) 259–265.
[4] J. C. de Winter, D. Dodou, The driver behaviour questionnaire as a predictor of accidents: A meta-analysis, Journal of Safety Research 41 (2010) 463–470.
[5] K. J. Anstey, J. Wood, S. Lord, J. G. Walker, Cognitive, sensory and physical factors enabling driving safety in older adults, Clinical Psychology Review 25 (2005) 45–65.
[6] E. Yurtsever, J. Lambert, A. Carballo, K. Takeda, A survey of autonomous driving: Common practices and emerging technologies, IEEE Access 8 (2020) 58443–58469.
[7] Y. Hou, C. Wang, J. Wang, X. Xue, X. L. Zhang, J. Zhu, D. Wang, S. Chen, Visual evaluation for autonomous driving, IEEE Transactions on Visualization and Computer Graphics 28 (2021) 1030–1039.
[8] Y. Xia, D. Zhang, J. Kim, K. Nakayama, K. Zipser, D. Whitney, Predicting driver attention in critical situations, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, Springer, 2019, pp. 658–674.
[9] D. Yang, Y. Wang, R. Wei, J. Guan, X. Huang, W. Cai, Z. Jiang, An efficient multi-task learning cnn for driver attention monitoring, Journal of Systems Architecture (2024) 103085.
[10] E. Yüksel, T. Acarman, Experimental study on driver's authority and attention monitoring, in: Proceedings of 2011 IEEE International Conference on Vehicular Electronics and Safety, 2011, pp. 252–257. doi:10.1109/ICVES.2011.5983824.
[11] G. Capizzi, G. L. Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271–279. doi:10.1016/j.neunet.2020.06.001.
[12] B. A. Nowak, R. K. Nowicki, M. Woźniak, C. Napoli, Multi-class nearest neighbour classifier for incomplete data handling, volume 9119, 2015, pp. 469–480. doi:10.1007/978-3-319-19324-3_42.
[13] C. Ciancarelli, G. De Magistris, S. Cognetta, D. Appetito, C. Napoli, D. Nardi, A gan approach for anomaly detection in spacecraft telemetries 531 LNNS (2023) 393–402. doi:10.1007/978-3-031-18050-7_38.
[14] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[15] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[16] S. Russo, S. I. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[17] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information (Switzerland) 14 (2023). doi:10.3390/info14120644.
[18] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote emdr therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[19] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.
[20] D. Yang, X. Li, X. Dai, R. Zhang, L. Qi, W. Zhang, Z. Jiang, All in one network for driver attention monitoring, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2258–2262. doi:10.1109/ICASSP40776.2020.9053659.
[21] D. Melesse, M. Khalil, E. Kagabo, T. Ning, K. Huang, Appearance-based gaze tracking through supervised machine learning, in: 2020 15th IEEE International Conference on Signal Processing (ICSP), volume 1, 2020, pp. 467–471. doi:10.1109/ICSP48669.2020.9321075.
[22] X. Zhang, Y. Sugano, M. Fritz, A. Bulling, MPIIGaze: Real-world dataset and deep appearance-based gaze estimation, 2017. URL: https://arxiv.org/abs/1711.09017.
[23] S. Vora, A. Rangesh, M. M. Trivedi, Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis, 2018. URL: https://arxiv.org/abs/1802.02690.
[24] N. Mizuno, A. Yoshizawa, A. Hayashi, T. Ishikawa, Detecting driver's visual attention area by using vehicle-mounted device, in: 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2017, pp. 346–352. doi:10.1109/ICCI-CC.2017.8109772.
[25] J. Terven, D.-M. Córdova-Esparza, J.-A. Romero-González, A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas, Machine Learning and Knowledge Extraction 5 (2023) 1680–1716.
[26] A. A. Lima, M. M. Kabir, S. C. Das, M. N. Hasan, M. Mridha, Road sign detection using variants of yolo and r-cnn: An analysis from the perspective of bangladesh, in: Proceedings of the International Conference on Big Data, IoT, and Machine Learning: BIM 2021, Springer, 2022, pp. 555–565.
[27] A. Palazzi, D. Abati, S. Calderara, F. Solera, R. Cucchiara, Predicting the driver's focus of attention: The dr(eye)ve project abs/1807.02588 (2018). URL: https://arxiv.org/abs/1705.03854. arXiv:1705.03854.
[28] I. Dua, T. A. John, R. Gupta, C. Jawahar, Dgaze: Driver gaze mapping on road, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 5946–5953.
[29] A. Yoshizawa, H. Iwasaki, Analysis of driver's visual attention using near-miss incidents, in: 2017 IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2017, pp. 353–360. doi:10.1109/ICCI-CC.2017.8109773.
[30] H. M. Peixoto, A. M. G. Guerreiro, A. D. D. Neto, Image processing for eye detection and classification of the gaze direction, in: 2009 International Joint Conference on Neural Networks, 2009, pp. 2475–2480. doi:10.1109/IJCNN.2009.5178924.
[31] K. Guo, G. Yu, Z. Li, An new algorithm for analyzing driver's attention state, in: 2009 IEEE Intelligent Vehicles Symposium, 2009, pp. 21–23. doi:10.1109/IVS.2009.5164246.
[32] H. Lee, J. Seo, H. Jo, Gaze tracking system using structure sensor & zoom camera, in: 2015 International Conference on Information and Communication Technology Convergence (ICTC), 2015, pp. 830–832. doi:10.1109/ICTC.2015.7354677.
[33] A. G. Mavely, J. E. Judith, P. A. Sahal, S. A. Kuruvilla, Eye gaze tracking based driver monitoring system, in: 2017 IEEE International Conference on Circuits and Systems (ICCS), 2017, pp. 364–367. doi:10.1109/ICCS1.2017.8326022.
[34] H. Mohsin, S. H. Abdullah, Pupil detection algorithm based on feature extraction for eye gaze, in: 2017 6th International Conference on Information and Communication Technology and Accessibility (ICTA), 2017, pp. 1–4. doi:10.1109/ICTA.2017.8336048.
[35] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multi-task cascaded convolutional networks, 2022. URL: https://arxiv.org/abs/1604.02878. doi:10.48550/ARXIV.2210.07548.
[36] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, 2001, pp. I–I. doi:10.1109/CVPR.2001.990517.