=Paper=
{{Paper
|id=Vol-3695/p10
|storemode=property
|title=A Machine Learning based Real-Time Application for Engagement Detection
|pdfUrl=https://ceur-ws.org/Vol-3695/p10.pdf
|volume=Vol-3695
|authors=Emanuele Iacobelli,Samuele Russo,Christian Napoli
|dblpUrl=https://dblp.org/rec/conf/system/IacobelliR023
}}
==A Machine Learning based Real-Time Application for Engagement Detection==
Emanuele Iacobelli¹, Samuele Russo² and Christian Napoli¹,³,⁴
¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
² Department of Psychology, Sapienza University of Rome, 00185 Roma, Italy
³ Institute for Systems Analysis and Computer Science, Italian National Research Council, 00185 Roma, Italy
⁴ Department of Computational Intelligence, Czestochowa University of Technology, 42-201 Czestochowa, Poland
Abstract
The study of human engagement has significantly grown in recent years, particularly accelerated by the interaction with a
growing number of smart computing machines [1, 2, 3]. Engagement estimation has significant importance across various
domains of study, including advertising, marketing, human-computer interaction, and healthcare [4, 5, 6]. In this paper, we
propose a real-time application that leverages a single RGB camera to capture user behavior. Our approach implements
a novel method for estimating human engagement in real-world scenarios by extracting valuable information from the
combination of facial expressions and gaze direction analysis. To acquire this data, we employed fast and accurate machine
learning algorithms from the external library dlib, along with custom versions of Residual Neural Networks implemented
from scratch. For training our models, we used a modified version of the DAiSEE dataset, a multi-label user affective states
classification dataset that collects frontal videos of 112 different people recorded in real-world scenarios. In the absence
of a baseline for comparing the results obtained by our application, we conducted experiments to assess its robustness in
estimating engagement levels, leading to very encouraging results.
Keywords
Engagement Detection, Eye Tracking, Face Expression Recognition, Machine Learning, Residual Neural Networks
SYSTEM 2023: 9th Scholar’s Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
Contact: iacobelli@diag.uniroma1.it (E. Iacobelli); samuele.russo@uniroma1.it (S. Russo); cnapoli@diag.uniroma1.it (C. Napoli)
ORCID: 0009-0003-1379-9106 (E. Iacobelli); 0000-0002-1846-9996 (S. Russo); 0000-0002-3336-5853 (C. Napoli)

1. Introduction

In today’s rapidly evolving digital landscape, humanity interacts with a growing number of smart computing machines. This situation highlights the increasing trend of direct interactions with smart devices in various domains, including household assistance, customer service, and industrial applications. Despite this technological advancement, many devices lack algorithms capable of perceiving and responding to users’ attentional states. Traditional user interfaces still heavily rely on explicit input or predefined triggers, resulting in often inefficient and mechanical interactions.

The potential for automatic acquisition and interpretation of users’ engagement represents a huge usability improvement for Human-Computer Interaction (HCI) and Human-Robot Interaction (HRI) systems. This capability holds the promise of ushering in more advanced and intuitive interactions, elevating system responsiveness, and enhancing overall user experience. In detail, engagement is a fundamental aspect of the human experience and captures in depth the quality of an individual’s involvement, focus, and interaction with their surroundings. For detecting it, facial expressions and gaze direction are crucial elements. In particular, the motion of the eyes is an important element to employ since it highlights the psychological mechanisms behind the human mind and naturally gravitates toward objects, people, or specific regions of interest in the environment.

In this paper, we propose a real-time application that combines gaze direction and face expression analysis to determine the engagement level of a person while interacting with intelligent systems. To achieve this, we defined two machine-learning pipelines leveraging RGB videos of a person interacting with the system. The first pipeline focuses on the analysis of the user’s facial expressions and employs a residual neural network architecture. The second pipeline concentrates on the estimation of the user’s gaze direction by combining pre-trained face and facial landmark detection models with a fast computer vision algorithm that we developed. Predictions of the user’s engagement level are ultimately calculated by merging the outputs of these two pipelines using a weighted linear interpolation formula.

Addressing the challenge posed by the absence of a baseline for reference, our primary hurdle in handling this task involved creating an appropriate dataset for training our models. We opted to customize the Affective States in E-Environment Dataset (DAiSEE) [7], a comprehensive collection of multi-label videos designed for identifying user affective states. Given that the estimation of
engagement levels necessitates both temporal and spatial information, videos proved to be an ideal choice. However, to mitigate the high computational and resource costs associated with treating videos as opposed to single images, we implemented mandatory preprocessing steps to optimize memory and computational efficiency.

Continuing to tackle the absence of a reference baseline, we conducted experiments to assess the robustness and effectiveness of our application. This evaluation was carried out using quantitative metrics.

1.1. Roadmap

This paper is organized in the following way: first of all, a summary of the state-of-the-art systems and techniques to recognize the human engagement level is presented (see Section 2). Subsequently, a description of the dataset that we have developed for training our models is illustrated (see Section 3). Following this, a detailed overview of the architectures employed for our application is provided (see Section 4). Then, the results obtained by testing our system considering the quantitative metrics are presented (see Section 5). Finally, we summarize the article’s content and outline the possible viable improvements that can be made to our application (see Section 6).

2. Related Works

The field of engagement level detection has seen significant growth, particularly fueled by the global pandemic. With many individuals compelled to participate in remote meetings, analyzing engagement in online sessions has become a pivotal focus, leading to the development of numerous systems. Some studies have explored physiological factors like fatigue [8], brain status and data [9, 10], blood flow and heart rate [11], and galvanic skin conductance [12]. However, due to the recent needs and the remote nature of this task, there has been widespread exploration of inexpensive and unobtrusive technologies. Eye trackers [13, 14] and facial expression recognition models [15, 16] using simple RGB cameras are now among the most promising options.

In a comprehensive review treated in [17], the state-of-the-art engagement detection techniques within the context of online learning are explored. The authors classify existing methods into three primary categories: automatic, semi-automatic, and manual. This classification is based on the methods’ dependencies on learners’ participation. Furthermore, each category is subdivided based on the type of input data used (e.g., audio, video, text). Among these, video-based methods in the automatic category that leverage facial expressions emerge as the most prevalent. These methods are favored for their ease of implementation and their proven effectiveness in achieving accurate results. The prominence of such techniques underscores the significance of visual cues, particularly facial expressions, in gauging user engagement levels during online interactions.

The work presented in [18] investigates the suitability of three popular models, All-CNN [19], NiN-CNN [20], and VD-CNN [21], along with a customized Convolutional Neural Network (CNN) [22], for detecting the engagement level of online learners in educational activities. All the analyzed models leverage facial expressions for scalable and accessible engagement detection. Each of the three base models has its distinct features, and the customized CNN combines these advantageous features, for instance by replacing linear convolutional layers with a multilayer perceptron, increasing depth with small convolutional filters, and replacing some max-pooling layers with convolutional layers with increased stride. All the analyzed models were evaluated on the DAiSEE dataset (extensively explained in Section 3) and the results reveal that the customized CNN outperforms the base models in detecting the engagement level.

In a similar study proposed in [23], the automatic recognition of student engagement from facial expressions is examined using a three-stage pipeline. The initial step involves face registration, detection, and the estimation of key facial landmarks (e.g., eyes, nose, and mouth) by using the approach described in [24]. The second stage employs four binary classifiers to classify the cropped face, distinguishing whether it belongs to one of four engagement levels (l ∈ {1, 2, 3, 4}), where 1 signifies no engagement and 4 represents full focus. The authors compared three models for the binary classifier: Support Vector Machines with Gabor features (SVM (Gabor)) [24], Multinomial Logistic Regression with expression outputs from the Computer Expression Recognition Toolbox (MLR (CERT)) [24], and GentleBoost with Box Filter features (Boost (BF)) [25]. This study reveals that SVM (Gabor) yields the best results. The third stage integrates the outputs of all four binary classifiers, utilizing a Multinomial Logistic Regressor model to estimate the final engagement level.

In [26], the authors introduced a regression model for predicting engagement level as a single scalar value from RGB video streams captured by two cameras on the torso and head of an autonomous mobile robot, utilized for tours at The Collection museum in Lincoln, UK. The model incorporates CNN and Long Short-Term Memory (LSTM) [27] networks for video data analysis. Training and evaluation of this regressor network were conducted using a dataset built from the recordings of the autonomous tour guide robot in the public museum. The dataset, manually annotated by three independent people, assigns scalar values in the range [0,1] to represent the user’s engagement level. The model demonstrates
optimal engagement level predictions, achieving a Mean Squared Error (MSE) prediction loss of up to 0.126 on the test dataset.

The research conducted in [28] focused on investigating the Deep Facial Spatiotemporal Network (DFSTN). Comprising two integral modules, namely the pretrained SE-ResNet-50 (SENet) utilized for extracting facial spatial features and an LSTM network with Global Attention for generating an attentional hidden state, the DFSTN synergistically captures both facial spatial and temporal information. This combined information is crucial for enhancing engagement prediction performance. The model underwent testing on the DAiSEE dataset, achieving an accuracy of 58.84%, showcasing its capability to outperform numerous existing engagement prediction networks trained on the same dataset.

In [29], the estimation of human attention is based on the direction of the user’s face, considering five different directions: central, lateral to the left, lateral to the right, towards up, and towards down. If the user looks in any direction other than the central one, they are assumed to be distracted, with only the central gaze indicating full focus. The authors created a dataset for training, comprising 270 videos of approximately 20 seconds each from 18 different individuals. To enhance data diversity, GAN-based data augmentation techniques were employed to generate new samples, diversifying somatic features in the recorded videos. Transfer Learning [30] was utilized to construct the classifier. Specifically, a pre-trained VGG16 [21] architecture was employed, with three additional dense layers attached at the end for attention estimation.

The approach presented in [31] offers a novel method for estimating driver attention. Departing from conventional methods that primarily focus on a single frontal scene image to analyze driver gaze or head pose, this method introduces a dual-view scene. The additional input data includes the frontal view of the car that the driver is observing. Specifically, the gaze direction is detected and transformed into a probability map of the same size as the road view image, while salient features of temporal and spatial dimensions are extracted from the road view images. These features are then combined and fed into a multi-resolution neural network tasked with driver attention estimation, generating a heat map on the images representing the road. The training dataset for this model is constructed using virtual reality and a driving simulator, incorporating images from the DR(eye)VE dataset [32] that depict the frontal view of the road observed by the driver. Experimental results showcase the feasibility and superiority of the proposed method over existing approaches.

Figure 1: Some sample instances present in our customized version of the DAiSEE before converting them to grayscale and applying histogram equalization. From left to right: a) very low engagement, b) low engagement, c) high engagement, and d) very high engagement.

3. Dataset

The baseline dataset utilized for training our networks is a customized version of the Dataset for Affective States in E-Environments (DAiSEE), a large collection of multi-label videos designed for identifying user affective states, including boredom, confusion, engagement, and frustration in real-world scenarios. This dataset comprises 9068 frontal view videos featuring 112 distinct individuals expressing different levels of affective states. Each of these states was manually ranked utilizing the following scale: very low, low, high, and very high.

To create our customized dataset we initially modified the task from which DAiSEE was originally built. We switched from multi-label to multi-class classification, associating only the level of engagement with each video and removing the labels for the other affective states. Example instances present inside our customized version of the DAiSEE are displayed in Fig. 1. Subsequently, we divided the dataset into Training, Validation, and Test sets, with proportions of 60%, 20%, and 20%, respectively. However, the resulting sets were highly unbalanced due to a small portion of videos classified as very low and low engagement. To address this issue, we downsampled the dataset in several ways to achieve a more balanced distribution. First of all, redundancy in subjects was reduced by removing multiple videos of the same individuals. Then, through the use of a normal distribution, we sampled the remaining data instances considering the frequency of labels in the videos with the following formula:

$$n_i = f_i \cdot \frac{n_{tot} - f_i}{n_{tot}} \cdot \lambda \qquad (1)$$
where λ is the reduction coefficient (that we have set to 0.25), n_tot represents the total number of samples in a given set, and f_i denotes the frequency of label i in that set. Table 1 displays information both before and after the preprocessing procedures on the dataset.

Table 1: Number of sample instances before and after the customization of the Training, Validation, and Test sets derived from the DAiSEE.

| Engagement Level | Original Training | Original Validation | Original Test | Customized Training | Customized Validation | Customized Test |
|---|---|---|---|---|---|---|
| Very Low | 34 | 23 | 4 | 34 | 23 | 4 |
| Low | 214 | 160 | 81 | 52 | 37 | 17 |
| High | 2649 | 912 | 861 | 341 | 110 | 105 |
| Very High | 2585 | 625 | 777 | 344 | 100 | 102 |
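To make the sampling rule of Eq. (1) concrete, the following minimal Python sketch computes the per-label target counts and subsamples each label accordingly; the function and variable names are illustrative, and uniform sampling stands in for the normal-distribution sampling described above.

```python
import random

def downsample_split(videos_by_label, reduction=0.25):
    """Compute per-label target counts with Eq. (1) and subsample each label.

    videos_by_label: dict mapping an engagement label to the list of its videos.
    reduction: the reduction coefficient lambda (set to 0.25 in the paper).
    """
    n_tot = sum(len(v) for v in videos_by_label.values())
    kept = {}
    for label, videos in videos_by_label.items():
        f_i = len(videos)                                     # frequency of label i in the set
        n_i = round(f_i * (n_tot - f_i) / n_tot * reduction)  # Eq. (1)
        n_i = max(1, min(n_i, f_i))                           # keep the target count feasible
        # Uniform sampling here is a stand-in for the normal-distribution sampling in the text.
        kept[label] = random.sample(videos, n_i)
    return kept
```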
Since DAiSEE contains recordings captured in dynamic environments, each of these videos may present different and various disturbances, such as changing light conditions, visual occlusions, or unconstrained user motion. To improve video quality, we applied manual color and intensity adjustments, focusing on enhancing contrast, brightness, and sharpness for optimal detail resolution. Examples include adjustments to the gamma value, which effectively improves visibility in varying light conditions or exposure levels by normalizing image histograms, making videos more suitable for continuous analysis. Another example is the sharpness adjustments, which enhance fine details and edges, making facial features more prominent.

Despite these modifications, the dataset still demanded excessive memory requirements. Consequently, we opted for further adjustments. Considering that the majority of engagement information is likely derived from human expressions and gaze attention, with a smaller contribution from gestures, we decided to crop from each video only the user’s faces. This step also aimed to eliminate potential issues and biases arising from background data. The face cropping was automated using a pre-trained Single Shot Multibox Detector (SSD) model from the Caffe framework [33].
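As an illustration of this step, a Caffe SSD face detector can be loaded through OpenCV’s dnn module as sketched below; the prototxt and caffemodel file names are the ones commonly distributed with OpenCV’s ResNet-10 SSD face detector and are assumptions, since the paper does not name the exact files.

```python
import cv2
import numpy as np

# Assumed file names for a Caffe SSD face detector; the exact model files used
# by the authors are not specified in the paper.
PROTO = "deploy.prototxt"
WEIGHTS = "res10_300x300_ssd_iter_140000.caffemodel"
net = cv2.dnn.readNetFromCaffe(PROTO, WEIGHTS)

def detect_face(frame, conf_threshold=0.5):
    """Return the bounding box (x1, y1, x2, y2) of the most confident face, or None."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()          # shape: (1, 1, N, 7) -> [.., confidence, x1, y1, x2, y2]
    best = max(range(detections.shape[2]), key=lambda i: detections[0, 0, i, 2])
    if detections[0, 0, best, 2] < conf_threshold:
        return None
    box = detections[0, 0, best, 3:7] * np.array([w, h, w, h])
    return box.astype(int)
```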
To prevent the generation of unstable videos, we applied a stabilization algorithm (see pseudocode in Fig. 2) that facilitates smooth transitions between subsequently detected faces by stabilizing the position of their bounding boxes. At the start of each video, the size of the first detected face’s bounding box is stored. In all the following frames, this dimension is used to resize the bounding box of the subsequently detected faces. Additionally, if the distance between the centers of two consecutive detected faces is smaller than a manually adjusted threshold γ, the center of the newest detected face is replaced with the center of the bounding box of the previously detected face. Finally, each frame is converted to grayscale, and histogram equalization is applied to normalize the color information.

Figure 2: Pseudocode of the face stabilization algorithm used to prevent unstable videos while cropping the user’s faces from the original video in the DAiSEE.
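Because the pseudocode of Fig. 2 is not reproduced here, the following sketch re-implements the stabilization step as it is described above, under the assumption that the detector (for instance the SSD sketch given earlier) returns boxes as (x1, y1, x2, y2); the threshold value and helper names are illustrative.

```python
import cv2

def stabilize_and_crop(frames, detect_face, gamma_threshold=20):
    """Crop faces with a fixed box size and a sticky center, as described in the text."""
    ref_size, prev_center = None, None
    crops = []
    for frame in frames:
        box = detect_face(frame)
        if box is None:
            continue
        x1, y1, x2, y2 = box
        center = ((x1 + x2) // 2, (y1 + y2) // 2)
        if ref_size is None:
            ref_size = (x2 - x1, y2 - y1)          # size of the first detected face
        if prev_center is not None:
            dist = ((center[0] - prev_center[0]) ** 2 +
                    (center[1] - prev_center[1]) ** 2) ** 0.5
            if dist < gamma_threshold:             # small jitter: reuse the previous center
                center = prev_center
        prev_center = center
        w, h = ref_size
        x, y = center[0] - w // 2, center[1] - h // 2
        face = frame[max(y, 0):y + h, max(x, 0):x + w]
        gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
        crops.append(cv2.equalizeHist(gray))       # grayscale + histogram equalization
    return crops
```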
Figure 3: Full pipeline of the real-time application. The Webcam Reader Module acquires the data in real-time and passes
them to the Face Cropping Model. This model crops the user’s face from the webcam images and passes them both to the
Facial Landmark Detection module and the Frame Buffer which has a capacity of 60 frames. Once the buffer is full, each new
frame is passed to the Gaze Direction Module and the Face Engagement Model. The predictions of these models are then
combined to produce the actual output of our system.
4. Methodology

The complete architecture of the real-time application we developed is illustrated in Fig. 3. Specifically, the system utilizes a single input video stream, captured through a webcam reader module implemented with the external library OpenCV [34], to feed two distinct models. The Face Engagement Model evaluates engagement based on facial expressions, while the Gaze Direction Model predicts engagement by analyzing the user’s focal point. Lastly, the predictions of these two models are combined to derive the final engagement level estimated by our application.

4.1. Face Engagement Model

This model is designed to estimate the user engagement level from frontal recording videos. We designed it as a customized version of the ResNet architecture [35] and we implemented different versions to identify the most effective one. In essence, a residual network employs skip connections to address the vanishing gradient problem. These connections allow information to directly backpropagate, circumventing previous layers. Moreover, a skip connection facilitates a residual block in learning the residual, which is the difference between the desired output and the current input of the layer. This approach makes it easier for the network to understand what input modifications are needed to achieve the desired output, rather than altering the entire input from scratch. This often translates to a more straightforward learning process for the network.

To address the human engagement level classification problem, our models needed to capture both spatial and temporal information. To enable the network to learn temporal information by analyzing multiple frames simultaneously in the same layer, we opted for 3D convolutional layers instead of the traditional 2D convolutional layers implemented in the original ResNet architecture. Learning temporal information is crucial for video analysis, as it allows the network to recognize complex patterns such as actions, gestures, or sequences of facial expressions. Due to this requirement, the model necessitates an initial period to populate a buffer of 60 frames, ensuring a sufficient amount of data for the correct utilization of the 3D convolutional layers. Once the buffer reaches its capacity, the prediction of the engagement level can begin. Subsequently, with the arrival of each new frame, the buffer is updated, and the oldest frame is discarded. We tested three versions of this architecture, differing mainly in the depth and the internal structure of the convolutional block used. Specifically, we implemented the 18-, 34-, and 50-layer versions.
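A minimal sketch of the rolling 60-frame buffer that feeds the 3D-convolutional network is shown below; the deque-based implementation and the tensor layout are assumptions rather than the authors’ code.

```python
from collections import deque
import numpy as np

BUFFER_SIZE = 60  # the model starts predicting only once 60 frames are available

frame_buffer = deque(maxlen=BUFFER_SIZE)  # appending beyond 60 discards the oldest frame

def push_frame(gray_face, model=None):
    """Add a preprocessed face crop to the buffer and run a prediction when it is full."""
    frame_buffer.append(gray_face)
    if len(frame_buffer) < BUFFER_SIZE:
        return None                      # still filling the buffer: no prediction yet
    clip = np.stack(frame_buffer)        # (60, H, W) stack of grayscale face crops
    clip = clip[None, None].astype(np.float32) / 255.0  # (batch, channel, depth, H, W)
    return model(clip) if model is not None else clip
```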
For all these architectures, we introduced 3D layers for batch normalization, max pooling, and average pooling. In detail, each convolutional block includes a batch normalization layer, and all convolutional layers employ the ReLU activation function. Only the last fully connected layer, responsible for the final prediction of the human’s engagement level, uses the Softmax activation function. During training, we utilized the He/Kaiming initialization technique [36], which initializes weights using a normal distribution with zero mean and a variance of 2/n, where n is the total number of inputs to the neuron. This initialization is specifically tailored for networks employing the ReLU activation function, mitigating the vanishing or exploding gradient problem.

Additionally, we employed the Focal Loss [37] as the training function, opting for it over the conventional Categorical Cross-Entropy. The principal reason is that the Focal Loss addresses the issue of unbalanced data by prioritizing examples the model struggles with, rather than those it confidently predicts. This ensures continuous improvement on challenging examples, preventing the model from becoming overly confident with easy ones. We implemented the following Focal Loss formula:

$$-(1 - p_i)^{\gamma} \ln(p_i) \qquad (2)$$

where γ represents the focusing parameter (typically a positive number) to be fine-tuned using cross-validation, and p_i denotes the predicted probability of the correct class.
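A possible PyTorch rendering of Eq. (2) as a multi-class focal loss is sketched below; the class name and the default γ = 2 are illustrative choices, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: -(1 - p_i)^gamma * log(p_i), as in Eq. (2)."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma  # focusing parameter, to be tuned via cross-validation

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        log_p = F.log_softmax(logits, dim=1)                        # log-probabilities
        log_p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log p_i of the true class
        p_t = log_p_t.exp()
        loss = -((1.0 - p_t) ** self.gamma) * log_p_t               # down-weights easy examples
        return loss.mean()
```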
Also, our training process incorporates early stopping with a patience value of 10 epochs, L2 regularization featuring a weight decay set to 1e-3, and an Adam Optimizer accompanied by a Learning Rate Scheduler [38] with a maximum learning rate of 1e-4 and a Gradient Scaler to reduce the range of magnitudes in the gradients. All the implementation details of the tested models are reported in Table 2. Following the training phase, we opted for the 50-layer version, with a batch size equal to 16, as the engagement network for our application, as it demonstrated the highest accuracy among the tested versions.

Table 2: Implementation details of the three architectures tested for the Face Engagement Model (k = kernel size, f = number of filters, s = stride; when the stride is not given, it is equal to 1).

| Layer Name | 18-layer | 34-layer | 50-layer |
|---|---|---|---|
| Convolution | k=[7,7,7], f=64, s=2 | k=[7,7,7], f=64, s=2 | k=[7,7,7], f=64, s=2 |
| Max Pool | k=[3,3,3], s=2 | k=[3,3,3], s=2 | k=[3,3,3], s=2 |
| Convolution Block | (k=[3,3,3], f=64; k=[3,3,3], f=64) ×2 | (k=[3,3,3], f=64; k=[3,3,3], f=64) ×3 | (k=[1,1,1], f=64; k=[3,3,3], f=64; k=[1,1,1], f=256) ×3 |
| Convolution Block | (k=[3,3,3], f=128; k=[3,3,3], f=128) ×2 | (k=[3,3,3], f=128; k=[3,3,3], f=128) ×4 | (k=[1,1,1], f=128; k=[3,3,3], f=128; k=[1,1,1], f=512) ×4 |
| Convolution Block | (k=[3,3,3], f=256; k=[3,3,3], f=256) ×2 | (k=[3,3,3], f=256; k=[3,3,3], f=256) ×6 | (k=[1,1,1], f=256; k=[3,3,3], f=256; k=[1,1,1], f=1024) ×6 |
| Convolution Block | (k=[3,3,3], f=512; k=[3,3,3], f=512) ×2 | (k=[3,3,3], f=512; k=[3,3,3], f=512) ×3 | (k=[1,1,1], f=512; k=[3,3,3], f=512; k=[1,1,1], f=2048) ×3 |
| Average Pool | Output size = 1×1×1 | Output size = 1×1×1 | Output size = 1×1×1 |
| Dropout | Rate = 0.4 | Rate = 0.4 | Rate = 0.4 |
| Linear | Neurons = 1024 | Neurons = 1024 | Neurons = 1024 |
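The training setup summarized before Table 2 could be wired together roughly as follows; PyTorch’s OneCycleLR and GradScaler are assumed stand-ins for the Learning Rate Scheduler [38] and the Gradient Scaler mentioned in the text, and the model and data below are placeholders, not the 3D ResNet itself.

```python
import torch
import torch.nn as nn

# Placeholder network and synthetic data standing in for the 3D ResNet and the video clips.
model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 32 * 32, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)  # L2 regularization
steps_per_epoch, epochs = 10, 3
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, steps_per_epoch=steps_per_epoch, epochs=epochs)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
criterion = nn.CrossEntropyLoss()  # the paper uses the focal loss sketched earlier

best_val, patience, bad_epochs = float("inf"), 10, 0   # early stopping bookkeeping
for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        clips = torch.randn(16, 1, 60, 32, 32)          # batch size 16, 60-frame clips
        labels = torch.randint(0, 4, (16,))
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        scaler.scale(loss).backward()                   # scaled gradients
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
    val_loss = loss.item()                              # stand-in for a real validation pass
    bad_epochs = 0 if val_loss < best_val else bad_epochs + 1
    best_val = min(best_val, val_loss)
    if bad_epochs >= patience:                          # early stopping with patience = 10
        break
```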
Figure 4: Gaze Direction Model Workflow: The input image, captured by the webcam, undergoes processing through the Face Cropping Module. This module is responsible for cropping the detected face, and the resulting image is then fed into the Face Landmark Detector. The Face Landmark Detector estimates the position of facial keypoints, which are subsequently utilized to crop the eye regions. Each eye image undergoes further analysis in the Gaze Direction Module, which assesses both horizontal and vertical directions of the gaze.

4.2. Gaze Direction Model

This model is designed to extract attention information from a person’s gaze in frontal recording videos. The gaze direction provides valuable insights into a person’s engagement during a task. The complete workflow of this model is displayed in Fig. 4. To implement this model, we combined two pre-trained neural networks available in the dlib library [39].

The first network is the Face Cropping Model, a CNN trained for face detection in general images. It not only identifies faces but also provides their bounding box coordinates and converts the input image to grayscale. Although the use of this network may appear redundant considering the customized dataset that we have employed for training the engagement model, it plays a crucial role in the real-time application. Specifically, it crops faces from the live stream frames and passes these images to both the Engagement Model and the Facial Landmark Detector.
The Facial Landmark Detector, the second network that we have employed from the dlib library, recognizes 68 2D facial landmarks (e.g., nose tip, corners of the mouth, and eyes) in a given face image. These facial landmarks serve two purposes: they are used to crop the eye regions based on the eye landmarks and to calculate the face orientation with respect to the vertical axis (yaw angle). This orientation is determined through the use of a vector starting from the midpoint between the eyes and terminating at the nose tip.
of a vector starting from the midpoint between the eyes 3 0.9
and terminating at the nose tip.
Estimating the focal point of the user is accomplished
through the Gaze Direction Module, a simple computer
on the sine of the face orientation. Updates related to
vision pipeline. Initially, the eye landmarks outlining
face position involve calculating the distance between
the eye contours are employed to create a mask that re-
the face bounding box center and the frame center. If this
moves extraneous pixels from each cropped eye image.
distance exceeds one-sixth of the total frame dimension,
Subsequently, the Otsu’s method [40] is applied to auto-
the right and left limits are adjusted. The adjustment is
matically threshold the image, distinguishing between
determined by normalizing the distance between the face
foreground (iris and pupil pixels) and background (sclera
and frame centers between 0 and 0.5. If the face shifts
pixels).
The resulting image is then horizontally and vertically divided around its center to estimate the gaze direction. Both vertical and horizontal gaze directions are quantified as values within the range of [-1,1]. Regarding horizontal gaze direction, a value approaching -1 indicates the user is looking to the left, while a value approaching 1 suggests a rightward gaze. A value around 0 indicates the user is looking at the center of the screen. Similarly, for vertical gaze direction, a value nearing 1 signifies a downward gaze and a value nearing -1 indicates an upward gaze.

To compute these directions, the density of white pixels representing the sclera is analyzed. For each eye image, the total number of white pixels is calculated. If this value is zero, it implies incorrect eye detection, and the current frame is skipped. Otherwise, for each sub-image generated, the percentage of white pixels in relation to the total number of white pixels in the corresponding original eye image is calculated. Then, the percentages belonging to the same direction of both eyes are averaged (e.g., the percentage of white pixels in the left sub-image of the left eye is averaged with the percentage of white pixels in the left sub-image of the right eye). Finally, the difference between these averages produces the value within the range of [-1,1] described earlier.
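The masking, Otsu thresholding, and white-pixel-ratio comparison described in the last paragraphs can be sketched as follows; the splitting and sign conventions are assumptions that would need to be calibrated to match the -1/+1 mapping used in the text.

```python
import cv2
import numpy as np

def eye_gaze_direction(gray_face, eye_points):
    """Return (horizontal, vertical) gaze values in [-1, 1] for one eye, or None."""
    pts = np.array(eye_points, dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    eye = gray_face[y:y + h, x:x + w]
    mask = np.zeros_like(eye)
    cv2.fillPoly(mask, [(pts - [x, y]).astype(np.int32)], 255)  # keep pixels inside the eye contour
    eye = cv2.bitwise_and(eye, mask)
    _, thresh = cv2.threshold(eye, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    total_white = cv2.countNonZero(thresh)
    if total_white == 0:                                  # eye not detected correctly: skip frame
        return None
    h2, w2 = thresh.shape[0] // 2, thresh.shape[1] // 2
    left = cv2.countNonZero(thresh[:, :w2]) / total_white   # white-pixel ratio per sub-image
    right = cv2.countNonZero(thresh[:, w2:]) / total_white
    top = cv2.countNonZero(thresh[:h2, :]) / total_white
    bottom = cv2.countNonZero(thresh[h2:, :]) / total_white
    # The sign of these differences can be flipped to match the -1 (left) / +1 (right)
    # and +1 (down) / -1 (up) convention adopted in the text.
    return right - left, top - bottom

# The per-eye values of both eyes are then averaged direction-by-direction, as in the text.
```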
To effectively use the estimated gaze direction, it is crucial to consider the limits of the user’s field of view, which may vary based on the task. In our screen-based task implementation, we assume that a face orientation deviation exceeding 30 degrees from the camera-aligned orientation indicates the user is no longer looking at the monitor.

Initially, these limits are set at the task’s beginning and dynamically adjusted based on the user’s face position and orientation relative to the camera frame’s center. If the face orientation exceeds 20 degrees from the frontal position, the horizontal limits shift proportionally based on the sine of the face orientation. Updates related to face position involve calculating the distance between the face bounding box center and the frame center. If this distance exceeds one-sixth of the total frame dimension, the right and left limits are adjusted. The adjustment is determined by normalizing the distance between the face and frame centers between 0 and 0.5. If the face shifts to the right, the distance is subtracted from the limits; otherwise, it is added.

The engagement level, derived from the gaze direction, is within the range [0,1]. It is obtained by subtracting the sum of horizontal and vertical gaze errors from 1. A score of 1 indicates complete focus on the screen, with no gaze exceeding the defined limits. A score of 0 implies no face detection in the current frame. The closer the engagement level is to zero, the more the user surpasses the admissible field of view limits, indicating a lack of focus on the task. Specifically, when the gaze exceeds the limits, horizontal and vertical gaze errors are calculated as the absolute difference between the estimated gaze direction and the corresponding limits.

4.3. Engagement Level Estimation

To obtain the final detected engagement level, we combined the predictions from the Face Engagement Model and the Gaze Direction Model using a linear interpolation formula:

$$\alpha \cdot Engagement_{Face} + (1 - \alpha) \cdot Engagement_{Gaze} \qquad (3)$$

where α is a learnable parameter used to weigh the importance of the models’ predictions. In addition, to correctly apply this formula, the prediction of the Face Engagement Model needs to be converted from labels to a value within the range [0,1]. The conversion is performed according to the rules displayed in Table 3.

Table 3: Conversion rules from engagement level labels to scores and vice versa.

| Engagement Label | Engagement Score |
|---|---|
| 0 | 0.1 |
| 1 | 0.35 |
| 2 | 0.65 |
| 3 | 0.9 |
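The fusion of Eq. (3), together with the label-to-score conversion of Table 3, reduces to a few lines; the names below are illustrative.

```python
LABEL_TO_SCORE = {0: 0.1, 1: 0.35, 2: 0.65, 3: 0.9}  # Table 3

def combine_engagement(face_label: int, gaze_score: float, alpha: float = 0.5) -> float:
    """Eq. (3): weighted interpolation of the two pipelines' predictions, in [0, 1]."""
    face_score = LABEL_TO_SCORE[face_label]            # convert the predicted class to a score
    return alpha * face_score + (1.0 - alpha) * gaze_score

# Example: a "high engagement" face prediction combined with a gaze score of 0.8.
print(combine_engagement(face_label=2, gaze_score=0.8))  # 0.725 with alpha = 0.5
```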
5. Results

To evaluate the accuracy of our system, we measured the disparity between the predicted engagement scores and the ground truth values using two regression metrics: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). To facilitate the application of these metrics, we converted the engagement level labels associated with the samples in our customized dataset using the conversion rules outlined in Table 3. This transformation effectively turned the multi-label classification problem, designed for the DAiSEE dataset, into a regression problem.

During training, we experimented with different values for the parameter α in Eq. (3) to maximize the system’s accuracy. As illustrated in Figs. 5 and 6, the lowest error for both MAE and MAPE occurred when α was set to 0.5. This indicates that both predictions from the Face Engagement Model and the Gaze Direction Model carry equal importance and are essential for achieving accurate predictions.

Analysis of the scenarios where α is 0 (using only the Face Model) or 1 (using only the Gaze Model) reveals significantly higher errors in both performance metrics. Independently, these predictions struggle to accurately gauge the user’s engagement level. With α initialized to 0.5, our system achieved an accuracy of approximately 58% (57.7%), closely aligning with the performance of state-of-the-art works in engagement level detection discussed in Section 2 that work with the original version of the DAiSEE.

Figure 5: Trend of the Mean Absolute Error (MAE) with varying values of the parameter α in Eq. (3).

Figure 6: Trend of the Mean Absolute Percentage Error (MAPE) with varying values of the parameter α in Eq. (3).
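For reference, the MAE/MAPE evaluation and the sweep over α described above could be reproduced along these lines, assuming the predictions of the two pipelines and the ground-truth labels are available as lists; this is an illustrative sketch, not the authors’ evaluation code.

```python
import numpy as np

LABEL_TO_SCORE = {0: 0.1, 1: 0.35, 2: 0.65, 3: 0.9}  # Table 3

def mae(pred, true):
    return float(np.mean(np.abs(pred - true)))

def mape(pred, true):
    return float(np.mean(np.abs((pred - true) / true))) * 100.0

def sweep_alpha(face_labels, gaze_scores, true_labels, alphas=np.linspace(0, 1, 11)):
    """Return the MAE and MAPE of the fused prediction for each candidate alpha."""
    face = np.array([LABEL_TO_SCORE[l] for l in face_labels])
    gaze = np.asarray(gaze_scores, dtype=float)
    true = np.array([LABEL_TO_SCORE[l] for l in true_labels])  # labels become regression targets
    results = {}
    for a in alphas:
        fused = a * face + (1.0 - a) * gaze                     # Eq. (3)
        results[round(float(a), 2)] = (mae(fused, true), mape(fused, true))
    return results
```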
6. Conclusions

Our work introduces a novel approach to engagement level estimation by integrating two distinct machine learning pipelines focused on analyzing facial expressions and gaze direction. Noteworthy is our real-time application’s emphasis on cost-effectiveness and accessibility, achieved through the utilization of a single RGB camera, fast and lightweight machine learning algorithms, and computationally efficient computer vision techniques.

In terms of system training, we customized the DAiSEE dataset to optimize memory usage, reduce class imbalance, mitigate bias introduced by repeated instances of the same individuals, and focus exclusively on facial cropping to eliminate potential background-related biases. The achieved results underscore the potential of our system as a robust foundation, offering a secure benchmark for the development of innovative applications integrating automatic user engagement recognition, thereby dynamically adapting to user interactions. This not only enhances overall usability but also heralds a new era in application interfaces, promising heightened levels of user experience and interaction.

Looking forward, future improvements to our system can be directed towards enhancing the accuracy, robustness, and generalization capabilities by expanding the dataset’s dimensions. This expansion may involve incorporating data from a more diverse group, encompassing individuals with varying demographic characteristics, cultural backgrounds, and engagement patterns. Also, exploring attention estimation in multi-face contexts, where multiple individuals are present simultaneously, represents another intriguing avenue for future research. Lastly, a significant refinement to our application involves substituting the CNN layers in the Face Detection Model with Vision Transformers (ViTs) [41], known for their excellence in image manipulation and long-range dependency modeling compared to traditional convolutional layers. This substitution could enhance the precision of engagement level estimation from facial expressions, as different facial regions can be effectively combined at the same time.
References

[1] G. Capizzi, G. L. Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271–279. doi:10.1016/j.neunet.2020.06.001.
[2] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.
[3] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection, 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[4] G. Capizzi, G. L. Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.
[5] C. Napoli, G. Pappalardo, E. Tramontana, R. K. Nowicki, J. T. Starczewski, M. Woźniak, Toward work groups classification based on probabilistic neural network approach, volume 9119, 2015, pp. 79–89. doi:10.1007/978-3-319-19324-3_8.
[6] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, volume 3118, 2021, pp. 71–76.
[7] A. Gupta, A. D’Cunha, K. Awasthi, V. Balasubramanian, Daisee: Towards user engagement recognition in the wild, arXiv preprint arXiv:1609.01885 (2016).
[8] Z. Wan, J. He, A. Voisine, An attention level monitoring and alarming system for the driver fatigue in the pervasive environment, in: Brain and Health Informatics: International Conference, BHI 2013, Maebashi, Japan, October 29-31, 2013. Proceedings, Springer, 2013, pp. 287–296.
[9] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[10] C.-M. Chen, J.-Y. Wang, C.-M. Yu, Assessing the attention levels of students by using a novel attention aware system based on brainwave signals, British Journal of Educational Technology 48 (2017) 348–369.
[11] S. Di Palma, A. Tonacci, A. Narzisi, C. Domenici, G. Pioggia, F. Muratori, L. Billeci, M. S. Group, et al., Monitoring of autonomic response to sociocognitive tasks during treatment in children with autism spectrum disorders by wearable technologies: A feasibility study, Computers in Biology and Medicine 85 (2017) 143–152.
[12] O. Dehzangi, C. Williams, Towards multi-modal wearable driver monitoring: Impact of road condition on driver distraction, in: 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN), IEEE, 2015, pp. 1–6.
[13] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information 14 (2023) 644.
[14] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote emdr therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[15] P. Kaur, K. Krishan, S. K. Sharma, T. Kanchan, Facial-recognition algorithms: A literature review, Medicine, Science and the Law 60 (2020) 131–139.
[16] G. De Magistris, M. Romano, J. Starczewski, C. Napoli, A novel dwt-based encoder for human pose estimation, volume 3360, 2022, pp. 33–40.
[17] M. Dewan, M. Murshed, F. Lin, Engagement detection in online learning: a review, Smart Learning Environments 6 (2019) 1–20.
[18] M. Murshed, M. A. A. Dewan, F. Lin, D. Wen, Engagement detection in e-learning environments using convolutional neural networks, in: 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), IEEE, 2019, pp. 80–86.
[19] J. T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: The all convolutional net, arXiv preprint arXiv:1412.6806 (2014).
[20] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400 (2013).
[21] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[22] S. Albawi, T. A. Mohammed, S. Al-Zawi, Understanding of a convolutional neural network, in: 2017 International Conference on Engineering and Technology (ICET), IEEE, 2017, pp. 1–6.
[23] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, J. R. Movellan, The faces of engagement: Automatic recognition of student engagement from facial expressions, IEEE Transactions on Affective Computing 5 (2014) 86–98.
[24] G. Littlewort, J. Whitehill, T. Wu, I. Fasel, M. Frank, J. Movellan, M. Bartlett, Computer expression recognition toolbox, Proc. Automatic Face and Gesture Recognition (FG’11) 20 (2011) 24–25.
[25] P. Viola, M. Jones, et al., Robust real-time object detection, International Journal of Computer Vision 4 (2001) 4.
[26] F. Del Duchetto, P. Baxter, M. Hanheide, Are you still with me? continuous engagement assessment from a robot’s point of view, Frontiers in Robotics and AI 7 (2020) 116.
[27] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[28] J. Liao, Y. Liang, J. Pan, Deep facial spatiotemporal network for engagement prediction in online learning, Applied Intelligence 51 (2021) 6609–6621.
[29] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022) 1–10.
[30] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[31] Z. Hu, C. Lv, P. Hang, C. Huang, Y. Xing, Data-driven estimation of driver attention using calibration-free eye gaze and scene features, IEEE Transactions on Industrial Electronics 69 (2021) 1800–1808.
[32] A. Palazzi, D. Abati, F. Solera, R. Cucchiara, et al., Predicting the driver’s focus of attention: the dr(eye)ve project, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 1720–1733.
[33] E. Cengil, A. Çınar, E. Özbay, Image classification with caffe deep learning framework, in: 2017 International Conference on Computer Science and Engineering (UBMK), IEEE, 2017, pp. 440–444.
[34] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, M. Cifrek, A brief introduction to opencv, in: 2012 Proceedings of the 35th International Convention MIPRO, IEEE, 2012, pp. 1725–1730.
[35] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[36] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[38] L. N. Smith, N. Topin, Super-convergence: Very fast training of neural networks using large learning rates, in: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, SPIE, 2019, pp. 369–386.
[39] D. E. King, Dlib-ml: A machine learning toolkit, The Journal of Machine Learning Research 10 (2009) 1755–1758.
[40] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics 9 (1979) 62–66.
[41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).