=Paper=
{{Paper
|id=Vol-3695/p10
|storemode=property
|title=A Machine Learning based Real-Time Application for Engagement Detection
|pdfUrl=https://ceur-ws.org/Vol-3695/p10.pdf
|volume=Vol-3695
|authors=Emanuele Iacobelli,Samuele Russo,Christian Napoli
|dblpUrl=https://dblp.org/rec/conf/system/IacobelliR023
}}
==A Machine Learning based Real-Time Application for Engagement Detection==
Emanuele Iacobelli¹, Samuele Russo² and Christian Napoli¹,³,⁴
¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
² Department of Psychology, Sapienza University of Rome, 00185 Roma, Italy
³ Institute for Systems Analysis and Computer Science, Italian National Research Council, 00185 Roma, Italy
⁴ Department of Computational Intelligence, Czestochowa University of Technology, 42-201 Czestochowa, Poland
Abstract
The study of human engagement has significantly grown in recent years, particularly accelerated by the interaction with a
growing number of smart computing machines [1, 2, 3]. Engagement estimation has significant importance across various
domains of study, including advertising, marketing, human-computer interaction, and healthcare [4, 5, 6]. In this paper, we
propose a real-time application that leverages a single RGB camera to capture user behavior. Our approach implements
a novel method for estimating human engagement in real-world scenarios by extracting valuable information from the
combination of facial expressions and gaze direction analysis. To acquire this data, we employed fast and accurate machine
learning algorithms from the external library dlib, along with custom versions of Residual Neural Networks implemented
from scratch. For training our models, we used a modified version of the DAiSEE dataset, a multi-label user affective states
classification dataset that collects frontal videos of 112 different people recorded in real-world scenarios. In the absence
of a baseline for comparing the results obtained by our application, we conducted experiments to assess its robustness in
estimating engagement levels, leading to very encouraging results.
Keywords
Engagement Detection, Eye Tracking, Face Expression Recognition, Machine Learning, Residual Neural Networks
SYSTEM 2023: 9th Scholar’s Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
Contact: iacobelli@diag.uniroma1.it (E. Iacobelli); samuele.russo@uniroma1.it (S. Russo); cnapoli@diag.uniroma1.it (C. Napoli)
ORCID: 0009-0003-1379-9106 (E. Iacobelli); 0000-0002-1846-9996 (S. Russo); 0000-0002-3336-5853 (C. Napoli)

1. Introduction

In today’s rapidly evolving digital landscape, humanity interacts with a growing number of smart computing machines. This situation highlights the increasing trend of direct interactions with smart devices in various domains, including household assistance, customer service, and industrial applications. Despite this technological advancement, many devices lack algorithms capable of perceiving and responding to users’ attentional states. Traditional user interfaces still heavily rely on explicit input or predefined triggers, resulting in often inefficient and mechanical interactions.

The potential for automatic acquisition and interpretation of users’ engagement represents a huge usability improvement for Human-Computer Interaction (HCI) and Human-Robot Interaction (HRI) systems. This capability holds the promise of ushering in more advanced and intuitive interactions, elevating system responsiveness, and enhancing overall user experience. In detail, engagement is a fundamental aspect of the human experience and captures in depth the quality of an individual’s involvement, focus, and interaction with their surroundings. For detecting it, facial expressions and gaze direction are crucial elements. In particular, the motion of the eyes is an important element to employ since it highlights the psychological mechanisms behind the human mind and naturally gravitates toward objects, people, or specific regions of interest in the environment.

In this paper, we propose a real-time application that combines gaze direction and face expression analysis to determine the engagement level of a person while interacting with intelligent systems. To achieve this, we defined two machine-learning pipelines leveraging RGB videos of a person interacting with the system. The first pipeline focuses on the analysis of the user’s facial expressions and employs a residual neural network architecture. The second pipeline concentrates on the estimation of the user’s gaze direction by combining pre-trained face and facial landmark detection models with a fast computer vision algorithm that we developed. Predictions of the user’s engagement level are ultimately calculated by merging the outputs of these two pipelines using a weighted linear interpolation formula.

Addressing the challenge posed by the absence of a baseline for reference, our primary hurdle in handling this task involved creating an appropriate dataset for training our models. We opted to customize the Affective States in E-Environment Dataset (DAiSEE) [7], a comprehensive collection of multi-label videos designed for identifying user affective states. Given that the estimation of
engagement levels necessitates both temporal and spatial information, videos proved to be an ideal choice. However, to mitigate the high computational and resource costs associated with treating videos as opposed to single images, we implemented mandatory preprocessing steps to optimize memory and computational efficiency.

Continuing to tackle the absence of a reference baseline, we conducted experiments to assess the robustness and effectiveness of our application. This evaluation was carried out using quantitative metrics.

1.1. Roadmap

This paper is organized in the following way: first of all, a summary of the state-of-the-art systems and techniques to recognize the human engagement level is presented (see Section 2). Subsequently, a description of the dataset that we have developed for training our models is illustrated (see Section 3). Following this, a detailed overview of the architectures employed for our application is provided (see Section 4). Then, the results obtained by testing our system considering the quantitative metrics are presented (see Section 5). Finally, we summarize the article’s content and outline the possible viable improvements that can be made to our application (see Section 6).

2. Related Works

The field of engagement level detection has seen significant growth, particularly fueled by the global pandemic. With many individuals compelled to participate in remote meetings, analyzing engagement in online sessions has become a pivotal focus, leading to the development of numerous systems. Some studies have explored physiological factors like fatigue [8], brain status and data [9, 10], blood flow and heart rate [11], and galvanic skin conductance [12]. However, due to the recent needs and the remote nature of this task, there has been widespread exploration of inexpensive and unobtrusive technologies. Eye trackers [13, 14] and facial expression recognition models [15, 16] using simple RGB cameras are now among the most promising options.

In a comprehensive review treated in [17], the state-of-the-art engagement detection techniques within the context of online learning are explored. The authors classify existing methods into three primary categories: automatic, semi-automatic, and manual. This classification is based on the methods’ dependencies on learners’ participation. Furthermore, each category is subdivided based on the type of input data used (e.g., audio, video, text). Among these, video-based methods in the automatic category that leverage facial expressions emerge as the most prevalent. These methods are favored for their ease of implementation and their proven effectiveness in achieving accurate results. The prominence of such techniques underscores the significance of visual cues, particularly facial expressions, in gauging user engagement levels during online interactions.

The work presented in [18] investigates the suitability of three popular models, All-CNN [19], NiN-CNN [20], and VD-CNN [21], along with a customized Convolutional Neural Network (CNN) [22], for detecting the engagement level of online learners in educational activities. All the analyzed models leverage facial expressions for scalable and accessible engagement detection. Each of the three base models has its distinct features, and the customized CNN combines these advantageous features, for instance by replacing linear convolutional layers with a multilayer perceptron, increasing depth with small convolutional filters, and replacing some max-pooling layers with convolutional layers with increased stride. All the analyzed models were evaluated on the DAiSEE dataset (extensively explained in Section 3) and the results reveal that the customized CNN outperforms the base models in detecting the engagement level.

In a similar study proposed in [23], the automatic recognition of student engagement from facial expressions is examined using a three-stage pipeline. The initial step involves face registration, detection, and the estimation of key facial landmarks (e.g., eyes, nose, and mouth) by using the approach described in [24]. The second stage employs four binary classifiers to classify the cropped face, distinguishing whether it belongs to one of four engagement levels (l ∈ {1, 2, 3, 4}), where 1 signifies no engagement and 4 represents full focus. The authors compared three models for the binary classifier: Support Vector Machines with Gabor features (SVM (Gabor)) [24], Multinomial Logistic Regression with expression outputs from the Computer Expression Recognition Toolbox (MLR (CERT)) [24], and GentleBoost with Box Filter features (Boost (BF)) [25]. This study reveals that SVM (Gabor) yields the best results. The third stage integrates the outputs of all four binary classifiers, utilizing a Multinomial Logistic Regressor model to estimate the final engagement level.

In [26], the authors introduced a regression model for predicting engagement level as a single scalar value from RGB video streams captured by two cameras on the torso and head of an autonomous mobile robot, utilized for tours at The Collection museum in Lincoln, UK. The model incorporates CNN and Long Short-Term Memory (LSTM) [27] networks for video data analysis. Training and evaluation of this regressor network were conducted using a dataset built from the recordings of the autonomous tour guide robot in the public museum. The dataset, manually annotated by three independent people, assigns scalar values in the range [0,1] to represent the user’s engagement level. The model demonstrates
optimal engagement level predictions, achieving a Mean Squared Error (MSE) prediction loss of up to 0.126 on the test dataset.

The research conducted in [28] focused on investigating the Deep Facial Spatiotemporal Network (DFSTN). Comprising two integral modules, namely the pretrained SE-ResNet-50 (SENet) utilized for extracting facial spatial features and an LSTM network with Global Attention for generating an attentional hidden state, the DFSTN synergistically captures both facial spatial and temporal information. This combined information is crucial for enhancing engagement prediction performance. The model underwent testing on the DAiSEE dataset, achieving an accuracy of 58.84%, showcasing its capability to outperform numerous existing engagement prediction networks trained on the same dataset.

In [29], the estimation of human attention is based on the direction of the user’s face, considering five different directions: central, lateral to the left, lateral to the right, towards up, and towards down. If the user looks in any direction other than the central one, they are assumed to be distracted, with only the central gaze indicating full focus. The authors created a dataset for training, comprising 270 videos of approximately 20 seconds each from 18 different individuals. To enhance data diversity, GAN-based data augmentation techniques were employed to generate new samples, diversifying somatic features in the recorded videos. Transfer Learning [30] was utilized to construct the classifier. Specifically, a pre-trained VGG16 [21] architecture was employed, with three additional dense layers attached at the end for attention estimation.

The approach presented in [31] offers a novel method for estimating driver attention. Departing from conventional methods that primarily focus on a single frontal scene image to analyze driver gaze or head pose, this method introduces a dual-view scene. The additional input data includes the frontal view of the car that the driver is observing. Specifically, the gaze direction is detected and transformed into a probability map of the same size as the road view image, while salient features of temporal and spatial dimensions are extracted from the road view images. These features are then combined and fed into a multi-resolution neural network tasked with driver attention estimation, generating a heat map on the images representing the road. The training dataset for this model is constructed using virtual reality and a driving simulator, incorporating images from the DR(eye)VE dataset [32] that depict the frontal view of the road observed by the driver. Experimental results showcase the feasibility and superiority of the proposed method over existing approaches.

Figure 1: Some sample instances present in our customized version of the DAiSEE before converting them to grayscale and applying histogram equalization. From left to right: a) very low engagement, b) low engagement, c) high engagement, and d) very high engagement.

3. Dataset

The baseline dataset utilized for training our networks is a customized version of the Dataset for Affective States in E-Environments (DAiSEE), a large collection of multi-label videos designed for identifying user affective states, including boredom, confusion, engagement, and frustration in real-world scenarios. This dataset comprises 9068 frontal view videos featuring 112 distinct individuals expressing different levels of affective states. Each of these states was manually ranked utilizing the following scale: very low, low, high, and very high.

To create our customized dataset we initially modified the task from which DAiSEE was originally built. We switched from multi-label to multi-class classification, associating only the level of engagement with each video and removing the labels for the other affective states. Example instances present inside our customized version of the DAiSEE are displayed in Fig. 1. Subsequently, we divided the dataset into Training, Validation, and Test sets, with proportions of 60%, 20%, and 20%, respectively. However, the resulting sets were highly unbalanced due to a small portion of videos classified as very low and low engagement. To address this issue, we downsampled the dataset in several ways to achieve a more balanced distribution. First of all, redundancy in subjects was reduced by removing multiple videos of the same individuals. Then, through the use of a normal distribution, we sampled the remaining data instances considering the frequency of labels in the videos with the following formula:

$$n_i = f_i \cdot \frac{n_{tot} - f_i}{n_{tot}} \cdot \lambda \qquad (1)$$
where λ is the reduction coefficient (that we have set to 0.25), n_tot represents the total number of samples in a given set, and f_i denotes the frequency of label i in that set. Table 1 displays information both before and after the preprocessing procedures on the dataset.

Table 1: Number of sample instances before and after the customization of the Training, Validation, and Test sets derived from the DAiSEE.

| Engagement Level | Original Training | Original Validation | Original Test | Customized Training | Customized Validation | Customized Test |
|---|---|---|---|---|---|---|
| Very Low | 34 | 23 | 4 | 34 | 23 | 4 |
| Low | 214 | 160 | 81 | 52 | 37 | 17 |
| High | 2649 | 912 | 861 | 341 | 110 | 105 |
| Very High | 2585 | 625 | 777 | 344 | 100 | 102 |
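To make the sampling rule of Eq. (1) concrete, the following minimal Python sketch computes the per-label target counts and subsamples each label accordingly; the function and variable names are illustrative, and uniform sampling stands in for the normal-distribution sampling described above.

```python
import random

def downsample_split(videos_by_label, reduction=0.25):
    """Compute per-label target counts with Eq. (1) and subsample each label.

    videos_by_label: dict mapping an engagement label to the list of its videos.
    reduction: the reduction coefficient lambda (set to 0.25 in the paper).
    """
    n_tot = sum(len(v) for v in videos_by_label.values())
    kept = {}
    for label, videos in videos_by_label.items():
        f_i = len(videos)                                     # frequency of label i in the set
        n_i = round(f_i * (n_tot - f_i) / n_tot * reduction)  # Eq. (1)
        n_i = max(1, min(n_i, f_i))                           # keep the target count feasible
        # Uniform sampling here is a stand-in for the normal-distribution sampling in the text.
        kept[label] = random.sample(videos, n_i)
    return kept
```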
Since DAiSEE contains recordings captured in dynamic environments, each of these videos may present different and various disturbances, such as changing light conditions, visual occlusions, or unconstrained user motion. To improve video quality, we applied manual color and intensity adjustments, focusing on enhancing contrast, brightness, and sharpness for optimal detail resolution. Examples include adjustments to the gamma value, which effectively improves visibility in varying light conditions or exposure levels by normalizing image histograms, making videos more suitable for continuous analysis. Another example is the sharpness adjustments, which enhance fine details and edges, making facial features more prominent.

Despite these modifications, the dataset still demanded excessive memory requirements. Consequently, we opted for further adjustments. Considering that the majority of engagement information is likely derived from human expressions and gaze attention, with a smaller contribution from gestures, we decided to crop from each video only the user’s faces. This step also aimed to eliminate potential issues and biases arising from background data. The face cropping was automated using a pre-trained Single Shot Multibox Detector (SSD) model from the Caffe framework [33].
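As an illustration of this step, a Caffe SSD face detector can be loaded through OpenCV’s dnn module as sketched below; the prototxt and caffemodel file names are the ones commonly distributed with OpenCV’s ResNet-10 SSD face detector and are assumptions, since the paper does not name the exact files.

```python
import cv2
import numpy as np

# Assumed file names for a Caffe SSD face detector; the exact model files used
# by the authors are not specified in the paper.
PROTO = "deploy.prototxt"
WEIGHTS = "res10_300x300_ssd_iter_140000.caffemodel"
net = cv2.dnn.readNetFromCaffe(PROTO, WEIGHTS)

def detect_face(frame, conf_threshold=0.5):
    """Return the bounding box (x1, y1, x2, y2) of the most confident face, or None."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()          # shape: (1, 1, N, 7) -> [.., confidence, x1, y1, x2, y2]
    best = max(range(detections.shape[2]), key=lambda i: detections[0, 0, i, 2])
    if detections[0, 0, best, 2] < conf_threshold:
        return None
    box = detections[0, 0, best, 3:7] * np.array([w, h, w, h])
    return box.astype(int)
```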
To prevent the generation of unstable videos, we applied a stabilization algorithm (see pseudocode in Fig. 2) that facilitates smooth transitions between subsequently detected faces by stabilizing the position of their bounding boxes. At the start of each video, the size of the first detected face’s bounding box is stored. In all the following frames, this dimension is used to resize the bounding box of the subsequently detected faces. Additionally, if the distance between the centers of two consecutive detected faces is smaller than a manually adjusted threshold γ, the center of the newest detected face is replaced with the center of the bounding box of the previously detected face. Finally, each frame is converted to grayscale, and histogram equalization is applied to normalize the color information.

Figure 2: Pseudocode of the face stabilization algorithm used to prevent unstable videos while cropping the user’s faces from the original video in the DAiSEE.
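Because the pseudocode of Fig. 2 is not reproduced here, the following sketch re-implements the stabilization step as it is described above, under the assumption that the detector (for instance the SSD sketch given earlier) returns boxes as (x1, y1, x2, y2); the threshold value and helper names are illustrative.

```python
import cv2

def stabilize_and_crop(frames, detect_face, gamma_threshold=20):
    """Crop faces with a fixed box size and a sticky center, as described in the text."""
    ref_size, prev_center = None, None
    crops = []
    for frame in frames:
        box = detect_face(frame)
        if box is None:
            continue
        x1, y1, x2, y2 = box
        center = ((x1 + x2) // 2, (y1 + y2) // 2)
        if ref_size is None:
            ref_size = (x2 - x1, y2 - y1)          # size of the first detected face
        if prev_center is not None:
            dist = ((center[0] - prev_center[0]) ** 2 +
                    (center[1] - prev_center[1]) ** 2) ** 0.5
            if dist < gamma_threshold:             # small jitter: reuse the previous center
                center = prev_center
        prev_center = center
        w, h = ref_size
        x, y = center[0] - w // 2, center[1] - h // 2
        face = frame[max(y, 0):y + h, max(x, 0):x + w]
        gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
        crops.append(cv2.equalizeHist(gray))       # grayscale + histogram equalization
    return crops
```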
Figure 3: Full pipeline of the real-time application. The Webcam Reader Module acquires the data in real-time and passes
them to the Face Cropping Model. This model crops the user’s face from the webcam images and passes them both to the
Facial Landmark Detection module and the Frame Buffer which has a capacity of 60 frames. Once the buffer is full, each new
frame is passed to the Gaze Direction Module and the Face Engagement Model. The predictions of these models are then
combined to produce the actual output of our system.
4. Methodology

The complete architecture of the real-time application we developed is illustrated in Fig. 3. Specifically, the system utilizes a single input video stream, captured through a webcam reader module implemented with the external library OpenCV [34], to feed two distinct models. The Face Engagement Model evaluates engagement based on facial expressions, while the Gaze Direction Model predicts engagement by analyzing the user’s focal point. Lastly, the predictions of these two models are combined to derive the final engagement level estimated by our application.

4.1. Face Engagement Model

This model is designed to estimate the user engagement level from frontal recording videos. We designed it as a customized version of the ResNet architecture [35] and we implemented different versions to identify the most effective one. In essence, a residual network employs skip connections to address the vanishing gradient problem. These connections allow information to directly backpropagate, circumventing previous layers. Moreover, a skip connection facilitates a residual block in learning the residual, which is the difference between the desired output and the current input of the layer. This approach makes it easier for the network to understand what input modifications are needed to achieve the desired output, rather than altering the entire input from scratch. This often translates to a more straightforward learning process for the network.

To address the human engagement level classification problem, our models needed to capture both spatial and temporal information. To enable the network to learn temporal information by analyzing multiple frames simultaneously in the same layer, we opted for 3D convolutional layers instead of the traditional 2D convolutional layers implemented in the original ResNet architecture. Learning temporal information is crucial for video analysis, as it allows the network to recognize complex patterns such as actions, gestures, or sequences of facial expressions. Due to this requirement, the model necessitates an initial period to populate a buffer of 60 frames, ensuring a sufficient amount of data for the correct utilization of the 3D convolutional layers. Once the buffer reaches its capacity, the prediction of the engagement level can begin. Subsequently, with the arrival of each new frame, the buffer is updated, and the oldest frame is discarded. We tested three versions of this architecture, differing mainly in the depth and the internal structure of the convolutional block used. Specifically, we implemented the 18-, 34-, and 50-layer versions.
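A minimal sketch of the rolling 60-frame buffer that feeds the 3D-convolutional network is shown below; the deque-based implementation and the tensor layout are assumptions rather than the authors’ code.

```python
from collections import deque
import numpy as np

BUFFER_SIZE = 60  # the model starts predicting only once 60 frames are available

frame_buffer = deque(maxlen=BUFFER_SIZE)  # appending beyond 60 discards the oldest frame

def push_frame(gray_face, model=None):
    """Add a preprocessed face crop to the buffer and run a prediction when it is full."""
    frame_buffer.append(gray_face)
    if len(frame_buffer) < BUFFER_SIZE:
        return None                      # still filling the buffer: no prediction yet
    clip = np.stack(frame_buffer)        # (60, H, W) stack of grayscale face crops
    clip = clip[None, None].astype(np.float32) / 255.0  # (batch, channel, depth, H, W)
    return model(clip) if model is not None else clip
```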
For all these architectures, we introduced 3D layers for batch normalization, max pooling, and average pooling. In detail, each convolutional block includes a batch normalization layer, and all convolutional layers employ the ReLU activation function. Only the last fully connected layer, responsible for the final prediction of the human’s engagement level, uses the Softmax activation function. During training, we utilized the He/Kaiming initialization technique [36], which initializes weights using a normal distribution with zero mean and a variance of 2/n, where n is the total number of inputs to the neuron. This initialization is specifically tailored for networks employing the ReLU activation function, mitigating the vanishing or exploding gradient problem.

Additionally, we employed the Focal Loss [37] as the training function, opting for it over the conventional Categorical Cross-Entropy. The principal reason is that the Focal Loss addresses the issue of unbalanced data by prioritizing examples the model struggles with, rather than those it confidently predicts. This ensures continuous improvement on challenging examples, preventing the model from becoming overly confident with easy ones. We implemented the following Focal Loss formula:

$$-(1 - p_i)^{\gamma} \ln(p_i) \qquad (2)$$

where γ represents the focusing parameter (typically a positive number) to be fine-tuned using cross-validation, and p_i denotes the predicted probability of the correct class.
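A possible PyTorch rendering of Eq. (2) as a multi-class focal loss is sketched below; the class name and the default γ = 2 are illustrative choices, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: -(1 - p_i)^gamma * log(p_i), as in Eq. (2)."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma  # focusing parameter, to be tuned via cross-validation

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        log_p = F.log_softmax(logits, dim=1)                        # log-probabilities
        log_p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log p_i of the true class
        p_t = log_p_t.exp()
        loss = -((1.0 - p_t) ** self.gamma) * log_p_t               # down-weights easy examples
        return loss.mean()
```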
Also, our training process incorporates early stopping with a patience value of 10 epochs, L2 regularization featuring a weight decay set to 1e-3, and an Adam Optimizer accompanied by a Learning Rate Scheduler [38] with a maximum learning rate of 1e-4 and a Gradient Scaler to reduce the range of magnitudes in the gradients. All the implementation details of the tested models are reported in Table 2. Following the training phase, we opted for the 50-layer version, with a batch size equal to 16, as the engagement network for our application, as it demonstrated the highest accuracy among the tested versions.

Table 2: Implementation details of the three architectures tested for the Face Engagement Model (k = kernel size, f = number of filters, s = stride; when the stride is not given, it is equal to 1).

| Layer Name | 18-layer | 34-layer | 50-layer |
|---|---|---|---|
| Convolution | k=[7,7,7], f=64, s=2 | k=[7,7,7], f=64, s=2 | k=[7,7,7], f=64, s=2 |
| Max Pool | k=[3,3,3], s=2 | k=[3,3,3], s=2 | k=[3,3,3], s=2 |
| Convolution Block | (k=[3,3,3], f=64; k=[3,3,3], f=64) ×2 | (k=[3,3,3], f=64; k=[3,3,3], f=64) ×3 | (k=[1,1,1], f=64; k=[3,3,3], f=64; k=[1,1,1], f=256) ×3 |
| Convolution Block | (k=[3,3,3], f=128; k=[3,3,3], f=128) ×2 | (k=[3,3,3], f=128; k=[3,3,3], f=128) ×4 | (k=[1,1,1], f=128; k=[3,3,3], f=128; k=[1,1,1], f=512) ×4 |
| Convolution Block | (k=[3,3,3], f=256; k=[3,3,3], f=256) ×2 | (k=[3,3,3], f=256; k=[3,3,3], f=256) ×6 | (k=[1,1,1], f=256; k=[3,3,3], f=256; k=[1,1,1], f=1024) ×6 |
| Convolution Block | (k=[3,3,3], f=512; k=[3,3,3], f=512) ×2 | (k=[3,3,3], f=512; k=[3,3,3], f=512) ×3 | (k=[1,1,1], f=512; k=[3,3,3], f=512; k=[1,1,1], f=2048) ×3 |
| Average Pool | Output size = 1×1×1 | Output size = 1×1×1 | Output size = 1×1×1 |
| Dropout | Rate = 0.4 | Rate = 0.4 | Rate = 0.4 |
| Linear | Neurons = 1024 | Neurons = 1024 | Neurons = 1024 |
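The training setup summarized before Table 2 could be wired together roughly as follows; PyTorch’s OneCycleLR and GradScaler are assumed stand-ins for the Learning Rate Scheduler [38] and the Gradient Scaler mentioned in the text, and the model and data below are placeholders, not the 3D ResNet itself.

```python
import torch
import torch.nn as nn

# Placeholder network and synthetic data standing in for the 3D ResNet and the video clips.
model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 32 * 32, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)  # L2 regularization
steps_per_epoch, epochs = 10, 3
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, steps_per_epoch=steps_per_epoch, epochs=epochs)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
criterion = nn.CrossEntropyLoss()  # the paper uses the focal loss sketched earlier

best_val, patience, bad_epochs = float("inf"), 10, 0   # early stopping bookkeeping
for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        clips = torch.randn(16, 1, 60, 32, 32)          # batch size 16, 60-frame clips
        labels = torch.randint(0, 4, (16,))
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        scaler.scale(loss).backward()                   # scaled gradients
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
    val_loss = loss.item()                              # stand-in for a real validation pass
    bad_epochs = 0 if val_loss < best_val else bad_epochs + 1
    best_val = min(best_val, val_loss)
    if bad_epochs >= patience:                          # early stopping with patience = 10
        break
```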
Figure 4: Gaze Direction Model Workflow: The input image, captured by the webcam, undergoes processing through the Face Cropping Module. This module is responsible for cropping the detected face, and the resulting image is then fed into the Face Landmark Detector. The Face Landmark Detector estimates the position of facial keypoints, which are subsequently utilized to crop the eye regions. Each eye image undergoes further analysis in the Gaze Direction Module, which assesses both horizontal and vertical directions of the gaze.

4.2. Gaze Direction Model

This model is designed to extract attention information from a person’s gaze in frontal recording videos. The gaze direction provides valuable insights into a person’s engagement during a task. The complete workflow of this model is displayed in Fig. 4. To implement this model, we combined two pre-trained neural networks available in the dlib library [39].

The first network is the Face Cropping Model, a CNN trained for face detection in general images. It not only identifies faces but also provides their bounding box coordinates and converts the input image to grayscale. Although the use of this network may appear redundant considering the customized dataset that we have employed for training the engagement model, it plays a crucial role in the real-time application. Specifically, it crops faces from the live stream frames and passes these images to both the Engagement Model and the Facial Landmark Detector.
The Facial Landmark Detector, the second network that we have employed from the dlib library, recognizes 68 2D facial landmarks (e.g., nose tip, corners of the mouth, and eyes) in a given face image. These facial landmarks serve two purposes: they are used to crop the eye regions based on the eye landmarks and to calculate the face orientation with respect to the vertical axis (yaw angle). This orientation is determined through the use of a vector starting from the midpoint between the eyes and terminating at the nose tip.
of a vector starting from the midpoint between the eyes 3 0.9
and terminating at the nose tip.
Estimating the focal point of the user is accomplished
through the Gaze Direction Module, a simple computer
on the sine of the face orientation. Updates related to
vision pipeline. Initially, the eye landmarks outlining
face position involve calculating the distance between
the eye contours are employed to create a mask that re-
the face bounding box center and the frame center. If this
moves extraneous pixels from each cropped eye image.
distance exceeds one-sixth of the total frame dimension,
Subsequently, the Otsu’s method [40] is applied to auto-
the right and left limits are adjusted. The adjustment is
matically threshold the image, distinguishing between
determined by normalizing the distance between the face
foreground (iris and pupil pixels) and background (sclera
and frame centers between 0 and 0.5. If the face shifts
pixels).
The resulting image is then horizontally and vertically divided around its center to estimate the gaze direction. Both vertical and horizontal gaze directions are quantified as values within the range of [-1,1]. Regarding horizontal gaze direction, a value approaching -1 indicates the user is looking to the left, while a value approaching 1 suggests a rightward gaze. A value around 0 indicates the user is looking at the center of the screen. Similarly, for vertical gaze direction, a value nearing 1 signifies a downward gaze and a value nearing -1 indicates an upward gaze.

To compute these directions, the density of white pixels representing the sclera is analyzed. For each eye image, the total number of white pixels is calculated. If this value is zero, it implies incorrect eye detection, and the current frame is skipped. Otherwise, for each sub-image generated, the percentage of white pixels in relation to the total number of white pixels in the corresponding original eye image is calculated. Then, the percentages belonging to the same direction of both eyes are averaged (e.g., the percentage of white pixels in the left sub-image of the left eye is averaged with the percentage of white pixels in the left sub-image of the right eye). Finally, the difference between these averages produces the value within the range of [-1,1] described earlier.
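The masking, Otsu thresholding, and white-pixel-ratio comparison described in the last paragraphs can be sketched as follows; the splitting and sign conventions are assumptions that would need to be calibrated to match the -1/+1 mapping used in the text.

```python
import cv2
import numpy as np

def eye_gaze_direction(gray_face, eye_points):
    """Return (horizontal, vertical) gaze values in [-1, 1] for one eye, or None."""
    pts = np.array(eye_points, dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    eye = gray_face[y:y + h, x:x + w]
    mask = np.zeros_like(eye)
    cv2.fillPoly(mask, [(pts - [x, y]).astype(np.int32)], 255)  # keep pixels inside the eye contour
    eye = cv2.bitwise_and(eye, mask)
    _, thresh = cv2.threshold(eye, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    total_white = cv2.countNonZero(thresh)
    if total_white == 0:                                  # eye not detected correctly: skip frame
        return None
    h2, w2 = thresh.shape[0] // 2, thresh.shape[1] // 2
    left = cv2.countNonZero(thresh[:, :w2]) / total_white   # white-pixel ratio per sub-image
    right = cv2.countNonZero(thresh[:, w2:]) / total_white
    top = cv2.countNonZero(thresh[:h2, :]) / total_white
    bottom = cv2.countNonZero(thresh[h2:, :]) / total_white
    # The sign of these differences can be flipped to match the -1 (left) / +1 (right)
    # and +1 (down) / -1 (up) convention adopted in the text.
    return right - left, top - bottom

# The per-eye values of both eyes are then averaged direction-by-direction, as in the text.
```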
To effectively use the estimated gaze direction, it is crucial to consider the limits of the user’s field of view, which may vary based on the task. In our screen-based task implementation, we assume that a face orientation deviation exceeding 30 degrees from the camera-aligned orientation indicates the user is no longer looking at the monitor.

Initially, these limits are set at the task’s beginning and dynamically adjusted based on the user’s face position and orientation relative to the camera frame’s center. If the face orientation exceeds 20 degrees from the frontal position, the horizontal limits shift proportionally based on the sine of the face orientation. Updates related to face position involve calculating the distance between the face bounding box center and the frame center. If this distance exceeds one-sixth of the total frame dimension, the right and left limits are adjusted. The adjustment is determined by normalizing the distance between the face and frame centers between 0 and 0.5. If the face shifts to the right, the distance is subtracted from the limits; otherwise, it is added.

The engagement level, derived from the gaze direction, is within the range [0,1]. It is obtained by subtracting the sum of horizontal and vertical gaze errors from 1. A score of 1 indicates complete focus on the screen, with no gaze exceeding the defined limits. A score of 0 implies no face detection in the current frame. The closer the engagement level is to zero, the more the user surpasses the admissible field of view limits, indicating a lack of focus on the task. Specifically, when the gaze exceeds the limits, horizontal and vertical gaze errors are calculated as the absolute difference between the estimated gaze direction and the corresponding limits.

4.3. Engagement Level Estimation

To obtain the final detected engagement level, we combined the predictions from the Face Engagement Model and the Gaze Direction Model using a linear interpolation formula:

$$\alpha \cdot Engagement_{Face} + (1 - \alpha) \cdot Engagement_{Gaze} \qquad (3)$$

where α is a learnable parameter used to weigh the importance of the models’ predictions. In addition, to correctly apply this formula, the prediction of the Face Engagement Model needs to be converted from labels to a value within the range [0,1]. The conversion is performed according to the rules displayed in Table 3.

Table 3: Conversion rules from engagement level labels to scores and vice versa.

| Engagement Label | Engagement Score |
|---|---|
| 0 | 0.1 |
| 1 | 0.35 |
| 2 | 0.65 |
| 3 | 0.9 |
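The fusion of Eq. (3), together with the label-to-score conversion of Table 3, reduces to a few lines; the names below are illustrative.

```python
LABEL_TO_SCORE = {0: 0.1, 1: 0.35, 2: 0.65, 3: 0.9}  # Table 3

def combine_engagement(face_label: int, gaze_score: float, alpha: float = 0.5) -> float:
    """Eq. (3): weighted interpolation of the two pipelines' predictions, in [0, 1]."""
    face_score = LABEL_TO_SCORE[face_label]            # convert the predicted class to a score
    return alpha * face_score + (1.0 - alpha) * gaze_score

# Example: a "high engagement" face prediction combined with a gaze score of 0.8.
print(combine_engagement(face_label=2, gaze_score=0.8))  # 0.725 with alpha = 0.5
```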
5. Results

To evaluate the accuracy of our system, we measured the disparity between the predicted engagement scores and the ground truth values using two regression metrics: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). To facilitate the application of these metrics, we converted the engagement level labels associated with the samples in our customized dataset using the conversion rules outlined in Table 3. This transformation effectively turned the multi-label classification problem, designed for the DAiSEE dataset, into a regression problem.

During training, we experimented with different values for the parameter α in Eq. (3) to maximize the system’s accuracy. As illustrated in Figs. 5 and 6, the lowest error for both MAE and MAPE occurred when α was set to 0.5. This indicates that both predictions from the Face Engagement Model and the Gaze Direction Model carry equal importance and are essential for achieving accurate predictions.

Analysis of the scenarios where α is 0 (using only the Face Model) or 1 (using only the Gaze Model) reveals significantly higher errors in both performance metrics. Independently, these predictions struggle to accurately gauge the user’s engagement level. With α initialized to 0.5, our system achieved an accuracy of approximately 58% (57.7%), closely aligning with the performance of state-of-the-art works in engagement level detection discussed in Section 2 that work with the original version of the DAiSEE.

Figure 5: Trend of the Mean Absolute Error (MAE) with varying values of the parameter α in Eq. (3).

Figure 6: Trend of the Mean Absolute Percentage Error (MAPE) with varying values of the parameter α in Eq. (3).
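For reference, the MAE/MAPE evaluation and the sweep over α described above could be reproduced along these lines, assuming the predictions of the two pipelines and the ground-truth labels are available as lists; this is an illustrative sketch, not the authors’ evaluation code.

```python
import numpy as np

LABEL_TO_SCORE = {0: 0.1, 1: 0.35, 2: 0.65, 3: 0.9}  # Table 3

def mae(pred, true):
    return float(np.mean(np.abs(pred - true)))

def mape(pred, true):
    return float(np.mean(np.abs((pred - true) / true))) * 100.0

def sweep_alpha(face_labels, gaze_scores, true_labels, alphas=np.linspace(0, 1, 11)):
    """Return the MAE and MAPE of the fused prediction for each candidate alpha."""
    face = np.array([LABEL_TO_SCORE[l] for l in face_labels])
    gaze = np.asarray(gaze_scores, dtype=float)
    true = np.array([LABEL_TO_SCORE[l] for l in true_labels])  # labels become regression targets
    results = {}
    for a in alphas:
        fused = a * face + (1.0 - a) * gaze                     # Eq. (3)
        results[round(float(a), 2)] = (mae(fused, true), mape(fused, true))
    return results
```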
6. Conclusions

Our work introduces a novel approach to engagement level estimation by integrating two distinct machine learning pipelines focused on analyzing facial expressions and gaze direction. Noteworthy is our real-time application’s emphasis on cost-effectiveness and accessibility, achieved through the utilization of a single RGB camera, fast and lightweight machine learning algorithms, and computationally efficient computer vision techniques.

In terms of system training, we customized the DAiSEE dataset to optimize memory usage, reduce class imbalance, mitigate bias introduced by repeated instances of the same individuals, and focus exclusively on facial cropping to eliminate potential background-related biases. The achieved results underscore the potential of our system as a robust foundation, offering a secure benchmark for the development of innovative applications integrating automatic user engagement recognition, thereby dynamically adapting to user interactions. This not only enhances overall usability but also heralds a new era in application interfaces, promising heightened levels of user experience and interaction.

Looking forward, future improvements to our system can be directed towards enhancing the accuracy, robustness, and generalization capabilities by expanding the dataset’s dimensions. This expansion may involve incorporating data from a more diverse group, encompassing individuals with varying demographic characteristics, cultural backgrounds, and engagement patterns. Also, exploring attention estimation in multi-face contexts, where multiple individuals are present simultaneously, represents another intriguing avenue for future research. Lastly, a significant refinement to our application involves substituting the CNN layers in the Face Detection Model with Vision Transformers (ViTs) [41], known for their excellence in image manipulation and long-range dependency modeling compared to traditional convolutional layers. This substitution could enhance the precision of engagement level estimation from facial expressions, as different facial regions can be effectively combined at the same time.
References

[1] G. Capizzi, G. L. Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271–279. doi:10.1016/j.neunet.2020.06.001.
[2] N. Brandizzi, S. Russo, G. Galati, C. Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.
[3] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection, 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[4] G. Capizzi, G. L. Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.
[5] C. Napoli, G. Pappalardo, E. Tramontana, R. K. Nowicki, J. T. Starczewski, M. Woźniak, Toward work groups classification based on probabilistic neural network approach, volume 9119, 2015, pp. 79–89. doi:10.1007/978-3-319-19324-3_8.
[6] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, volume 3118, 2021, pp. 71–76.
[7] A. Gupta, A. D’Cunha, K. Awasthi, V. Balasubramanian, Daisee: Towards user engagement recognition in the wild, arXiv preprint arXiv:1609.01885 (2016).
[8] Z. Wan, J. He, A. Voisine, An attention level monitoring and alarming system for the driver fatigue in the pervasive environment, in: Brain and Health Informatics: International Conference, BHI 2013, Maebashi, Japan, October 29-31, 2013. Proceedings, Springer, 2013, pp. 287–296.
[9] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[10] C.-M. Chen, J.-Y. Wang, C.-M. Yu, Assessing the attention levels of students by using a novel attention aware system based on brainwave signals, British Journal of Educational Technology 48 (2017) 348–369.
[11] S. Di Palma, A. Tonacci, A. Narzisi, C. Domenici, G. Pioggia, F. Muratori, L. Billeci, M. S. Group, et al., Monitoring of autonomic response to sociocognitive tasks during treatment in children with autism spectrum disorders by wearable technologies: A feasibility study, Computers in Biology and Medicine 85 (2017) 143–152.
[12] O. Dehzangi, C. Williams, Towards multi-modal wearable driver monitoring: Impact of road condition on driver distraction, in: 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN), IEEE, 2015, pp. 1–6.
[13] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information 14 (2023) 644.
[14] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote emdr therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[15] P. Kaur, K. Krishan, S. K. Sharma, T. Kanchan, Facial-recognition algorithms: A literature review, Medicine, Science and the Law 60 (2020) 131–139.
[16] G. De Magistris, M. Romano, J. Starczewski, C. Napoli, A novel dwt-based encoder for human pose estimation, volume 3360, 2022, pp. 33–40.
[17] M. Dewan, M. Murshed, F. Lin, Engagement detection in online learning: a review, Smart Learning Environments 6 (2019) 1–20.
[18] M. Murshed, M. A. A. Dewan, F. Lin, D. Wen, Engagement detection in e-learning environments using convolutional neural networks, in: 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), IEEE, 2019, pp. 80–86.
[19] J. T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: The all convolutional net, arXiv preprint arXiv:1412.6806 (2014).
[20] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400 (2013).
[21] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[22] S. Albawi, T. A. Mohammed, S. Al-Zawi, Understanding of a convolutional neural network, in: 2017 International Conference on Engineering and Technology (ICET), IEEE, 2017, pp. 1–6.
[23] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, J. R. Movellan, The faces of engagement: Automatic recognition of student engagement from facial expressions, IEEE Transactions on Affective Computing 5 (2014) 86–98.
[24] G. Littlewort, J. Whitehill, T. Wu, I. Fasel, M. Frank, J. Movellan, M. Bartlett, Computer expression recognition toolbox, Proc. Automatic Face and Gesture Recognition (FG’11) 20 (2011) 24–25.
[25] P. Viola, M. Jones, et al., Robust real-time object detection, International Journal of Computer Vision 4 (2001) 4.
[26] F. Del Duchetto, P. Baxter, M. Hanheide, Are you still with me? continuous engagement assessment from a robot’s point of view, Frontiers in Robotics and AI 7 (2020) 116.
[27] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[28] J. Liao, Y. Liang, J. Pan, Deep facial spatiotemporal network for engagement prediction in online learning, Applied Intelligence 51 (2021) 6609–6621.
[29] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022) 1–10.
[30] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[31] Z. Hu, C. Lv, P. Hang, C. Huang, Y. Xing, Data-driven estimation of driver attention using calibration-free eye gaze and scene features, IEEE Transactions on Industrial Electronics 69 (2021) 1800–1808.
[32] A. Palazzi, D. Abati, F. Solera, R. Cucchiara, et al., Predicting the driver’s focus of attention: the dr(eye)ve project, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 1720–1733.
[33] E. Cengil, A. Çınar, E. Özbay, Image classification with caffe deep learning framework, in: 2017 International Conference on Computer Science and Engineering (UBMK), IEEE, 2017, pp. 440–444.
[34] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, M. Cifrek, A brief introduction to opencv, in: 2012 Proceedings of the 35th International Convention MIPRO, IEEE, 2012, pp. 1725–1730.
[35] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[36] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[38] L. N. Smith, N. Topin, Super-convergence: Very fast training of neural networks using large learning rates, in: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, SPIE, 2019, pp. 369–386.
[39] D. E. King, Dlib-ml: A machine learning toolkit, The Journal of Machine Learning Research 10 (2009) 1755–1758.
[40] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics 9 (1979) 62–66.
[41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).