Distinguishing Engagement Facets: An Essential Component for AI-based Interactive Healthcare

Hanan Salam 1,2,*
1 New York University Abu Dhabi, PO Box 129188, Saadiyat Island, Abu Dhabi, United Arab Emirates
2 Social Machines & Robotics (SMART) Lab, Center of AI & Robotics (CAIR)
Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
* Corresponding author: hanan.salam@nyu.edu, https://wp.nyu.edu/smartlab/, ORCID 0000-0001-6971-5264

Abstract
Engagement in Human-Machine Interaction is the process by which entities participating in the interaction establish, maintain, and end their perceived connection. It is essential to monitor the engagement state of patients in various AI-based interactive healthcare paradigms. This includes medical conditions that alter social behavior, such as Autism Spectrum Disorder (ASD) or Attention-Deficit/Hyperactivity Disorder (ADHD). Engagement is a multi-faceted construct composed of behavioral, emotional, and mental components. Previous research has neglected this multi-faceted nature of engagement and focused on the detection of an engagement level or a binary engagement label. In this paper, a system is presented to distinguish these facets using contextual and relational features, which can facilitate further fine-grained analysis. Several machine learning classifiers, including traditional and deep learning models, are compared for this task. An F-score of 0.74 was obtained on a balanced dataset of 22242 instances with neural network-based classification. The proposed framework shall serve as a baseline for further research on engagement facets recognition and its integration in socially assistive robotic applications.

Keywords
Engagement Recognition, Interactive Healthcare, Affective Computing, Human-Robot Interaction

1. Introduction

During the last decade, researchers have demonstrated interest in enhancing the capabilities of robots to assist humans in their daily life. This requires incorporating social intelligence within the robots, which involves understanding different states of engagement.

Research in Human-Machine Interaction (HMI) has shown that engagement is a multi-faceted construct that consists of different components [1]. It is therefore important to be able to distinguish the facets before performing a deeper analysis. Corrigan et al. [2] demonstrated that engagement is mainly composed of cognitive and affective components, which are manifested by attention and enjoyment. According to O'Brien et al. [3], engagement is characterized by features such as challenge, positive affect, endurability, aesthetic and sensory appeal, attention, feedback, variety/novelty, interactivity, and perceived user control. In the context of youth engagement in activities, Ramey et al. [4] proposed a model of psychological engagement with three components: cognitive (e.g., thinking or concentrating), affective (e.g., enjoyment), and relational (e.g., connectedness to something). Salam et al. [5] showed that the mental and emotional states of the user related to engagement vary as a function of the current interaction context. These studies suggest that when attempting automatic inference of a user's engagement state, it is important to consider this multi-faceted nature.
Application areas of Assistive Robotics include elderly care [6], helping people with medical conditions that alter social behavior, such as children with Autism Spectrum Disorder (ASD) [7] or people with Attention-Deficit/Hyperactivity Disorder (ADHD) [8], and coaching and tutoring [9, 10]. Fasola and Mataric [11] presented a Socially Assistive Robot (SAR) system designed to engage elderly users in physical exercise. Different variants of the robot's verbal instructions were used to minimize the robot's perceived verbal repetitiveness and thus maintain the users' engagement. Previous engagement detection approaches revolve around a binary classification approach (engaged vs. not engaged) [12, 13] or a multi-class approach (engagement level) [14, 15]. However, the multi-faceted nature of engagement is seldom considered.

In this paper, a framework that takes into account the multi-faceted nature of engagement is proposed. Engagement is modeled in terms of a spectrum of engagement states: mental, behavioural, and emotional. This is the first engagement framework of its kind to propose such a classification of the facets of engagement. Such analysis allows informing the implementation of fine-grained strategies based on a deeper understanding of the user's states. We present a preliminary evaluation of this approach on an off-line multi-party HRI corpus. The corpus was chosen due to the relevance of its interaction scenario (educational followed by competitive context) to the use of AI-based interactive healthcare systems, for instance SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario can be adopted for the characterization of ADHD, since such a context would solicit attention cues which are normally impaired in ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform the initial validation of the framework.

2. Related Work

Engagement in Human-Robot Interaction is defined as the process by which two (or more) participants establish, maintain, and end their perceived connection [16, 17]. Andrist et al. [18] analyzed an HRI dataset in terms of interaction type, quality, problem types, and the system's failure points causing problems. Failure in the engagement component was found to be among the major identified causes of problems during the interaction. This confirms that a highly performing engagement model is essential for the success of any HRI scenario [18].

Bohus and Horvitz [19] pioneered research on engagement in multi-party interaction. They explored disparate engagement strategies to allow robots to engage simultaneously with multiple users. There are multifarious studies based on multi-party interactions. Oertel et al. [20] studied, at both the individual and the group level, the relationship between the participants' gaze and speech behavior. Leite et al. [13] experimented with the generalization capacity of an engagement model: it was trained on single-party and tested on multi-party scenarios, and the opposite setting was also considered. Salam et al. [21] conducted a study on engagement recognition in a triadic HRI scenario and showed that it is possible to infer a participant's engagement state based on the other participants' cues.

Most engagement inference approaches revolved around the identification of a person's intention to engage. There have also been studies to detect whether the person is engaged/disengaged. Benkaouar et al. [22] presented a system to detect disparate engagement phases, including intention to engage, engaged, and disengaged. Foster et al. [12] attempted to detect whether a person intends to engage, which is a bi-class problem. Leite et al. [13] attempted to identify disengagement in both group and individual interactions. Ben-Youssef et al. [23] also presented a system dedicated to a similar cause.
There are different works which focused on detecting different levels of engagement of a user. Michalowski et al. [24] distinguished different levels of engagement, namely present, interacting, engaged, and attending. A system to distinguish two classes of engagement, namely medium-high to high and medium-high to low engagement, was presented by [14]. Bednarik et al. [25] distinguished disparate states of conversational engagement, including no interest, following, responding, conversing, influencing, and managing. They also modelled a bi-class problem with low/high conversational engagement levels. Oertel et al. [20] distinguished 4 classes of group involvement, namely high, low, leader, steering the conversation, and group is forming itself. Two models were developed in [26], focusing on not-engaged/engaged and not-engaged/normally-engaged/very-engaged state distinction. Frank et al. [27] differentiated 6 different states of engagement, namely disengagement, involved engagement, relaxed engagement, intention to act, action, and involved action.

Recently, [28] stated that engagement in HRI should be treated as multi-faceted. Formulating a binary problem (engaged vs. not engaged) or a multi-class problem (engagement level) ignores this multi-faceted nature of engagement. Taking this multi-faceted nature into consideration is very important for the design of intelligent social agents. For instance, it can influence the engagement strategies implemented within the agent's architecture. Some studies attempted to implement different strategies related to task and social engagement. For instance, [29] implemented a task engagement strategy, which focuses on the task at hand and has users meta-cognitively reflect on the robot's performance, and a social engagement strategy, which focuses on their enjoyment and has them meta-cognitively reflect on their emotions with respect to the activity and the group interactions.

Different features have been used to distinguish engagement states, including contextual [30, 14, 21], attentional [31, 32], and affective [14, 12, 26, 33] features, to name a few. Salam et al. [34] used personality to detect both individual and group engagement. [35, 36] combined different aspects such as backchannels, eye gaze, and head-nodding-based features to detect engagement level. Ben-Youssef et al. [23] combined several attributes such as speech and facial expressions, gaze and head motion, and distance to the robot to identify disengagement. Masui et al. [33] worked with facial Action Units and physiological responses.

Recent approaches explored deep learning architectures for the detection of engagement. Dewan et al. [26] used person-independent edge features and Kernel Principal Component Analysis (KPCA) within a deep learning framework to detect online learners' engagement using facial expressions. [37] used CNN and LSTM networks to predict engagement level. [38] proposed adaptive deep architectures for different user groups for predicting engagement in robot-mediated collaborative learning.
Contextual information has been used in social signal processing for quite some time. Kapoor et al. [39] combined context features in the form of game state with facial and posture features in an online educative scenario. Martinez and Yannakakis [40] used sequence mining for the prediction of computer game players' affective states. Castellano et al. [14] explored task and social-based contextual features. In another instance, the authors [41] used the same contextual features for distinguishing interaction quality.

Relational features have proven to be useful in multifarious instances. Curhan et al. [42] used dyad-based cues for predicting negotiation outcomes. Jayagopi et al. [43] adhered to group-based cues to understand typical behavior in small groups. Nguyen et al. [44] extracted relational audio-visual cues to detect the suitability of an applicant in a job interview. The features included audio and visual back-channeling, nodding while speaking, and mutual short utterances and nods. Similarly, [45] used a "looking-while-speaking" feature to understand personality impressions from conversational vlogs extracted from YouTube.

So far, context has been insufficiently investigated in the avenue of affective and cognitive states. Devillers et al. [46] highlight the importance of context in the assessment of engagement. They identified paralinguistic, linguistic, non-verbal, interactional, and specific emotional and mental state-based features as very important for engagement prediction. In this work, we investigate relational and contextual features for the recognition of a spectrum of engagement states. The features are used in isolation as well as in combination to assess their engagement state distinction capability. These features have not been combined previously for detecting engagement facets. Compared to previous works, the proposed features model the interaction context, the robot's behavior, and the behavioral relation between the participant in question and the other entities of the interaction.

3. Need for Engagement Recognition in Interactive Healthcare

Technological advancements have propagated to every field, and there have always been efforts to automate tasks. Healthcare is one of the primal needs of society, and it has also been touched by technology [47]. Several interactive systems have been developed to aid automated healthcare and the well-being of people with medical conditions. In [8], an SAR is proposed whose aim is to help children with ADHD improve their educational outcome through social interaction with a robot. Another educational SAR was presented by [48]; it was targeted towards providing assistance in personalizing education in classrooms. Children with Autism Spectrum Disorder are a target population for such personalized teaching systems [49]. However, most existing systems do not include a user engagement analysis module. Such Socially Assistive systems can largely benefit from a fine-grained analysis of engagement, which would make them more human-like.
There has also been interest in automated screening and consultation to detect problems of the body and mind at an early stage. This can help to reduce the initial load on doctors. It is very important for patients to feel that they are interacting with peers rather than with a machine, and such systems need to process both audio and visual cues in order to properly understand patients. While the patients are interacting with the automated systems, several states of engagement need to be monitored simultaneously, including the level of concentration, different reactions, and spontaneity, to name a few. Such states of engagement portray useful information about a patient's health. These engagement states can be categorized into a broader spectrum of behavioral, mental, and emotional states. Distinguishing the engagement facet is important at the outset for a deeper analysis. This can pave the way for systems that better understand the condition of patients by reading their body language rather than merely matching spoken symptoms. This will be especially useful in treating and understanding mental conditions where body language is a vital aspect. In the case of psychological problems, patients are often engaged by doctors in conversations regarding disparate aspects, wherein the patient's body language serves as a vital pointer towards the mental condition.

4. Proposed Framework

The proposed framework is composed of 3 steps. First, a multi-party HRI corpus is annotated in terms of engagement facets. Then, different contextual and relational features are extracted. Finally, different standard classifiers are used to classify the different engagement facets. Fig. 1 presents an illustration of the proposed framework.

Figure 1: Illustration of the proposed framework with an Artificial Neural Network classifier (input, feature extraction with relational, contextual, and relational+contextual feature sets, and ANN-based classification into the behavioural, emotional, and mental facets).

4.1. Data Corpus

In this section, the data corpus along with the disparate engagement annotations is discussed.

4.2. Interaction Scenario & Modalities

We use 4 interactions of 8 participants from the conversational HRI data corpus 'Vernissage' [50]. It is a multi-party interaction between the humanoid robot NAO (https://www.softbankrobotics.com/emea/en/nao) and 2 participants. The interaction has different contexts which can mainly be differentiated into 2 parts. The 1st is where the robot describes several paintings hung on a wall (informative/educational context). In the 2nd, the robot performs a quiz with the volunteers related to art and culture (competitive context). This was done in order to encompass different variations of the engagement states.

This corpus was chosen since its interaction scenario is relevant to the use of SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario like the one in the first part of the Vernissage scenario can be adopted for the characterization of ADHD, since an educational/informative scenario would solicit attention cues which are normally impaired in ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform our initial validation of the framework.
The average length per interaction is nearly 11 minutes. NAO's internal camera was used to record the clips, providing the front view. 3 other cameras were also used to get the left, right, and rear views. Fig. 2 shows the organization of the recording room. The corpus has annotations for the non-verbal behaviors of the participants. It also contains the robot's speech and actions in the robot's log file.

Figure 2: Organisation of the recording room. NAO (orange), participants' typical positions (gray circles), cameras (HD: red, VICON: blue), wizard feedback (green), paintings (green lines), 2 windows (blue lines), VICON coordinate system (red), head pose calibration positions (P1 and P2).

4.3. Engagement Annotations

Engagement labels were assigned to 3 categories, namely mental, behavioral, and emotional. These were annotated when the participants manifested one of the following states: thinking, listening, positive/negative reaction, responding, waiting for feedback, concentrating, and listening to the other participant. The annotations were performed by 2 people with the aid of the ELAN annotation tool [51] (https://tla.mpi.nl/tools/tla-tools/elan/). They watched every video 2 times (once from the perspective of each participant). Discrete segments were annotated, and a segment was ended as soon as a change was observed. The mean inter-rater Cronbach's Alpha coefficient was 0.93, which points to the reliability of the annotations. The details of each category are as follows.

Mental states – A segment was assigned a mental state label when the participant manifested one of the following mental states:
• Listening (EL): The participant is listening to NAO;
• WaitingFeedback (EWF): The participant is waiting for NAO's feedback after he/she has answered a question;
• Thinking (ETh): The participant is thinking about the response to a question asked by NAO;
• Concentrating (EC): The participant is concentrating with NAO;
• ListeningPerson2 (ELP2): The participant is listening to the other participant who is answering NAO.

Behavioral states – A segment was assigned a behavioral state label when the participant manifested the following behavioral state:
• Responding (ER): The participant is responding to NAO.

Emotional states – A segment was assigned an emotional state label when the participant manifested one of the following emotional states:
• PositiveReaction (EPR): The participant shows a positive reaction to NAO;
• NegativeReaction (ENR): The participant shows a negative reaction to NAO.

The number of annotated instances for each class is presented in Table 1.

Table 1
Details on the number of annotated instances in each class
State         Number of instances
Behavioral    10331
Emotional     7414
Mental        80902
Total         98647
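To make the annotation scheme concrete, the following minimal Python sketch maps the fine-grained state labels listed above onto the three facet classes used as classification targets. The dictionary follows the categories defined in this section; the function name and the assumption that segments arrive as ELAN-exported labels are illustrative, not part of the original pipeline.

```python
# Hypothetical helper: map the fine-grained annotation labels listed above
# onto the three engagement facets used as classification targets.
# Segment labels are assumed to come from an ELAN export of the annotations.

FACET_OF_LABEL = {
    # mental states
    "EL": "mental", "EWF": "mental", "ETh": "mental", "EC": "mental", "ELP2": "mental",
    # behavioral state
    "ER": "behavioral",
    # emotional states
    "EPR": "emotional", "ENR": "emotional",
}

def facet_of(segment_label: str) -> str:
    """Return the engagement facet (mental/behavioral/emotional) of a fine-grained label."""
    return FACET_OF_LABEL[segment_label]

# Example: a segment annotated as WaitingFeedback (EWF) contributes a 'mental' instance.
assert facet_of("EWF") == "mental"
```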
4.4. Extracted Features

In this study, we used the annotated cues from the Vernissage corpus. Moreover, we extracted additional metrics computed from the existing ones. The features were categorized into two categories: 1) contextual and 2) relational. Contextual features deal either with the different entities of the interaction, such as the robot's utterance, addressee, and topic of speech, or with behavioral aspects of the participant that concern the interaction context, such as visual focus of attention and addressee. Relational features encode the behavioral relation between the participants and the robot. Fig. 3 illustrates the feature groups used in our study.

Figure 3: Features illustration: contextual features (Robot, Participant); relational features (Participant-Robot, Participant1-Participant2).

4.4.1. Contextual Features

Interaction between entities involves both the entities and their connection. When inferring the engagement state of an interacting person, we consider the behavior of that person as well as our own behavior. An automated engagement identification system should do the same. Consequently, we employ different contextual features that describe the participant's behavior with respect to the other entities. Moreover, for each dialogue of the robot, we extract the robot's utterance, addressee, and topic of speech.

Participant:
1) Visual Focus Of Attention (VFOA): Gaze in human-human social interactions is considered the primary cue of attention [52, 53]. We use the VFOA ground truth of every participant, which was annotated with 9 labels.
2) VFOA Shifts: Gaze shifts indicate people's engagement/disengagement with specific environmental stimuli [54]. We define a VFOA shift as the moment when a participant shifts attention to a different subject. This feature is binary and is computed from the VFOA labels.
3) Addressee: When addressing somebody, we are engaged with him/her. Similarly, in the context of HRI, when a participant addresses someone other than the robot, he/she is disengaged from the robot. Addressee annotations are taken from the corpus and are annotated into 6 classes: {NoLabel, Nao, Group, PRight, PLeft, Silence}.

Robot: Starting from the robot's conversation logs, the following were extracted.
1) Utterances: The labels {Speech, Silence} were assigned to frames depending on the robot's speech activity.
2) Addressee: The addressee of the robot was detected using predefined words from its speech. The following labels were assigned: {Person1, Person2, GroupExplicit, GroupPerson1, GroupPerson2, Person1Group, Person2Group, Group, Silence}. 'GroupExplicit' refers to segments where the robot was explicitly addressing both participants. 'GroupPersonX', X ∈ {1, 2}, corresponds to segments where the robot addresses the group and then 'PersonX', while 'PersonXGroup' represents the inverse.
3) Topic of Speech: This was identified using a keyword set related to the disparate paintings available in the scene: {manray, warhol, arp, paintings}. Frames were assigned labels based on these keywords.
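As an illustration of two of the contextual features described above, the sketch below derives the binary VFOA-shift feature from a frame-level VFOA label sequence and assigns a topic-of-speech label to a robot utterance using the painting keyword set. The frame-level list representation, function names, and the simple word-matching rule are assumptions made for the example; the paper does not specify the exact implementation.

```python
import numpy as np

def vfoa_shifts(vfoa_labels: list[str]) -> np.ndarray:
    """Binary VFOA-shift feature: 1 at frames where the attention target changes."""
    shifts = np.zeros(len(vfoa_labels), dtype=int)
    for t in range(1, len(vfoa_labels)):
        shifts[t] = int(vfoa_labels[t] != vfoa_labels[t - 1])
    return shifts

TOPIC_KEYWORDS = {"manray", "warhol", "arp", "paintings"}  # keyword set from the corpus

def topic_of_speech(utterance: str) -> str:
    """Label a robot utterance with the painting keyword it mentions, or 'none'."""
    words = utterance.lower().split()
    for kw in TOPIC_KEYWORDS:
        if kw in words:
            return kw
    return "none"

# Example usage (illustrative frame sequence and utterance):
print(vfoa_shifts(["Nao", "Nao", "PLeft", "PLeft", "Nao"]))  # [0 0 1 0 1]
print(topic_of_speech("This painting is by Warhol"))         # warhol
```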
4.4.2. Relational Features

We extract a set of relational features describing the synchrony and alignment of the robot's and participants' behaviors. These include, among others, mutual gaze and laughter. A logical AND operation was applied between the participants' and robot's feature time series to obtain mutual event occurrences. Fig. 4 shows an example of participants' mutual laughter extraction.

Figure 4: Example of relational cues extraction. This corresponds to participants' mutual laughter detection using a logical AND over the laughter time series.

Participant-Robot Features:
1) Gaze-Speech Alignment: We extracted events where a participant looks at objects corresponding to the robot's topic of speech. This indicates that the participant is listening to the robot and is interested in what it is saying.
2) P1 Talks to P2/Robot Speaks: This refers to events where the participants speak with each other during the robot's speech. This may signal a disengagement behavior.

Person1-Person2 Features:
1) Participants Mutual Looks: This refers to events where the participants look at each other. Though this may signal disengagement, it may also signal engagement, as it might be a reaction to the robot's speech.
2) Participants Mutual Laughter: This refers to events where the two participants laugh together. This represents a reaction to the robot's speech.
3) P1 Looks at P2/P2 Talks to Robot: This represents events where the passive participant looks at the active participant while he/she is talking to the robot. Though this may appear to be disengagement, analysis revealed the inverse.

The total number of features is 39: 34 contextual features and 5 relational features.
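The relational cues are described as co-occurrences obtained with a logical AND over binary behaviour time series (Fig. 4). Below is a minimal sketch of this operation, assuming frame-aligned binary arrays; the array names and example values are illustrative.

```python
import numpy as np

def mutual_event(series_a: np.ndarray, series_b: np.ndarray) -> np.ndarray:
    """Frame-wise co-occurrence of two binary behaviour time series (logical AND),
    e.g. participants' mutual laughter or a participant's gaze during robot speech."""
    return np.logical_and(series_a.astype(bool), series_b.astype(bool)).astype(int)

# Illustrative laughter time series for the two participants (1 = laughing at that frame).
laugh_p1 = np.array([0, 1, 1, 1, 0, 0, 1])
laugh_p2 = np.array([0, 0, 1, 1, 1, 0, 0])
print(mutual_event(laugh_p1, laugh_p2))  # [0 0 1 1 0 0 0]
```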
4.5. Engagement Facets Classification

As this is the first work that proposes the classification of engagement facets, namely behavioural, emotional, and mental, it is important to establish a classification baseline. We compare different classifiers for the defined engagement facets classification task. The classifiers include traditional machine learning classifiers such as Bayesian Network (Bayes Net), Naive Bayes, Linear Logistic Regression (LLR), Support Vector Machine (SVM), Radial Basis Function Network (RBF Net), and a simple Artificial Neural Network (ANN). A deep learning classifier, namely a Recurrent Neural Network (RNN), was also used for this classification task. This helps establish an initial understanding of whether traditional machine learning classifiers are sufficient for the task, or whether more sophisticated classification techniques such as deep learning methods are needed.

5. Results and Discussion

The proposed framework is evaluated using 5-fold cross-validation. As the data was highly imbalanced, a subset of the data was drawn with an equal number of instances per class, totalling 22242 instances. The combined features (contextual + relational) for this dataset were used to train the different classifiers presented in Section 4.5.
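A minimal sketch of the baseline protocol described in Sections 4.5 and 5 follows: balance the classes by subsampling and compare several of the listed classifiers with 5-fold cross-validation. The scikit-learn estimators are used here purely as stand-in implementations (the paper does not name a specific library), and X/y are assumed to already hold the 39 numerically encoded features and the facet labels; Bayes Net and RBF Net would require additional tooling and are omitted.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def balance_classes(X, y, seed=0):
    """Subsample every class down to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], n, replace=False) for c in classes])
    return X[idx], y[idx]

CLASSIFIERS = {
    "Naive Bayes": GaussianNB(),
    "LLR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "ANN": MLPClassifier(max_iter=500),
}

def compare(X, y):
    Xb, yb = balance_classes(X, y)
    for name, clf in CLASSIFIERS.items():
        acc = cross_val_score(clf, Xb, yb, cv=5).mean()  # 5-fold cross-validation accuracy
        print(f"{name}: {acc:.4f}")
```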
5.1. Comparative performance analysis of standard classifiers

Table 2 presents the results of training the different classifiers on the combined features. The best performing classifier was the ANN with an accuracy of 74.57%, followed by the Linear Logistic Regression model (70.35%). The performances of the SVM and the Bayes Net were very close (around 69.6%), followed by Naive Bayes (68.61%). Surprisingly, the lowest accuracy of 57.68% was obtained using the deep RNN. This might be due to the fact that the number of samples is not sufficient for training deep neural networks. Consequently, traditional machine learning approaches performed better. It might be worth investigating deep neural networks in future work using a higher number of instances.

Table 2
Comparative analysis of the performance of standard classifiers on the balanced dataset.
Classifier     Accuracy (%)
RNN            57.68
RBF Net        65.13
Naive Bayes    68.61
Bayes Net      69.61
SVM            69.68
LLR            70.35
ANN            74.57

5.2. Performance analysis on engagement facets

We analyse the performance of the best performing classifier (ANN) on the different engagement facets (behavioral, emotional, mental). The corresponding confusion matrix is presented in Table 3. The values of the different performance metrics (true positive rate, false positive rate, precision, recall, and F-score) for each of the classes are presented in Table 4.

Table 3
Confusion matrix for the balanced dataset (rows: predicted class, columns: actual class).
              Behavioral   Emotional   Mental
Behavioral    6347         1318        916
Emotional     384          4893        1153
Mental        683          1203        5345

Table 4
Class-wise values of the performance metrics on the balanced dataset. True Positive Rate (TPR), False Positive Rate (FPR).
Metric       Behavioral   Emotional   Mental
TPR          0.856        0.660       0.721
FPR          0.151        0.104       0.127
Precision    0.740        0.761       0.739
Recall       0.856        0.660       0.721
F-score      0.794        0.707       0.730
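The class-wise metrics in Table 4 can be recomputed directly from the confusion matrix in Table 3 (rows taken as predicted classes, columns as actual classes). The short sketch below does exactly that and reproduces the reported values, e.g. TPR ≈ 0.856 and F-score ≈ 0.794 for the behavioral class.

```python
import numpy as np

# Confusion matrix from Table 3 (rows: predicted class, columns: actual class),
# class order: behavioral, emotional, mental.
cm = np.array([[6347, 1318,  916],
               [ 384, 4893, 1153],
               [ 683, 1203, 5345]])

for i, name in enumerate(["Behavioral", "Emotional", "Mental"]):
    tp = cm[i, i]
    fp = cm[i].sum() - tp        # predicted as class i but actually another class
    fn = cm[:, i].sum() - tp     # actually class i but predicted as another class
    tn = cm.sum() - tp - fp - fn
    tpr = tp / (tp + fn)         # recall
    fpr = fp / (fp + tn)
    prec = tp / (tp + fp)
    f1 = 2 * prec * tpr / (prec + tpr)
    print(f"{name}: TPR={tpr:.3f} FPR={fpr:.3f} Precision={prec:.3f} F-score={f1:.3f}")
# Output matches Table 4 (Behavioral: 0.856 / 0.151 / 0.740 / 0.794, etc.).
```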
It can be noted that the best performance was obtained for the behavioral class, with an F-score of 0.794. This was followed by the mental class, for which an F-score of 0.730 was obtained. The lowest performance was obtained for the emotional class, with an F-score of 0.707. This lower performance for the emotional class might be explained by the fact that the current features might not be highly correlated with the emotional states, and more correlated with the other engagement facets. It might be worth investigating other relevant features in the future.

Looking at the confusion matrix, we can see that the most confused pair was behavioral-emotional: 1318 instances were predicted as behavioral when they actually belonged to the emotional class. This is followed by the mental-emotional pair, with 1203 instances misclassified as mental when their actual label was emotional. Similarly, 1153 mental instances were misclassified as emotional. The high confusion between the mental and emotional engagement states is expected, as these states might exhibit similar non-verbal cues. The confusion between the behavioral and emotional states is less evident. This confusion might be due to the used features, which are not sufficient to precisely predict the emotional states.

6. Conclusions and future work

In this paper, we proposed a system to detect different facets of engagement states: mental, emotional, and behavioral. This is the first engagement framework of its kind to propose such a classification of engagement facets, which is essential for a deeper analysis of the user's engagement by machines. In the context of AI-based healthcare systems such as socially assistive robots, such fine-grained analysis would improve performance and facilitate adaptive interventions. For instance, recognizing whether the user's engagement is emotional, behavioral, or mental might better inform AI-based healthcare systems, especially those that rely on interactive systems (e.g. SAR for ADHD or ASD). The proposed framework was validated on an HRI corpus exhibiting educational and competitive contexts, which are relevant to AI-based interactive systems. The preliminary results show that it is possible to classify engagement facets with a relatively acceptable accuracy. These results shall serve as a baseline for the development of more accurate systems. In the future, we plan to validate the framework on a larger dataset that exhibits an SAR scenario. We plan to work with individual features to improve the system's performance and to perform a fine-grained analysis of the different states. We will also explore deep learning-based and unsupervised approaches towards the detection of engagement state types and, thereafter, finer classification. Deep learning will be used not only for data classification but also for feature extraction.

Acknowledgments

This work is supported in part by the NYUAD Center for Artificial Intelligence and Robotics, funded by Tamkeen under the NYUAD Research Institute Award CG010.
References

[1] H. Salam, O. Celiktutan, H. Gunes, M. Chetouani, Automatic context-driven inference of engagement in HMI: A survey, arXiv preprint arXiv:2209.15370 (2022).
[2] L. J. Corrigan, C. Peters, D. Küster, G. Castellano, Toward Robotic Socially Believable Behaving Systems - Volume I: Modeling Emotions, Springer International Publishing, 2016, pp. 29–51.
[3] H. L. O'Brien, E. G. Toms, What is user engagement? A conceptual framework for defining user engagement with technology, Journal of the American Society for Information Science and Technology 59 (2008) 938–955.
[4] H. L. Ramey, L. Rose-Krasnor, M. A. Busseri, S. Gadbois, A. Bowker, L. Findlay, Measuring psychological engagement in youth activity involvement, Journal of Adolescence 45 (2015) 237–249.
[5] H. Salam, M. Chetouani, A multi-level context-based modeling of engagement in human-robot interaction, in: 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 3, IEEE, 2015, pp. 1–6.
[6] E. Broadbent, C. Jayawardena, N. Kerse, R. Stafford, B. MacDonald, Human-robot interaction research to improve quality of life in elder care – an approach and issues, in: 25th Conference on Artificial Intelligence (AAAI), Workshop on Human-Robot Interaction in Elder Care, 2011, pp. 13–19.
[7] D. Feil-Seifer, U. Viterbi, et al., Development of socially assistive robots for children with autism spectrum disorders, Technical Report, Center for Robotics and Embedded Systems, 2009.
[8] M. Fridin, Y. Yaakobi, Educational robot for children with ADHD/ADD, in: Architectural Design, International Conference on Computational Vision and Robotics, Bhubaneswar, India, 2011, pp. 1–7.
[9] J. Greczek, M. Matarić, Expanding the computational model of graded cueing: Robots encouraging health behavior change, in: 29th AAAI Conference on Artificial Intelligence, 2014, pp. 1–2.
[10] X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2879–2886.
[11] J. Fasola, M. Mataric, A socially assistive robot exercise coach for the elderly, Journal of Human-Robot Interaction 2 (2013) 3–32.
[12] M. E. Foster, A. Gaschler, M. Giuliani, How can I help you? Comparing engagement classification strategies for a robot bartender, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ACM, 2013, pp. 255–262.
[13] I. Leite, M. McCoy, D. Ullman, N. Salomons, B. Scassellati, Comparing models of disengagement in individual and group interactions, in: Proceedings of the 10th Annual ACM/IEEE International Conference on Human-Robot Interaction, ACM, 2015, pp. 99–105.
[14] G. Castellano, I. Leite, A. Pereira, C. Martinho, A. Paiva, P. W. McOwan, Detecting engagement in HRI: An exploration of social and task-based context, in: International Conference on Privacy, Security, Risk and Trust and International Conference on Social Computing, IEEE, 2012, pp. 421–428.
[15] C. Peters, S. Asteriadis, K. Karpouzis, Investigating shared attention with a virtual agent using a gaze-based interface, Journal on Multimodal User Interfaces 3 (2010) 119–130.
[16] C. L. Sidner, C. D. Kidd, C. Lee, N. Lesh, Where to look: a study of human-robot engagement, in: 9th International Conference on Intelligent User Interfaces, ACM, 2004, pp. 78–84.
[17] C. L. Sidner, C. Lee, C. D. Kidd, N. Lesh, C. Rich, Explorations in engagement for humans and robots, Artificial Intelligence 166 (2005) 140–164.
[18] S. Andrist, D. Bohus, E. Kamar, E. Horvitz, What went wrong and why? Diagnosing situated interaction failures in the wild, in: International Conference on Social Robotics, Springer, 2017, pp. 293–303.
[19] D. Bohus, E. Horvitz, Models for multiparty engagement in open-world dialog, in: Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, 2009, pp. 225–234.
[20] C. Oertel, G. Salvi, A gaze-based method for relating group involvement to individual engagement in multimodal multiparty dialogue, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ACM, 2013, pp. 99–106.
[21] H. Salam, M. Chetouani, Engagement detection based on multi-party cues for human-robot interaction, in: International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015, pp. 341–347.
[22] W. Benkaouar, D. Vaufreydaz, Multi-sensors engagement detection with a robot companion in a home environment, in: Workshop on Assistance and Service Robotics in a Human Environment at the IEEE International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 45–52.
[23] A. Ben-Youssef, G. Varni, S. Essid, C. Clavel, On-the-fly detection of user engagement decrease in spontaneous human-robot interaction using recurrent and deep neural networks, International Journal of Social Robotics 11 (2019) 815–828.
[24] M. P. Michalowski, S. Sabanovic, R. Simmons, A spatial model of engagement for a social robot, in: 9th IEEE International Workshop on Advanced Motion Control, IEEE, 2006, pp. 762–767.
[25] R. Bednarik, S. Eivazi, M. Hradis, Gaze and conversational engagement in multiparty video conversation: An annotation scheme and classification of high and low levels of engagement, in: Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, ACM, 2012, pp. 1–6.
[26] M. A. A. Dewan, F. Lin, D. Wen, M. Murshed, Z. Uddin, A deep learning approach to detecting engagement of online learners, in: IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, IEEE, 2018, pp. 1895–1902.
[27] M. Frank, G. Tofighi, H. Gu, R. Fruchter, Engagement detection in meetings, arXiv preprint arXiv:1608.08711 (2016).
[28] L. Devillers, S. Rosset, G. D. Duplessis, L. Bechade, Y. Yemez, B. B. Turker, M. Sezgin, E. Erzin, K. El Haddad, S. Dupont, et al., Multifaceted engagement in social interaction with a machine: The JOKER project, in: 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), IEEE, 2018, pp. 697–701.
[29] L. El Hamamsy, W. Johal, T. Asselborn, J. Nasir, P. Dillenbourg, Learning by collaborative teaching: An engaging multi-party cowriter activity, in: 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), IEEE, 2019, pp. 1–8.
[30] A. Kapoor, R. W. Picard, Y. Ivanov, Probabilistic combination of multiple modalities to detect interest, in: Proceedings of the 17th International Conference on Pattern Recognition, volume 3, IEEE, 2004, pp. 969–972.
[31] S.-S. Yun, M.-T. Choi, M. Kim, J.-B. Song, Intention reading from a fuzzy-based human engagement model and behavioural features, International Journal of Advanced Robotic Systems (2012).
[32] F. Papadopoulos, L. J. Corrigan, A. Jones, G. Castellano, Learner modelling and automatic engagement recognition with robotic tutors, in: Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013, pp. 740–744.
[33] K. Masui, G. Okada, N. Tsumura, Measurement of advertisement effect based on multimodal emotional responses considering personality, ITE Transactions on Media Technology and Applications 8 (2020) 49–59.
[34] H. Salam, O. Celiktutan, I. Hupont, H. Gunes, M. Chetouani, Fully automatic analysis of engagement and its relationship to personality in human-robot interactions, IEEE Access 5 (2016) 705–721.
[35] K. Inoue, D. Lala, K. Takanashi, T. Kawahara, Engagement recognition in spoken dialogue via neural network by aggregating different annotators' models, in: Interspeech, 2018, pp. 616–620.
[36] K. Inoue, D. Lala, K. Takanashi, T. Kawahara, Latent character model for engagement recognition based on multimodal behaviors, in: 9th International Workshop on Spoken Dialogue System Technology, Springer, 2019, pp. 119–130.
[37] F. Del Duchetto, P. Baxter, M. Hanheide, Are you still with me? Continuous engagement assessment from a robot's point of view, arXiv preprint arXiv:2001.03515 (2020).
[38] V. V. Chithrra Raghuram, H. Salam, J. Nasir, B. Bruno, O. Celiktutan, Personalized productive engagement recognition in robot-mediated collaborative learning, in: Proceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 632–641.
[39] A. Kapoor, R. W. Picard, Multimodal affect recognition in learning environments, in: 13th Annual ACM International Conference on Multimedia, ACM, 2005, pp. 677–682.
[40] H. P. Martínez, G. N. Yannakakis, Mining multimodal sequential patterns: a case study on affect detection, in: Proceedings of the 13th International Conference on Multimodal Interfaces, ACM, 2011, pp. 3–10.
[41] G. Castellano, I. Leite, A. Paiva, Detecting perceived quality of interaction with a robot using contextual features, Autonomous Robots (2016) 1–17.
[42] J. R. Curhan, A. Pentland, Thin slices of negotiation: predicting outcomes from conversational dynamics within the first 5 minutes, Journal of Applied Psychology 92 (2007) 802.
[43] D. Jayagopi, D. Sanchez-Cortes, K. Otsuka, J. Yamato, D. Gatica-Perez, Linking speaking and looking behavior patterns with group composition, perception, and performance, in: Proceedings of the 14th ACM International Conference on Multimodal Interaction, ACM, 2012, pp. 433–440.
[44] L. S. Nguyen, D. Frauendorfer, M. S. Mast, D. Gatica-Perez, Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior, IEEE Transactions on Multimedia 6 (2013) 1018–1031.
[45] J. Biel, D. Gatica-Perez, The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs, IEEE Transactions on Multimedia 15 (2013) 41–55.
[46] L. Devillers, G. D. Duplessis, Toward a context-based approach to assess engagement in human-robot social interaction, in: Dialogues with Social Robots, Springer, 2017, pp. 293–301.
[47] E. Thelisson, K. Sharma, H. Salam, V. Dignum, The General Data Protection Regulation: An opportunity for the HCI community?, in: Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–8.
[48] J. Greczek, E. Short, C. E. Clabaugh, K. Swift-Spong, M. Mataric, Socially assistive robotics for personalized education for children, in: AAAI Fall Symposium Series, 2014, pp. 1–3.
[49] C. Kasari, A. Sturm, W. Shih, Smarter approach to personalizing intervention for children with autism spectrum disorder, Journal of Speech, Language, and Hearing Research 61 (2018) 2629–2640.
[50] D. B. Jayagopi, S. Sheikhi, D. Klotz, J. Wienke, J.-M. Odobez, S. Wrede, V. Khalidov, L. Nguyen, B. Wrede, D. Gatica-Perez, The Vernissage corpus: A multimodal human-robot-interaction dataset, Technical Report, Bielefeld University, 2012.
[51] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, H. Sloetjes, ELAN: a professional framework for multimodality research, in: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006, pp. 1556–1559.
[52] M. F. Mason, E. P. Tatkow, C. N. Macrae, The look of love: gaze shifts and person perception, Psychological Science 16 (2005) 236–239.
[53] C. L. Sidner, C. Lee, N. Lesh, Engagement when looking: behaviors for robots when collaborating with people, in: Diabruck: Proceedings of the 7th Workshop on the Semantics and Pragmatics of Dialogue, University of Saarland, 2003, pp. 123–130.
[54] S. Baron-Cohen, Mindblindness: An essay on autism and theory of mind, MIT Press, 1997.