=Paper=
{{Paper
|id=Vol-3359/paper5
|storemode=property
|title=Distinguishing Engagement Facets: An Essential Component for AI-based Interactive Healthcare
|pdfUrl=https://ceur-ws.org/Vol-3359/paper5.pdf
|volume=Vol-3359
|authors=Hanan Salam
|dblpUrl=https://dblp.org/rec/conf/iui/Salam23
}}
==Distinguishing Engagement Facets: An Essential Component for AI-based Interactive Healthcare==
Hanan Salam 1,2,*
1 New York University Abu Dhabi, PO Box 129188, Saadiyat Island, Abu Dhabi, United Arab Emirates
2 Social Machines & Robotics (SMART) Lab, Center of AI & Robotics (CAIR)
Abstract
Engagement in Human-Machine Interaction is the process by which entities participating in the interaction establish, maintain, and end their perceived connection. It is essential to monitor the engagement state of patients in various AI-based interactive healthcare paradigms, including medical conditions that alter social behavior such as Autism Spectrum Disorder (ASD) or Attention-Deficit/Hyperactivity Disorder (ADHD). Engagement is a multi-faceted construct composed of behavioral, emotional, and mental components. Previous research has neglected this multi-faceted nature of engagement and focused on detecting an engagement level or a binary engagement label. In this paper, a system is presented to distinguish these facets using contextual and relational features, which can facilitate further fine-grained analysis. Several machine learning classifiers, including traditional and deep learning models, are compared for this task. An F-score of 0.74 was obtained with neural network-based classification on a balanced dataset of 22242 instances. The proposed framework shall serve as a baseline for further research on engagement facet recognition and its integration in socially assistive robotic applications.
Keywords
Engagement Recognition, Interactive Healthcare, Affective Computing, Human-Robot Interaction
Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
* Corresponding author.
hanan.salam@nyu.edu (H. Salam)
https://wp.nyu.edu/smartlab/ (H. Salam)
ORCID: 0000-0001-6971-5264 (H. Salam)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

During the last decade, researchers have demonstrated interest in enhancing the capabilities of robots to assist humans in their daily life. This requires incorporating social intelligence within the robots, which involves understanding different states of engagement.

Research in Human-Machine Interaction (HMI) has shown that engagement is a multi-faceted construct that consists of different components [1]. It is important to be able to distinguish these facets before performing a deeper analysis. Corrigan et al. [2] demonstrated that engagement is mainly composed of cognitive and affective components, which are manifested by attention and enjoyment. According to O'Brien et al. [3], engagement is characterized by features like challenge, positive affect, endurability, aesthetic and sensory appeal, attention, feedback, variety/novelty, interactivity, and perceived user control. In the context of youth engagement in activities, Ramey et al. [4] proposed a model of psychological engagement with three components: cognitive (e.g., thinking or concentrating), affective (e.g., enjoyment), and relational (e.g., connectedness to something). Salam et al. [5] showed that the mental and emotional states of the user related to engagement vary as a function of the current interaction context. These studies suggest that when attempting automatic inference of a user's engagement state, it is important to consider this multi-faceted nature.

Application areas of Assistive Robotics include elderly care [6], helping people with medical conditions that alter social behavior, such as children with Autism Spectrum Disorder (ASD) [7] or people with Attention-Deficit/Hyperactivity Disorder (ADHD) [8], and coaching and tutoring [9, 10]. Fasola and Mataric [11] presented a Socially Assistive Robot (SAR) system designed to engage elderly users in physical exercise. Different variants of the robot's verbal instructions were used to minimize the robot's perceived verbal repetitiveness and thus maintain the users' engagement. Previous engagement detection approaches revolve around a binary classification approach (engaged vs. not engaged) [12, 13] or a multi-class approach (engagement level) [14, 15]. However, the multi-faceted nature is seldom considered.

In this paper, a framework that takes into account the multi-faceted nature of engagement is proposed. Engagement is modeled in terms of a spectrum of engagement states: mental, behavioural, and emotional. This is the first engagement framework of its kind to propose such a classification of the facets of engagement. Such analysis allows to inform the implementation of fine-grained strategies based on a deeper understanding of the user's states. We present a preliminary evaluation of this approach on an off-line multi-party HRI corpus.
The corpus was chosen due to the relevance of its interaction scenario (an educational context followed by a competitive one) to the use of AI-based interactive healthcare systems, for instance, SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. An educational scenario can be adopted for the characterization of ADHD, since such a context would solicit attention cues, which are normally impaired in ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform the initial validation of the framework.

2. Related Work

Engagement in Human-Robot Interaction is defined as the process by which two (or more) participants establish, maintain, and end their perceived connection [16, 17]. Andrist et al. [18] analyzed an HRI dataset in terms of interaction type, quality, problem types, and the system's failure points causing problems. Failure in the engagement component was found to be among the major identified causes of problems during the interaction. This confirms that a highly performing engagement model is essential for the success of any HRI scenario [18].

Bohus & Horvitz [19] pioneered research on engagement in multi-party interaction. They explored different engagement strategies to allow robots to engage simultaneously with multiple users. There are numerous studies based on multi-party interactions. Oertel et al. [20] studied, at both the individual and the group level, the relationship between the participants' gaze and speech behavior. Leite et al. [13] experimented with the generalization capacity of an engagement model trained on single-party scenarios and tested on multi-party scenarios, and vice versa. Salam et al. [21] conducted a study on engagement recognition in a triadic HRI scenario and showed that it is possible to infer a participant's engagement state based on the other participants' cues.

Most engagement inference approaches revolve around the identification of a person's intention to engage. There have also been studies that detect whether the person is engaged or disengaged. Benkaouar et al. [22] presented a system to detect different engagement phases, including intention to engage, engaged, and disengaged. Foster et al. [12] attempted to detect whether a person intends to engage, which is a binary problem. Leite et al. [13] attempted to identify disengagement in both group and individual interactions. Ben-Youssef et al. [23] also presented a system dedicated to a similar purpose.

Other works focused on detecting different levels of engagement of a user. Michalowski et al. [24] distinguished different levels of engagement, namely present, interacting, engaged, and attending. A system to distinguish two classes of engagement, namely medium-high to high and medium-high to low engagement, was presented by [14]. Bednarik et al. [25] distinguished different states of conversational engagement: no interest, following, responding, conversing, influencing, and managing. They also modelled a binary problem of low/high conversational engagement. Oertel et al. [20] distinguished four classes of group involvement, namely high, low, leader steering the conversation, and group forming itself. Two models were developed in [26], focusing on not-engaged/engaged and not-engaged/normally-engaged/very-engaged state distinction. Frank et al. [27] differentiated six states of engagement: disengagement, involved engagement, relaxed engagement, intention to act, action, and involved action.

Recently, [28] stated that engagement in HRI should be considered multi-faceted. Formulating a binary problem (engaged vs. not engaged) or a multi-class problem (engagement level) ignores this multi-faceted nature of engagement. Taking this multi-faceted nature into consideration is very important for the design of intelligent social agents. For instance, it can influence the engagement strategies implemented within the agent's architecture. Some studies attempted to implement different strategies related to task and social engagement. For instance, [29] implemented a task engagement strategy, which focuses on the task at hand and has users meta-cognitively reflect on the robot's performance, and a social engagement strategy, which focuses on their enjoyment and has them meta-cognitively reflect on their emotions with respect to the activity and the group interactions.

Different features have been used to distinguish engagement states, including contextual [30, 14, 21], attentional [31, 32], and affective [14, 12, 26, 33] features, to name a few. Salam et al. [34] used personality cues to detect both individual and group engagement. [35, 36] combined different aspects such as backchannels, eye gaze, and head-nodding-based features to detect engagement level. Ben-Youssef et al. [23] combined several attributes, such as speech and facial expressions, gaze and head motion, and distance to the robot, to identify disengagement. Masui et al. [33] worked with facial Action Units and physiological responses.

Recent approaches explored deep learning architectures for the detection of engagement. Dewan et al. [26] used person-independent edge features and Kernel Principal Component Analysis (KPCA) within a deep learning framework to detect online learners' engagement using facial expressions. [37] used CNN and LSTM networks to predict engagement level. [38] proposed adaptive deep architectures for different user groups for predicting engagement in robot-mediated collaborative learning.
Contextual information has been used in social signal processing for quite some time. Kapoor et al. [39] combined context features in the form of game state with facial and posture features in an online educational scenario. Martinez and Yannakakis [40] used sequence mining for the prediction of computer game players' affective states. Castellano et al. [14] explored task-based and social-based contextual features. In another instance, the same authors [41] used the same contextual features for distinguishing interaction quality.

Relational features have proven to be useful in numerous instances. Curhan et al. [42] used dyad-based cues for predicting negotiation outcomes. Jayagopi et al. [43] relied on group-based cues to understand typical behavior in small groups. Nguyen et al. [44] extracted relational audio-visual cues to detect the suitability of an applicant in a job interview. The features included audio and visual back-channeling, nodding while speaking, and mutual short utterances and nods. Similarly, [45] used a "looking-while-speaking" feature to understand personality impressions from conversational logs extracted from YouTube.

So far, context has been insufficiently investigated in the area of affective and cognitive states. Devillers et al. [46] highlight the importance of context in the assessment of engagement. They identified paralinguistic, linguistic, non-verbal, interactional, and specific emotional and mental state-based features as very important for engagement prediction. In this work, we investigate relational and contextual features for the recognition of a spectrum of engagement states. The features have been used in isolation as well as in combination to assess their engagement state distinction capability. These features have not been combined previously for detecting engagement facets. Compared to previous works, the proposed features model the interaction context, the robot's behavior, and the behavioral relation between the participant in question and the other entities of the interaction.

3. Need for Engagement Recognition in Interactive Healthcare

Technological advancements have propagated to every field, and there have always been efforts to automate tasks. Healthcare is one of the primal needs of society, and it too has been touched by technology [47]. Several interactive systems have emerged to aid in automated healthcare and the well-being of people with medical conditions. In [8], an SAR is proposed whose aim is to help children with ADHD improve their educational outcomes through social interaction with a robot. Another educational SAR was presented by [48]. It was targeted towards providing assistance in personalizing education in classrooms. Children with Autism Spectrum Disorder are a target population for such personalized teaching systems [49]. However, most existing systems do not include a user engagement analysis module. Such Socially Assistive systems can largely benefit from a fine-grained analysis of engagement, which would make them more human-like. There has also been interest in automated screening and consultation to detect problems of the body and mind at an early stage, which can help reduce the initial load on doctors. It is very important for the patients to feel that they are interacting with their peers rather than with a machine. The systems need to process both audio and visual cues in order to properly understand patients. While the patients are interacting with the automated systems, several states of engagement need to be monitored simultaneously, including level of concentration, different reactions, and spontaneity, to name a few. Such states of engagement convey useful information about a patient's health. These engagement states can be categorized into a broader spectrum of behavioral, mental, and emotional states. Distinguishing the engagement facet is important at the outset for a deeper analysis. This can pave the way for systems that better understand the condition of patients by reading their body language rather than merely matching spoken symptoms. This will especially be useful in treating and understanding mental conditions where body language is a vital aspect. In the case of psychological problems, patients are often engaged in conversations regarding different topics by doctors, wherein the patient's body language serves as a vital pointer towards the mental condition.

4. Proposed Framework

The proposed framework is composed of 3 steps. First, a multi-party HRI corpus is annotated in terms of engagement facets. Then, different contextual and relational features are extracted. Finally, different standard classifiers are used to classify the different engagement facets. Fig. 1 presents an illustration of the proposed framework.
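To make the three-step pipeline concrete, the following minimal sketch outlines it in Python under stated assumptions: the frame dictionary keys, the toy values, and the two cues per feature group are hypothetical placeholders, not the authors' implementation.

    # Minimal sketch of the framework: annotated corpus -> contextual + relational features -> facet label.
    # Dictionary keys and values are illustrative; the real features are described in Sections 4.4.1 and 4.4.2.
    import numpy as np

    def frame_to_features(frame):
        """Concatenate contextual and relational cues of one annotated frame into a feature vector."""
        contextual = [frame["robot_speaking"], frame["vfoa_shift"]]      # contextual cues (Section 4.4.1)
        relational = [frame["mutual_look"], frame["mutual_laughter"]]    # relational cues (Section 4.4.2)
        return np.array(contextual + relational, dtype=float)

    frames = [
        {"robot_speaking": 1, "vfoa_shift": 0, "mutual_look": 0, "mutual_laughter": 0, "facet": "mental"},
        {"robot_speaking": 0, "vfoa_shift": 1, "mutual_look": 1, "mutual_laughter": 1, "facet": "emotional"},
    ]
    X = np.stack([frame_to_features(f) for f in frames])
    y = np.array([f["facet"] for f in frames])
    # A standard classifier (Section 4.5) is then trained on (X, y).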
Figure 1: Illustration of the proposed framework with an Artificial Neural Network classifier. Input features (relational, contextual, or relational+contextual) undergo feature extraction and artificial neural network-based classification into the engagement facets (behavioural, emotional, mental).

4.1. Data Corpus

In this section, the data corpus along with the different engagement annotations is discussed.

4.2. Interaction Scenario & Modalities

We use 4 interactions of 8 participants from the conversational HRI data corpus 'Vernissage' [50]. It is a multi-party interaction between the humanoid robot NAO (https://www.softbankrobotics.com/emea/en/nao) and 2 participants. The interaction has different contexts, which can mainly be differentiated into 2 parts. In the 1st, the robot describes several paintings hung on a wall (informative/educational context). In the 2nd, the robot performs a quiz with the volunteers related to art and culture (competitive context). This was done in order to encompass different variations of the engagement states.

This corpus was chosen since its interaction scenario is relevant to the use of SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario like the one in the first part of the Vernissage scenario can be adopted for the characterization of ADHD, since an educational/informative scenario would solicit attention cues, which are normally impaired in ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform our initial validation of the framework.

The average length per interaction is nearly 11 minutes. NAO's internal camera was used to record the clips, providing the front view. 3 other cameras were also used to capture the left, right, and rear views. Fig. 2 shows the organization of the recording room. The corpus has annotations for the non-verbal behaviors of the participants. It also contains the robot's speech and actions in the robot's log file.

Figure 2: Organisation of the recording room. NAO (orange), participants' typical positions (gray circles), cameras (HD: red, VICON: blue), wizard feedback (green), paintings (green lines), windows (blue lines), VICON coordinate system (red), head pose calibration positions (P1 and P2).

4.3. Engagement Annotations

Engagement labels were assigned to 3 categories, namely mental, behavioral, and emotional. These were annotated when the participants manifested one of the following states: thinking, listening, positive/negative reaction, responding, waiting for feedback, concentrating, and listening to the other participant. The annotations were performed by 2 people with the aid of the Elan annotation tool (https://tla.mpi.nl/tools/tla-tools/elan/) [51]. They watched every video 2 times (once from the perspective of each participant). Discrete segments were annotated, and a segment was stopped as soon as a change was observed. The mean inter-rater Cronbach's Alpha coefficient was 0.93, which points to the reliability of the annotations.
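For reference, the agreement measure used here is Cronbach's alpha; its standard definition for k raters (here k = 2) is the textbook formula below, not one given in the paper:

    \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

where \sigma^{2}_{Y_i} is the variance of rater i's ratings and \sigma^{2}_{X} is the variance of the summed ratings. The reported value of 0.93 is well above the commonly cited 0.7 acceptability threshold.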
The details of each category are as follows.

Mental states – A segment was assigned a mental state label when the participant manifested one of the following mental states:

• Listening (EL): The participant is listening to NAO;
• WaitingFeedback (EWF): The participant is waiting for NAO's feedback after he/she has answered a question;
• Thinking (ETh): The participant is thinking about the response to a question asked by NAO;
• Concentrating (EC): The participant is concentrating with NAO;
• ListeningPerson2 (ELP2): The participant is listening to the other participant who is answering NAO.

Behavioral states – A segment was assigned a behavioral state label when the participant manifested the following behavioral state:

• Responding (ER): The participant is responding to NAO.

Emotional states – A segment was assigned an emotional state label when the participant manifested one of the following emotional states:

• PositiveReaction (EPR): The participant shows a positive reaction to NAO.
• NegativeReaction (ENR): The participant shows a negative reaction to NAO.

The details regarding the number of annotated instances for each class are presented in Table 1.

Table 1
Details on the number of annotated instances in each class.
State        Number of instances
Behavioral   10331
Emotional    7414
Mental       80902
Total        98647

4.4. Extracted Features

In this study, we used the annotated cues from the Vernissage corpus. Moreover, we extracted additional metrics computed from the existing ones. They were categorized into two categories: 1) contextual and 2) relational. Contextual features deal either with the different entities of an interaction, like the robot's utterance, addressee, and topic of speech, or with behavioral aspects of the participant that concern the interaction context, like visual focus of attention and addressee. Relational features encode the behavioral relation between the participants and the robot. Fig. 3 illustrates the feature groups used in our study.

Figure 3: Features illustration: contextual features (Robot, Participant); relational features (Participant-Robot, Participant1-Participant2).

4.4.1. Contextual Features

Interaction between entities involves both the entities and their connection. When inferring the engagement state of an interacting person, we consider the behavior of that person as well as our own behavior; an automated engagement identification system should do the same. Consequently, we employ different contextual features that describe the participant's behavior with respect to the other entities. Moreover, for a dialogue of the robot, we extract the robot's utterance, addressee, and topic of speech.

Participant:

1) Visual Focus Of Attention (VFOA): Gaze in human-human social interactions is considered the primary cue of attention [52, 53]. We use the VFOA ground truth of every participant, which was annotated with 9 labels.
2) VFOA Shifts: Gaze shifts indicate people's engagement/disengagement with specific environmental stimuli [54]. We define a VFOA shift as the moment when a participant shifts attention to a different subject. This feature is binary and is computed from the VFOA labels (see the sketch after this list).
3) Addressee: When addressing somebody, we are engaged with him/her. Similarly, in the context of HRI, when a participant addresses someone other than the robot, he/she is disengaged from the robot. Addressee annotations are used from the corpus and are annotated into 6 classes: {NoLabel, Nao, Group, PRight, PLeft, Silence}.
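As a minimal sketch of how the binary VFOA-shift feature can be derived from per-frame VFOA labels (the label names below are illustrative, not the corpus' actual 9 labels):

    # Derive the binary VFOA-shift feature from a sequence of per-frame VFOA labels.
    # The label names are illustrative only.
    import numpy as np

    def vfoa_shifts(vfoa_labels):
        """Return a 0/1 array marking frames where the visual focus of attention changes."""
        vfoa = np.asarray(vfoa_labels)
        shifts = np.zeros(len(vfoa), dtype=int)
        shifts[1:] = (vfoa[1:] != vfoa[:-1]).astype(int)  # 1 when the target differs from the previous frame
        return shifts

    print(vfoa_shifts(["nao", "nao", "painting", "painting", "other_participant"]))
    # -> [0 0 1 0 1]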
Robot: Starting from the robot's conversation logs, the following were extracted.

1) Utterances: The labels {Speech, Silence} were assigned to frames depending on the robot's speech activity.
2) Addressee: The addressee of the robot was detected using predefined words from its speech. The following labels were assigned: {Person1, Person2, GroupExplicit, GroupPerson1, GroupPerson2, Person1Group, Person2Group, Group, Silence}. The 'GroupExplicit' label refers to segments where the robot was explicitly addressing both participants. 'GroupPersonX', X ∈ {1, 2}, corresponds to segments where the robot addresses the group and then 'PersonX', while 'PersonXGroup' represents the inverse.
3) Topic of Speech: This was identified using a keyword set related to the different paintings available in the scene: {manray, warhol, arp, paintings}. Frames were allotted labels based on these keywords.
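As an illustration of this keyword-based labeling, the sketch below assigns one of the four topic labels to a robot utterance; only the topic label set comes from the paper, while the example keyword lists and the fallback label are assumptions.

    # Sketch of keyword-based topic-of-speech labeling of the robot's utterances.
    # Only the labels {manray, warhol, arp, paintings} come from the paper; keywords are assumed.
    TOPIC_KEYWORDS = {
        "manray": ["man ray"],
        "warhol": ["warhol"],
        "arp": ["arp"],
        "paintings": ["painting", "paintings", "artwork"],
    }

    def topic_of_speech(utterance):
        text = utterance.lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            if any(kw in text for kw in keywords):
                return topic
        return "none"  # frames with no painting-related keyword

    print(topic_of_speech("This painting was made by Andy Warhol."))  # -> 'warhol'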
4.4.2. Relational Features

We extract a set of relational features describing the synchrony and alignment of the robot's and participants' behaviors. These include, among others, mutual gaze and laughter. A logical AND operation was applied between the participants' and the robot's feature time series to obtain mutual event occurrences. Fig. 4 shows an example of participants' mutual laughter extraction.

Figure 4: Example of relational cues extraction. This corresponds to participants' mutual laughter detection using a logical AND over the laughter time series.

Participant-Robot Features:

1) Gaze-Speech Alignment: We extracted events where a participant looks at objects corresponding to the robot's topic of speech. This indicates that the participant is listening to the robot and is interested in what it is saying.
2) P1 Talks to P2/Robot Speaks: This refers to events where the participants speak with each other during the robot's speech. This may signal disengagement behavior.

Person1-Person2 Features:

1) Participants Mutual Looks: This refers to events where the participants look at each other. Though this may signal disengagement, it may also signal engagement, as it might be a reaction to the robot's speech.
2) Participants Mutual Laughter: This refers to events where the two participants laugh together. This represents a reaction to the robot's speech.
3) P1 Looks at P2/P2 Talks to Robot: This represents events where the passive participant looks at the active participant while he/she is talking to the robot. Though this may appear to be disengagement, analysis revealed the inverse.

The total number of features is 39: 34 contextual features and 5 relational features.
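A minimal sketch of this mutual-event extraction, assuming the cues are available as frame-aligned binary time series (the example values are illustrative):

    # Relational cue extraction via a logical AND of aligned binary time series,
    # e.g. participants' mutual laughter. Frame alignment is assumed.
    import numpy as np

    p1_laughs = np.array([0, 1, 1, 1, 0, 0, 1])
    p2_laughs = np.array([0, 0, 1, 1, 0, 1, 1])

    mutual_laughter = np.logical_and(p1_laughs, p2_laughs).astype(int)
    print(mutual_laughter)  # -> [0 0 1 1 0 0 1]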
4.5. Engagement Facets Classification

As this is the first work to propose the classification of engagement facets, namely behavioural, emotional, and mental, it is important to establish a classification baseline.

We compare different classifiers for the defined engagement facets classification task. The classifiers include traditional machine learning classifiers such as Bayesian Network (Bayes Net), Naive Bayes, Linear Logistic Regression (LLR), Support Vector Machine (SVM), Radial Basis Function Network (RBF Net), and a simple Artificial Neural Network (ANN). A deep learning classifier, namely a Recurrent Neural Network (RNN), was also used for this classification task. This helps establish an initial understanding of whether traditional machine learning classifiers are sufficient for the task, or whether more sophisticated classification techniques such as deep learning methods are needed.
5. Results and Discussion

The proposed framework is evaluated using 5-fold cross-validation. As the data was highly imbalanced, a subset of the data was drawn with an equal number of instances per class, totalling 22242 instances. The combined features (contextual+relational) for this dataset were used to train the different classifiers presented in Section 4.5.

5.1. Comparative performance analysis of standard classifiers

Table 2 presents the results of training the different classifiers on the combined features. From the table, we can state that the best performing classifier was the ANN with an accuracy of 74.57%, followed by the Linear Logistic Regression model (70.35%). The performance of SVM and Bayes Net were very close (around 69.6%), followed by Naive Bayes (68.61%). Surprisingly, the lowest accuracy of 57.68% was obtained using the deep RNN. This might be due to the fact that the number of samples is not sufficient for training deep neural networks. Consequently, traditional machine learning approaches performed better. It might be worthwhile to investigate deep neural networks in future work using a higher number of instances.

Table 2
Comparative analysis of the performance of standard classifiers on the balanced dataset.
Classifier    Accuracy (%)
RNN           57.68
RBF Net       65.13
Naive Bayes   68.61
Bayes Net     69.61
SVM           69.68
LLR           70.35
ANN           74.57

5.2. Performance analysis on engagement facets

We analyse the performance of the best performing classifier (ANN) on the different engagement facets (behavioral, emotional, mental). The corresponding confusion matrix is presented in Table 3. The values for different performance metrics (true positive rate, false positive rate, precision, recall, and F-score) for each of the classes are also presented in Table 4.

Table 3
Confusion matrix for the balanced dataset (rows: predicted class, columns: actual class).
             Behavioral   Emotional   Mental
Behavioral   6347         1318        916
Emotional    384          4893        1153
Mental       683          1203        5345

Table 4
Class-wise values of the performance metrics on the balanced dataset. True Positive Rate (TPR), False Positive Rate (FPR).
Metric       Behavioral   Emotional   Mental
TPR          0.856        0.660       0.721
FPR          0.151        0.104       0.127
Precision    0.740        0.761       0.739
Recall       0.856        0.660       0.721
F-score      0.794        0.707       0.730

It is noted that the best performance was obtained for the behavioral class, with an F-score of 0.794. This was followed by the mental class, for which an F-score of 0.730 was obtained. The lowest performance was obtained for the emotional class, with an F-score of 0.707. This lower performance for the emotional class might be explained by the fact that the current features might not be highly correlated with the emotional states, and more correlated with the other engagement facets. It might be worthwhile to investigate other relevant features in the future.

Looking at the confusion matrix, we can see that the most confused pair was behavioral-emotional, where 1318 instances were predicted as behavioral when they actually belonged to the emotional class. This is followed by the mental-emotional pair, with 1203 instances mis-classified as mental when their actual label was emotional. Similarly, 1153 mental instances were mis-classified as emotional. The high confusion between the mental and emotional engagement states is expected, as these states might exhibit similar non-verbal cues. The confusion between the behavioral and emotional states is less evident. This confusion might be due to the used features, which are not sufficient to precisely predict the emotional states.
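As a consistency check, the per-class precision, recall, and F-scores of Table 4 can be recomputed from the confusion matrix of Table 3 under the rows-as-predicted, columns-as-actual reading used in the text:

    # Recompute per-class precision, recall and F-score from the Table 3 confusion matrix
    # (rows: predicted class, columns: actual class; order: behavioral, emotional, mental).
    import numpy as np

    cm = np.array([[6347, 1318,  916],
                   [ 384, 4893, 1153],
                   [ 683, 1203, 5345]])

    tp = np.diag(cm)
    precision = tp / cm.sum(axis=1)   # per predicted-class row
    recall    = tp / cm.sum(axis=0)   # per actual-class column
    f_score   = 2 * precision * recall / (precision + recall)

    for name, p, r, f in zip(["Behavioral", "Emotional", "Mental"], precision, recall, f_score):
        print(f"{name}: precision={p:.3f}, recall={r:.3f}, F-score={f:.3f}")
    # Matches Table 4: F-scores of 0.794, 0.707 and 0.730 respectively.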
6. Conclusions and future work

In this paper, we proposed a system to detect different facets of engagement states: mental, emotional, and behavioral. This is the first engagement framework of its kind to propose such a classification of engagement facets. This is essential for a deeper analysis of the user's engagement by machines. In the context of AI-based healthcare systems such as socially assistive robots, such fine-grained analysis would improve performance and facilitate adaptive interventions. For instance, recognizing whether the user's engagement is emotional, behavioral, or mental might better inform AI-based healthcare systems, especially those that rely on interactive systems (e.g., SAR for ADHD or ASD). The proposed framework was validated on an HRI corpus exhibiting educational and competitive contexts, which are relevant to AI-based interactive systems. The preliminary results show that it is possible to classify engagement facets with a relatively acceptable accuracy. These results shall serve as a baseline for the development of more accurate systems. In future work, we plan to validate the framework on a larger dataset that exhibits an SAR scenario. We plan to work with individual features to improve the system's performance and perform a fine-grained analysis of the different states. We will also explore deep learning-based and unsupervised approaches towards the detection of engagement state types and thereafter finer classification. Deep learning will be used not only for data classification but also for feature extraction.
Acknowledgments
This work is supported in part by the NYUAD Center for
Artificial Intelligence and Robotics, funded by Tamkeen
under the NYUAD Research Institute Award CG010.
References

[1] H. Salam, O. Celiktutan, H. Gunes, M. Chetouani, Automatic context-driven inference of engagement in HMI: A survey, arXiv preprint arXiv:2209.15370 (2022).
[2] L. J. Corrigan, C. Peters, D. Küster, G. Castellano, Toward Robotic Socially Believable Behaving Systems - Volume I: Modeling Emotions, Springer International Publishing, 2016, pp. 29–51.
[3] H. L. O'Brien, E. G. Toms, What is user engagement? A conceptual framework for defining user engagement with technology, Journal of the American Society for Information Science and Technology 59 (2008) 938–955.
[4] H. L. Ramey, L. Rose-Krasnor, M. A. Busseri, S. Gadbois, A. Bowker, L. Findlay, Measuring psychological engagement in youth activity involvement, Journal of Adolescence 45 (2015) 237–249.
[5] H. Salam, M. Chetouani, A multi-level context-based modeling of engagement in human-robot interaction, in: 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 3, IEEE, 2015, pp. 1–6.
[6] E. Broadbent, C. Jayawardena, N. Kerse, R. Stafford, B. MacDonald, Human-robot interaction research to improve quality of life in elder care – an approach and issues, in: 25th Conference on Artificial Intelligence (AAAI), Workshop on Human-Robot Interaction in Elder Care, 2011, pp. 13–19.
[7] D. Feil-Seifer, U. Viterbi, et al., Development of socially assistive robots for children with autism spectrum disorders, Technical Report, Center for Robotics and Embedded Systems, 2009.
[8] M. Fridin, Y. Yaakobi, Educational robot for children with ADHD/ADD, in: Architectural Design, International Conference on Computational Vision and Robotics, Bhubaneswar, India, 2011, pp. 1–7.
[9] J. Greczek, M. Matarić, Expanding the computational model of graded cueing: Robots encouraging health behavior change, in: 29th AAAI Conference on Artificial Intelligence, 2014, pp. 1–2.
[10] X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2879–2886.
[11] J. Fasola, M. Mataric, A socially assistive robot exercise coach for the elderly, Journal of Human-Robot Interaction 2 (2013) 3–32.
[12] M. E. Foster, A. Gaschler, M. Giuliani, How can I help you? Comparing engagement classification strategies for a robot bartender, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ACM, 2013, pp. 255–262.
[13] I. Leite, M. McCoy, D. Ullman, N. Salomons, B. Scassellati, Comparing models of disengagement in individual and group interactions, in: Proceedings of the 10th Annual ACM/IEEE International Conference on Human-Robot Interaction, ACM, 2015, pp. 99–105.
[14] G. Castellano, I. Leite, A. Pereira, C. Martinho, A. Paiva, P. W. McOwan, Detecting engagement in HRI: An exploration of social and task-based context, in: International Conference on Privacy, Security, Risk and Trust and International Conference on Social Computing, IEEE, 2012, pp. 421–428.
[15] C. Peters, S. Asteriadis, K. Karpouzis, Investigating shared attention with a virtual agent using a gaze-based interface, Journal on Multimodal User Interfaces 3 (2010) 119–130.
[16] C. L. Sidner, C. D. Kidd, C. Lee, N. Lesh, Where to look: A study of human-robot engagement, in: 9th International Conference on Intelligent User Interfaces, ACM, 2004, pp. 78–84.
[17] C. L. Sidner, C. Lee, C. D. Kidd, N. Lesh, C. Rich, Explorations in engagement for humans and robots, Artificial Intelligence 166 (2005) 140–164.
[18] S. Andrist, D. Bohus, E. Kamar, E. Horvitz, What went wrong and why? Diagnosing situated interaction failures in the wild, in: International Conference on Social Robotics, Springer, 2017, pp. 293–303.
[19] D. Bohus, E. Horvitz, Models for multiparty engagement in open-world dialog, in: Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, 2009, pp. 225–234.
[20] C. Oertel, G. Salvi, A gaze-based method for relating group involvement to individual engagement in multimodal multiparty dialogue, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ACM, 2013, pp. 99–106.
[21] H. Salam, M. Chetouani, Engagement detection based on multi-party cues for human robot interaction, in: International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015, pp. 341–347.
[22] W. Benkaouar, D. Vaufreydaz, Multi-sensors engagement detection with a robot companion in a home environment, in: Workshop on Assistance and Service Robotics in a Human Environment at the IEEE International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 45–52.
[23] A. Ben-Youssef, G. Varni, S. Essid, C. Clavel, On-the-fly detection of user engagement decrease in spontaneous human–robot interaction using recurrent and deep neural networks, International Journal of Social Robotics 11 (2019) 815–828.
[24] M. P. Michalowski, S. Sabanovic, R. Simmons, A spatial model of engagement for a social robot, in: 9th IEEE International Workshop on Advanced Motion Control, IEEE, 2006, pp. 762–767.
[25] R. Bednarik, S. Eivazi, M. Hradis, Gaze and conversational engagement in multiparty video conversation: An annotation scheme and classification of high and low levels of engagement, in: Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, ACM, 2012, pp. 1–6.
[26] M. A. A. Dewan, F. Lin, D. Wen, M. Murshed, Z. Uddin, A deep learning approach to detecting engagement of online learners, in: IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, IEEE, 2018, pp. 1895–1902.
[27] M. Frank, G. Tofighi, H. Gu, R. Fruchter, Engagement detection in meetings, arXiv preprint arXiv:1608.08711 (2016).
[28] L. Devillers, S. Rosset, G. D. Duplessis, L. Bechade, Y. Yemez, B. B. Turker, M. Sezgin, E. Erzin, K. El Haddad, S. Dupont, et al., Multifaceted engagement in social interaction with a machine: The JOKER project, in: 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), IEEE, 2018, pp. 697–701.
[29] L. El Hamamsy, W. Johal, T. Asselborn, J. Nasir, P. Dillenbourg, Learning by collaborative teaching: An engaging multi-party cowriter activity, in: 28th IEEE International Conference on Robot and Human Interactive Communication (ROMAN), IEEE, 2019, pp. 1–8.
[30] A. Kapoor, R. W. Picard, Y. Ivanov, Probabilistic combination of multiple modalities to detect interest, in: Proceedings of the 17th International Conference on Pattern Recognition, volume 3, IEEE, 2004, pp. 969–972.
[31] S.-S. Yun, M.-T. Choi, M. Kim, J.-B. Song, Intention reading from a fuzzy-based human engagement model and behavioural features, International Journal of Advanced Robotic Systems (2012).
[32] F. Papadopoulos, L. J. Corrigan, A. Jones, G. Castellano, Learner modelling and automatic engagement recognition with robotic tutors, in: Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII 2013), 2013, pp. 740–744.
[33] K. Masui, G. Okada, N. Tsumura, Measurement of advertisement effect based on multimodal emotional responses considering personality, ITE Transactions on Media Technology and Applications 8 (2020) 49–59.
[34] H. Salam, O. Celiktutan, I. Hupont, H. Gunes, M. Chetouani, Fully automatic analysis of engagement and its relationship to personality in human-robot interactions, IEEE Access 5 (2016) 705–721.
[35] K. Inoue, D. Lala, K. Takanashi, T. Kawahara, Engagement recognition in spoken dialogue via neural network by aggregating different annotators' models, in: Interspeech, 2018, pp. 616–620.
[36] K. Inoue, D. Lala, K. Takanashi, T. Kawahara, Latent character model for engagement recognition based on multimodal behaviors, in: 9th International Workshop on Spoken Dialogue System Technology, Springer, 2019, pp. 119–130.
[37] F. Del Duchetto, P. Baxter, M. Hanheide, Are you still with me? Continuous engagement assessment from a robot's point of view, arXiv preprint arXiv:2001.03515 (2020).
[38] V. V. Chithrra Raghuram, H. Salam, J. Nasir, B. Bruno, O. Celiktutan, Personalized productive engagement recognition in robot-mediated collaborative learning, in: Proceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 632–641.
[39] A. Kapoor, R. W. Picard, Multimodal affect recognition in learning environments, in: 13th Annual ACM International Conference on Multimedia, ACM, 2005, pp. 677–682.
[40] H. P. Martínez, G. N. Yannakakis, Mining multimodal sequential patterns: A case study on affect detection, in: Proceedings of the 13th International Conference on Multimodal Interfaces, ACM, 2011, pp. 3–10.
[41] G. Castellano, I. Leite, A. Paiva, Detecting perceived quality of interaction with a robot using contextual features, Autonomous Robots (2016) 1–17.
[42] J. R. Curhan, A. Pentland, Thin slices of negotiation: Predicting outcomes from conversational dynamics within the first 5 minutes, Journal of Applied Psychology 92 (2007) 802.
[43] D. Jayagopi, D. Sanchez-Cortes, K. Otsuka, J. Yamato, D. Gatica-Perez, Linking speaking and looking behavior patterns with group composition, perception, and performance, in: Proceedings of the 14th ACM International Conference on Multimodal Interaction, ACM, 2012, pp. 433–440.
[44] L. S. Nguyen, D. Frauendorfer, M. S. Mast, D. Gatica-Perez, Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior, IEEE Transactions on Multimedia 6 (2013) 1018–1031.
[45] J. Biel, D. Gatica-Perez, The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs, IEEE Transactions on Multimedia 15 (2013) 41–55.
[46] L. Devillers, G. D. Duplessis, Toward a context-based approach to assess engagement in human-robot social interaction, in: Dialogues with Social Robots, Springer, 2017, pp. 293–301.
[47] E. Thelisson, K. Sharma, H. Salam, V. Dignum, The
general data protection regulation: An opportunity
for the hci community?, in: Extended Abstracts
of the 2018 CHI Conference on Human Factors in
Computing Systems, 2018, pp. 1–8.
[48] J. Greczek, E. Short, C. E. Clabaugh, K. Swift-Spong,
M. Mataric, Socially assistive robotics for personal-
ized education for children, in: AAAI Fall Sympo-
sium Series, 2014, pp. 1–3.
[49] C. Kasari, A. Sturm, W. Shih, Smarter approach to
personalizing intervention for children with autism
spectrum disorder, Journal of Speech, Language,
and Hearing Research 61 (2018) 2629–2640.
[50] D. B. Jayagopi, S. Sheikhi, D. Klotz, J. Wienke, J.-M.
Odobez, S. Wrede, V. Khalidov, L. Nguyen, B. Wrede,
D. Gatica-Perez, The vernissage corpus: A multi-
modal human-robot-interaction dataset, Technical
Report, Bielefeld University, 2012.
[51] P. Wittenburg, H. Brugman, A. Russel, A. Klass-
mann, H. Sloetjes, Elan: a professional frame-
work for multimodality research, in: Proceedings
of 5th the International Conference on Language
Resources and Evaluation (LREC), 2006, pp. 1556–
1559.
[52] M. F. Mason, E. P. Tatkow, C. N. Macrae, The look
of love gaze shifts and person perception, Psycho-
logical Science 16 (2005) 236–239.
[53] C. L. Sidner, C. Lee, N. Lesh, Engagement when
looking: behaviors for robots when collaborating
with people, in: Diabruck: Proceedings of the 7th
workshop on the Semantic and Pragmatics of Dia-
logue, University of Saarland, 2003, pp. 123–130.
[54] S. Baron-Cohen, Mindblindness: An essay on
autism and theory of mind, MIT press, 1997.