<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distinguishing Engagement Facets: An Essential Component for AI-based Interactive Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanan Salam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University Abu Dhabi</institution>
          ,
          <addr-line>PO Box 129188, Saadiyat Island, Abu Dhabi</addr-line>
          ,
          <country country="AE">United Arab Emirates</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Social Machines &amp; Robotics (SMART) Lab, Center of AI &amp; Robotics</institution>
          ,
          <addr-line>CAIR</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Engagement in Human-Machine Interaction is the process by which entities participating in the interaction establish, maintain, and end their perceived connection. It is essential to monitor the engagement state of patients in various AI-based interactive healthcare paradigms. This includes medical conditions that alter social behavior, such as Autism Spectrum Disorder (ASD) or Attention-Deficit/Hyperactivity Disorder (ADHD). Engagement is a multi-faceted construct composed of behavioral, emotional, and mental components. Previous research has neglected this multi-faceted nature of engagement and focused on the detection of an engagement level or a binary engagement label. In this paper, a system is presented to distinguish these facets using contextual and relational features. This can facilitate further fine-grained analysis. Several machine learning classifiers, including traditional and deep learning models, are compared for this task. An F-Score of 0.74 was obtained on a balanced dataset of 22242 instances with neural network-based classification. The proposed framework shall serve as a baseline for further research on engagement facet recognition and its integration in socially assistive robotic applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Engagement Recognition</kwd>
        <kwd>Interactive Healthcare</kwd>
        <kwd>Affective Computing</kwd>
        <kwd>Human-Robot Interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia. * Corresponding author: hanan.salam@nyu.edu, https://wp.nyu.edu/smartlab/, ORCID 0000-0001-6971-5264 (H. Salam). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>The corpus was chosen due to the relevance of its interaction scenario (educational followed by competitive context) to the use of AI-based interactive healthcare systems, for instance SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario can be adopted for the characterization of ADHD, since such a context would solicit attention cues which are normally impaired in the case of ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform the initial validation of the framework.</p>
      <p>2. Related Work</p>
      <p>Engagement in Human-Robot Interaction is defined as the process by which two (or more) participants establish, maintain, and end their perceived connection [16, 17]. Andrist et al. [18] analyzed an HRI dataset in terms of interaction type, quality, problem types, and the system's failure points causing problems. Failure in the engagement component was found to be among the major identified problem causes during the interaction. This confirms that a highly performing engagement model is essential for the success of any HRI scenario [18].</p>
      <p>Bohus &amp; Horvitz [19] pioneered research on engagement in multi-party interaction. They explored disparate engagement strategies to allow robots to engage simultaneously with multiple users. There are multifarious studies based on multi-party interactions. Oertel et al. [20] studied, at both the individual and the group level, the relationship between the participants' gaze and speech behavior. Leite et al. [<xref ref-type="bibr" rid="ref16">13</xref>] experimented with the generalization capacity of an engagement model: it was trained and tested on single-party and multi-party scenarios respectively, and the opposite scenario was also considered. Salam et al. [21] conducted a study on engagement recognition in a triadic HRI scenario and showed that it is possible to infer a participant's engagement state based on the other participants' cues.</p>
      <p>Most engagement inference approaches revolved around the identification of a person's intention to engage. There have also been studies to detect whether the person is engaged/disengaged. Benkaouar et al. [22] presented a system to detect disparate engagement phases, including intention to engage, engaged, and disengaged. Foster et al. [12] attempted to detect whether a person intends to engage, which is a bi-class problem. Leite et al. [<xref ref-type="bibr" rid="ref16">13</xref>] attempted to identify disengagement in both group and individual interactions. Ben et al. [23] also presented a system dedicated to a similar cause.</p>
      <p>Different works focused on detecting different levels of engagement of a user. Michalowski et al. [24] distinguished different levels of engagement, namely present, interacting, engaged, and just attending. A system to distinguish two classes of engagement, namely medium-high to high and medium-high to low engagement, was presented by [14]. Bednarik et al. [25] distinguished disparate states of conversational engagement, including no interest, following, responding, conversing, influencing, and managing; they also modelled a bi-class problem with low/high conversational level. Oertel et al. [20] distinguished 4 classes for group involvement, namely high, low, leader steering the conversation, and group forming itself. Two models were developed in [26], focusing on not-engaged/engaged and not-engaged/normally-engaged/very-engaged state distinction. Frank et al. [27] differentiated 6 different states of engagement, namely disengagement, involved engagement, relaxed engagement, intention to act, action, and involved action.</p>
      <p>Recently, [28] stated that engagement in HRI should be considered multi-faceted. Formulating a binary problem (engaged vs. not engaged) or a multi-class problem (engagement level) ignores this multi-faceted nature of engagement. Taking the multi-faceted nature into consideration is very important for the design of intelligent social agents; for instance, it can influence the engagement strategies implemented within the agent's architecture. Some studies attempted to implement different strategies related to task and social engagement. For instance, [29] implemented a task engagement strategy, which focuses on the task at hand and has users meta-cognitively reflect on the robot's performance, and a social engagement strategy, which focuses on users' enjoyment and has them meta-cognitively reflect on their emotions with respect to the activity and the group interactions.</p>
      <p>Different features have been used to distinguish engagement states, including contextual [30, 14, 21], attentional [31, 32], and affective [14, 12, 26, 33] features, to name a few. Salam et al. [34] used personality to detect both individual and group engagement. [35, 36] combined different aspects like backchannels, eye gaze, and head nodding-based features to detect engagement level. Ben et al. [23] combined several attributes, like speech and facial expressions, gaze and head motion, and distance to the robot, to identify disengagement. Masui et al. [33] worked with facial Action Units and physiological responses. Recent approaches explored deep learning architectures for the detection of engagement. Dewan et al. [26] used person-independent edge features and Kernel Principal Component Analysis (KPCA) within a deep learning framework to detect online learners' engagement using facial expressions. [37] used CNN and LSTM networks to predict engagement level. [38] proposed adaptive deep architectures for different user groups for predicting engagement in robot-mediated collaborative learning.</p>
      <p>Contextual information has been used in social signal processing for quite some time. Kapoor et al. [39] combined context features in the form of game state with facial and posture features in an online educative scenario. Martinez and Yannakakis [40] used sequence mining for the prediction of computer game players' affective states. Castellano et al. [14] explored task and social-based contextual features. In another instance, the authors [41] used the same contextual features for distinguishing interaction quality.</p>
      <p>Relational features have proven to be useful in multifarious instances. Curhan et al. [42] used dyad-based cues for predicting negotiation outcomes. Jayagopi et al. [43] adhered to group-based cues to understand typical behavior in small groups. Nguyen et al. [44] extracted relational audio-visual cues to detect the suitability of an applicant in a job interview; the features included audio and visual back-channeling, nodding while speaking, and mutual short utterances and nods. In [45], a “looking-while-speaking” feature was used to understand personality impressions from conversational logs extracted from YouTube.</p>
      <p>So far, context has been insufficiently investigated in the avenue of affective and cognitive states. Devillers et al. [46] highlight the importance of context in the assessment of engagement. They identified paralinguistic, linguistic, non-verbal, interactional, and specific emotional and mental state-based features as very important for engagement prediction. In this work, we investigate relational and contextual features for the recognition of a spectrum of engagement states. The features have been used in isolation as well as in combination to assess their engagement state distinction capability. These features have not been combined previously for detecting engagement facets. Compared to previous works, the proposed features model the interaction context, the robot's behavior, and the behavioral relation between the participant in question and the other entities of the interaction.</p>
      <p>Children with Autism Spectrum Disorder are a target population for personalized teaching systems [49]. However, most existing systems do not include a user engagement analysis module. Such Socially Assistive systems can largely benefit from a fine-grained analysis of engagement, which would make them more human-like. There has also been interest in automated screening and consultation to detect problems of the body and mind at an early stage, which can reduce the initial load on doctors. It is very important for the patients to feel that they are interacting with their peers rather than with a machine. Such systems need to process both audio and visual cues in order to properly understand patients. While the patients are interacting with the automated systems, several states of engagement need to be monitored simultaneously, including level of concentration, different reactions, and spontaneity, to name a few. Such states of engagement portray useful information about a patient's health. These engagement states can be categorized into a broader spectrum of behavioral, mental, and emotional states. Distinguishing the engagement facet is important at the outset for a deeper analysis. This can pave the way for systems which better understand the condition of patients by reading their body language rather than merely matching spoken symptoms. This will especially be useful in treating and understanding mental conditions where body language is a vital aspect. In the case of psychological problems, patients are often engaged in conversations regarding disparate aspects by doctors, wherein the patient's body language serves as a vital pointer towards the mental condition.</p>
      <p>4. Proposed Framework</p>
      <sec id="sec-1-1">
        <title>The proposed framework is composed of 3 steps.</title>
        <p>First, a multi-party HRI corpus is annotated in terms of engagement facets. Then, different contextual and relational features are extracted. Finally, different standard classifiers are used to classify the different engagement facets. Fig. 1 presents an illustration of the proposed framework.</p>
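        <p>The three steps above can be sketched end-to-end as follows. This is a minimal illustration, not the authors' implementation: the feature values, hyperparameters, and helper names are assumptions.</p>
        <preformat>
```python
# Sketch of the 3-step pipeline: annotated corpus -> feature
# extraction -> engagement facet classification.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FACETS = ["behavioral", "emotional", "mental"]

def classify_facets(X_train, y_train, X_test):
    """Train an ANN on extracted features, then predict facets."""
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                      random_state=0),
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Toy usage with random 39-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 39))
y = np.array(FACETS * 20)
preds = classify_facets(X, y, X[:5])
```
        </preformat>
        <p>In practice, X would hold the contextual and relational features of Section 4.4 and y the annotated facet labels of Section 4.3.</p>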
        <sec id="sec-1-1-1">
          <title>4.1. Data Corpus</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Need for Engagement Recognition in Interactive Healthcare</title>
      <p>Technological advancements have propagated to every field. There have always been efforts to automate tasks. Healthcare is one of the primal needs of society, and it too has been touched by technology [47]. Several interactive systems have come up to aid automated healthcare and the well-being of people with medical conditions. In [8], an SAR is proposed whose aim is to help children with ADHD to improve their educational outcome through social interaction with a robot. Another educational SAR was presented by [48]; it was targeted towards providing assistance in personalizing education in classrooms.</p>
      <p>In this section, the data corpus along with the disparate engagement annotations is discussed.</p>
      <p>4.2. Interaction Scenario &amp; Modalities</p>
      <p>We use 4 interactions of 8 participants from the conversational HRI data corpus ‘Vernissage’ [50]. It is a multi-party interaction amidst the humanoid robot NAO and 2 participants. The interaction has different contexts which can mainly be differentiated into 2 parts. The 1st is where the robot describes several paintings hung on a wall (informative/educational context). In the 2nd, the robot performs a quiz with the volunteers related to art and culture (competitive context). This was done in order to encompass different variations of the engagement states.</p>
      <p>[Fig. 1: overview of the proposed framework: input feature sets (relational, contextual, relational+contextual), feature extraction, and artificial neural network-based classification of the engagement facets.]</p>
      <p>This corpus was chosen since its interaction scenario is relevant to the use of SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario like the one in the first part of the Vernissage corpus can be adopted for the characterization of ADHD, since an educational/informative scenario would solicit attention cues which are normally impaired in the case of ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform our initial validation of the framework. NAO's internal camera was used to record the clips.</p>
      <p>This provided the front view. 3 other cameras were also used to get the left, right, and rear feeds. Fig. 2 shows the organization of the recording room.</p>
      <p>The corpus has annotations for the non-verbal behaviors of the participants. It also contains the robot's speech and actions in the robot's log file.</p>
      <sec id="sec-2-1">
        <title>4.3. Engagement Annotations</title>
        <p>Engagement labels were assigned to 3 categories, namely mental, behavioral, and emotional. These were annotated when the participants manifested one of the following states: thinking, listening, positive/negative reaction, responding, waiting for feedback, concentrating, and listening to the other participant. The annotations were performed by 2 people with the aid of the Elan annotation tool (https://tla.mpi.nl/tools/tla-tools/elan/) [51]. They watched every video 2 times, once from the perspective of each participant. Discrete segments were annotated; a segment was ended as soon as a change of state was observed. The mean inter-rater Cronbach's Alpha coefficient was 0.93. This points to the reliability of the annotations. The details of each category are as follows.</p>
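        <p>The reported reliability figure can be reproduced with a standard Cronbach's Alpha computation. A minimal sketch, assuming the two annotators' per-segment ratings are available as numeric lists (the rating values below are illustrative):</p>
        <preformat>
```python
# Inter-rater reliability via Cronbach's Alpha:
# alpha = k/(k-1) * (1 - sum(rater variances) / variance of totals)
from statistics import pvariance

def cronbach_alpha(ratings):
    """ratings: one list of per-segment ratings per rater."""
    k = len(ratings)                              # number of raters
    item_vars = sum(pvariance(r) for r in ratings)
    totals = [sum(seg) for seg in zip(*ratings)]  # per-segment sums
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Two raters in perfect agreement yield alpha = 1.0.
rater1 = [1, 2, 3, 2, 1, 3, 2]
rater2 = [1, 2, 3, 2, 1, 3, 2]
alpha = cronbach_alpha([rater1, rater2])
```
        </preformat>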
        <sec id="sec-2-1-1">
          <title>Mental states – A segment was assigned a mental state label when the participant manifested one of the following mental states:</title>
          <p>• Listening (EL): The participant is listening to NAO;</p>
          <p>• WaitingFeedback (EWF): The participant is waiting for NAO's feedback after he/she had answered a question;</p>
          <p>• Thinking (ETh): The participant is thinking about the response to a question asked by NAO;</p>
          <p>• Concentrating (EC): The participant is concentrating with NAO.</p>
          <p>The average length per interaction is nearly 11 minutes.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.4. Extracted Features</title>
        <sec id="sec-2-2-1">
          <title>In this study, we used the annotated cues from the Vernissage corpus.</title>
          <p>Moreover, we extracted additional metrics computed from the existing cues. They were categorized into two categories: 1) contextual and 2) relational.</p>
          <p>Contextual features deal either with the different entities of an interaction, like the robot's utterance, addressee, and topic of speech, or with behavioral aspects of the participant that concern the interaction context, like the visual focus of attention and the addressee.</p>
          <p>Relational features encode the behavioral relation between the participants and the robot. Fig. 3 illustrates the feature groups used in our study.</p>
          <p>4.4.1. Contextual Features</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Interaction amidst entities involves both the entities and their connection.</title>
          <p>While inferring the engagement state of an interacting person, we consider the behavior of the person as well as our own behavior. Thus, an automated engagement identification system should also consider the same.</p>
          <p>Consequently, we employ different contextual features that describe the participant's behavior with respect to the other entities. Moreover, for each dialogue of the robot, we extract the robot's utterance, addressee, and topic of speech.</p>
          <p>Participant:</p>
          <p>1) Visual Focus Of Attention (VFOA): Gaze in human-human social interactions is considered the primary cue of attention [52, 53]. We use the VFOA ground truth of every participant, which was annotated with 9 labels.</p>
          <p>2) VFOA Shifts: Gaze shifts indicate people's engagement/disengagement with specific environmental stimuli [54]. We define a VFOA shift as the moment when a participant shifts attention to a different subject. This feature is binary and is computed from the VFOA labels.</p>
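          <p>The shift feature described above can be sketched directly from a label sequence. The label strings below are illustrative, not the corpus' actual 9 VFOA labels:</p>
          <preformat>
```python
# Binary VFOA-shift indicator: 1 whenever the annotated focus
# label differs from the previous frame's label, 0 otherwise.
def vfoa_shifts(labels):
    """Compute the per-frame shift flags from a VFOA label sequence."""
    return [0] + [int(cur != prev)
                  for prev, cur in zip(labels, labels[1:])]

# Hypothetical frame labels for one participant.
frames = ["nao", "nao", "painting1", "painting1", "right_participant"]
shifts = vfoa_shifts(frames)  # -> [0, 0, 1, 0, 1]
```
          </preformat>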
        </sec>
        <sec id="sec-2-2-3">
          <title>3) Addressee</title>
          <p>When addressing somebody, we are engaged with him/her. Similarly, in the context of HRI, when a participant addresses someone other than the robot, he/she is disengaged from the robot. Addressee annotations are used from the corpus and are annotated into 6 classes: {NoLabel, Nao, Group, PRight, PLeft, Silence}.</p>
          <p>Robot: Starting from the robot’s conversation logs, the
following were extracted.</p>
        </sec>
        <sec id="sec-2-2-4">
          <title>1) Utterances</title>
          <p>The labels {Speech, Silence} were assigned to frames depending on the robot's speech activity.</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>2) Addressee</title>
          <p>The addressee of the robot was detected using predefined words from its speech. The following labels were assigned: {Person1, Person2, GroupExplicit, GroupPerson1, GroupPerson2, Person1Group, Person2Group, Group, Silence}. The ‘GroupExplicit’ label refers to segments where the robot was explicitly addressing both participants. ‘GroupPersonX’, X ∈ {1, 2}, corresponds to segments where the robot addresses the group and then ‘PersonX’, while ‘PersonXGroup’ represents the inverse.</p>
          <p>3) Topic of Speech: This was identified using a keyword set related to the disparate paintings available in the scene: {manray, warhol, arp, paintings}. Frames were allotted labels based on these keywords.</p>
          <p>3) P1 Looks at P2 / P2 Talks to Robot: This represents events where the passive participant looks at the active participant while he/she is talking to the robot. Though this may appear to be disengagement, analysis revealed the inverse.</p>
        </sec>
        <sec id="sec-2-2-6">
          <title>The total number of features is 39.</title>
          <p>There were 34 contextual features and 5 relational features.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.5. Engagement Facets Classification</title>
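        <p>Section 5 compares an ANN, Linear Logistic Regression, SVM, Bayes Net, Naive Bayes, and a deep RNN. A minimal scikit-learn sketch of such a comparison (the hyperparameters are assumptions, and the Bayes Net and RNN models are omitted here):</p>
        <preformat>
```python
# Sketch of a classifier comparison like the one in Table 2.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "ANN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                         random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
}

def compare(X, y, folds=5):
    """Mean cross-validated accuracy per classifier."""
    return {name: cross_val_score(clf, X, y, cv=folds).mean()
            for name, clf in CLASSIFIERS.items()}

# Toy usage on synthetic feature vectors and facet labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = np.where(X[:, 0] > 0, "behavioral", "emotional")
scores = compare(X, y)
```
        </preformat>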
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results and Discussion</title>
      <p>The proposed framework is evaluated using 5-fold cross-validation. As the data was highly imbalanced, a subset of the data was drawn having an equal number of instances per class, totalling 22242 instances. The combined features (contextual+relational) for this dataset were used to train the different classifiers presented in section 4.5.</p>
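      <p>The class-balancing step described above can be sketched as downsampling every facet to the size of the rarest one. The sampling strategy and seed are assumptions:</p>
      <preformat>
```python
# Draw an equal number of instances per class before cross-validation.
import random
from collections import defaultdict

def balance(instances, labels, seed=0):
    """Downsample every class to the size of the rarest class."""
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    n = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    bal_x, bal_y = [], []
    for y, xs in sorted(by_class.items()):
        for x in rng.sample(xs, n):
            bal_x.append(x)
            bal_y.append(y)
    return bal_x, bal_y

# Toy usage: 60/25/15 instances become 15 per class (45 total).
X = list(range(100))
y = ["behavioral"] * 60 + ["emotional"] * 25 + ["mental"] * 15
Xb, yb = balance(X, y)
```
      </preformat>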
      <sec id="sec-3-1">
        <title>5.1. Comparative performance analysis of standard classifiers</title>
        <p>Table 2 presents the results of training different classifiers on the combined features. From the table, we can state that the best performing classifier was the ANN, with an accuracy of 74.57%, followed by the Linear Logistic Regression model (70.35%). The performances of the SVM and the Bayes Net were very close (69.6%), followed by Naive Bayes (68.61%). Surprisingly, the lowest accuracy of 57.68% was obtained using the deep RNN. This might be due to the fact that the number of samples is not sufficient.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions and future work</title>
      <p>We analyse the performance of the best performing classifier (ANN) on the different engagement facets (behavioral, emotional, mental). The corresponding confusion matrix is presented in Table 3. The values of different performance metrics (true positive rate, false positive rate, precision, recall, and F-score) for each of the classes are also presented in Table 4.</p>
      <p>It is noted that the best performance was obtained for the behavioral class, with an F-score of 0.794. This was followed by the mental class, where an F-score of 0.730 was obtained. The lowest performance was obtained for the emotional class, with an F-score of 0.707. This lower performance for the emotional class might be explained by the fact that the current features might not be highly correlated with the emotional states, and more correlated with the other engagement facets. It might be worth investigating other relevant features in the future.</p>
      <p>Looking at the confusion matrix, we can see that the most confused pair was behavioral-emotional, where 1318 instances were predicted as behavioral when they actually belonged to the emotional class. This is followed by the mental-emotional pair, with 1203 instances mis-classified as mental when their actual label was emotional. Similarly, 1153 mental instances were mis-classified as emotional. The high confusion between the mental and emotional engagement states is expected, as these states might exhibit similar non-verbal cues. The confusion between the behavioral and emotional states is less evident; it might be due to the used features, which are not sufficient to precisely predict the emotional states.</p>
      <p>In this paper, we proposed a system to detect different facets of engagement states: mental, emotional, and behavioral. This is the first engagement framework of its kind to propose such a classification of engagement. This is essential for a deeper analysis of the user's engagement by machines. In the context of AI-based healthcare systems such as socially assistive robots, such fine-grained analysis would improve performance and facilitate adaptive interventions. For instance, recognizing whether the user's engagement is emotional, behavioral, or mental might better inform AI-based healthcare systems, especially those that rely on interactive systems (e.g. SAR for ADHD or ASD). The proposed framework was validated on an HRI corpus exhibiting educational and competitive contexts, which are relevant to AI-based interactive systems. The preliminary results show that it is possible to classify engagement facets with a relatively acceptable accuracy. These results shall serve as a baseline for the development of more accurate systems. In the future, we plan to validate the framework on a larger dataset that exhibits an SAR scenario. We plan to work with individual features to improve the system's performance and perform an in-depth analysis of the different states. We will also explore deep learning-based and unsupervised approaches towards the detection of engagement state types and thereafter finer classification. Deep learning will be used not only for data classification but also for feature extraction.</p>
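      <p>The per-class metrics in Tables 3 and 4 can be derived from a confusion matrix as follows; the counts below are illustrative only, not the paper's Table 3:</p>
      <preformat>
```python
# Per-class precision, recall and F-score from a confusion matrix
# whose rows are true labels and whose columns are predictions.
def per_class_metrics(conf, classes):
    metrics = {}
    for i, name in enumerate(classes):
        tp = conf[i][i]
        fp = sum(conf[r][i] for r in range(len(classes))) - tp
        fn = sum(conf[i]) - tp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        metrics[name] = (precision, recall, f1)
    return metrics

classes = ["behavioral", "emotional", "mental"]
conf = [[80, 10, 10],   # illustrative counts only
        [15, 70, 15],
        [10, 20, 70]]
metrics = per_class_metrics(conf, classes)
```
      </preformat>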
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported in part by the NYUAD Center for Artificial Intelligence and Robotics, funded by Tamkeen under the NYUAD Research Institute Award CG010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          9th IEEE International Workshop on Advanced Mo- robot interactions,
          <source>IEEE Access 5</source>
          (
          <year>2016</year>
          )
          <fpage>705</fpage>
          -
          <lpage>721</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
            <given-names>Control</given-names>
          </string-name>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>762</fpage>
          -
          <lpage>767</lpage>
          . [35]
          <string-name>
            <given-names>K.</given-names>
            <surname>Inoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takanashi</surname>
          </string-name>
          , T. Kawahara, En[25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bednarik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eivazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hradis</surname>
          </string-name>
          ,
          <article-title>Gaze and conver- gagement recognition in spoken dialogue via neural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>tion: An annotation scheme and classification of els</article-title>
          ., in: Interspeech,
          <year>2018</year>
          , pp.
          <fpage>616</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>high and low levels of engagement</article-title>
          , in: Proceed- [36]
          <string-name>
            <given-names>K.</given-names>
            <surname>Inoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takanashi</surname>
          </string-name>
          , T. Kawahara, Latent
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>ings of the 4th Workshop on Eye Gaze in Intelligent character model for engagement recognition based</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Human</given-names>
            <surname>Machine</surname>
          </string-name>
          <string-name>
            <surname>Interaction</surname>
          </string-name>
          , ACM,
          <year>2012</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . on multimodal behaviors, in: 9th International [26]
          <string-name>
            <given-names>M. A. A.</given-names>
            <surname>Dewan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Wen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Murshed</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Uddin</surname></string-name>,
          <article-title>A deep learning approach to detecting engagement of online learners</article-title>
          , in:
          <source>IEEE SmartWorld, Ubiquitous Intelligence &amp; Computing, Advanced &amp; Trusted Computing, Scalable Computing &amp; Communications, Cloud &amp; Big Data Computing, Internet of People and Smart City Innovation</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1895</fpage>
          -
          <lpage>1902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27]
          <string-name><given-names>M.</given-names> <surname>Frank</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Tofighi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gu</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Fruchter</surname></string-name>,
          <article-title>Engagement detection in meetings</article-title>, arXiv preprint
          <source>arXiv:1608.08711</source> (<year>2016</year>).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28]
          <string-name><given-names>L.</given-names> <surname>Devillers</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Rosset</surname></string-name>,
          <string-name><given-names>G. D.</given-names> <surname>Duplessis</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bechade</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yemez</surname></string-name>,
          <string-name><given-names>B. B.</given-names> <surname>Turker</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sezgin</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Erzin</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>El Haddad</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Dupont</surname></string-name>, et al.,
          <article-title>Multifaceted engagement in social interaction with a machine: The JOKER project</article-title>, in:
          <source>13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG)</source>, IEEE,
          <year>2018</year>, pp. <fpage>697</fpage>-<lpage>701</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29]
          <string-name><given-names>L.</given-names> <surname>El Hamamsy</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Johal</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Asselborn</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Nasir</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dillenbourg</surname></string-name>,
          <article-title>Learning by collaborative teaching: An engaging multi-party cowriter activity</article-title>, in:
          <source>28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)</source>, IEEE,
          <year>2019</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30]
          <string-name><given-names>A.</given-names> <surname>Kapoor</surname></string-name>,
          <string-name><given-names>R. W.</given-names> <surname>Picard</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Ivanov</surname></string-name>,
          <article-title>Probabilistic combination of multiple modalities to detect interest</article-title>, in:
          <source>Proceedings of the 17th International Conference on Pattern Recognition</source>, volume <volume>3</volume>, IEEE,
          <year>2004</year>, pp. <fpage>969</fpage>-<lpage>972</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31]
          <string-name><given-names>S.-S.</given-names> <surname>Yun</surname></string-name>,
          <string-name><given-names>M.-T.</given-names> <surname>Choi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>J.-B.</given-names> <surname>Song</surname></string-name>,
          <article-title>Intention reading from a fuzzy-based human engagement model and behavioural features</article-title>,
          <source>International Journal of Advanced Robotic Systems</source> (<year>2012</year>).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32]
          <string-name><given-names>F.</given-names> <surname>Papadopoulos</surname></string-name>,
          <string-name><given-names>L. J.</given-names> <surname>Corrigan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Castellano</surname></string-name>,
          <article-title>Learner modelling and automatic engagement recognition with robotic tutors</article-title>, in:
          <source>2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII 2013)</source>, IEEE,
          <year>2013</year>, pp. <fpage>740</fpage>-<lpage>744</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33]
          <string-name><given-names>K.</given-names> <surname>Masui</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Okada</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Tsumura</surname></string-name>,
          <article-title>Measurement …</article-title>,
          <source>ITE Transactions on Media Technology and Applications</source> <volume>8</volume> (<year>2020</year>) <fpage>49</fpage>-<lpage>59</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34]
          <string-name><given-names>H.</given-names> <surname>Salam</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Celiktutan</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Hupont</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gunes</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chetouani</surname></string-name>,
          <article-title>Fully automatic analysis of engagement and its relationship to personality in human-robot interactions</article-title>,
          <source>IEEE Access</source> <volume>5</volume> (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] …, in:
          <source>Workshop on Spoken Dialogue System Technology</source>, Springer,
          <year>2019</year>, pp. <fpage>119</fpage>-<lpage>130</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37]
          <string-name><given-names>F.</given-names> <surname>Del Duchetto</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Baxter</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hanheide</surname></string-name>,
          <article-title>Are you still with me? Continuous engagement assessment from a robot's point of view</article-title>,
          <source>arXiv:2001.03515</source> (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38]
          <string-name><given-names>V. V.</given-names> <surname>Chithrra Raghuram</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Salam</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Nasir</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Bruno</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Celiktutan</surname></string-name>,
          <article-title>Personalized productive engagement recognition in robot-mediated collaborative learning</article-title>, in:
          <source>Proceedings of the 2022 International Conference on Multimodal Interaction</source>,
          <year>2022</year>, pp. <fpage>632</fpage>-<lpage>641</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39]
          <string-name><given-names>A.</given-names> <surname>Kapoor</surname></string-name>,
          <string-name><given-names>R. W.</given-names> <surname>Picard</surname></string-name>,
          <article-title>Multimodal affect recognition in learning environments</article-title>, in:
          <source>13th annual ACM international conference on Multimedia</source>, ACM,
          <year>2005</year>, pp. <fpage>677</fpage>-<lpage>682</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40]
          <string-name><given-names>H. P.</given-names> <surname>Martínez</surname></string-name>,
          <string-name><given-names>G. N.</given-names> <surname>Yannakakis</surname></string-name>,
          <article-title>Mining multimodal sequential patterns: a case study on affect detection</article-title>, in:
          <source>Proceedings of the 13th international conference on multimodal interfaces</source>, ACM,
          <year>2011</year>, pp. <fpage>3</fpage>-<lpage>10</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41]
          <string-name><given-names>G.</given-names> <surname>Castellano</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Leite</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Paiva</surname></string-name>,
          <article-title>Detecting perceived quality of interaction with a robot using contextual features</article-title>,
          <source>Autonomous Robots</source> (<year>2016</year>) <fpage>1</fpage>-<lpage>17</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42]
          <string-name><given-names>J. R.</given-names> <surname>Curhan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pentland</surname></string-name>,
          <article-title>Thin slices of negotiation: Predicting outcomes from conversational dynamics within the first 5 minutes</article-title>,
          <source>Journal of Applied Psychology</source> <volume>92</volume> (<year>2007</year>) <fpage>802</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43]
          <string-name><given-names>D.</given-names> <surname>Jayagopi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Sanchez-Cortes</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Otsuka</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yamato</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Gatica-Perez</surname></string-name>,
          <article-title>Linking speaking and looking behavior patterns with group composition, perception, and performance</article-title>, in:
          <source>Proceedings of the 14th ACM international conference on Multimodal interaction</source>, ACM,
          <year>2012</year>, pp. <fpage>433</fpage>-<lpage>440</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>[44]
          <string-name><given-names>L. S.</given-names> <surname>Nguyen</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Frauendorfer</surname></string-name>,
          <string-name><given-names>M. S.</given-names> <surname>Mast</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Gatica-Perez</surname></string-name>,
          <article-title>Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior</article-title>,
          <source>IEEE Transactions on Multimedia</source> <volume>16</volume> (<year>2014</year>) <fpage>1018</fpage>-<lpage>1031</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>[45]
          <string-name><given-names>J.</given-names> <surname>Biel</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Gatica-Perez</surname></string-name>,
          <article-title>The youtube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs</article-title>,
          <source>IEEE Transactions on Multimedia</source> <volume>15</volume> (<year>2013</year>) <fpage>41</fpage>-<lpage>55</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>[46]
          <string-name><given-names>L.</given-names> <surname>Devillers</surname></string-name>,
          <string-name><given-names>G. D.</given-names> <surname>Duplessis</surname></string-name>,
          <article-title>Toward a context-based approach to assess engagement in human-robot interaction</article-title>, in:
          <source>Dialogues with Social Robots</source>, Springer,
          <year>2017</year>, pp. <fpage>293</fpage>-<lpage>301</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>[47]
          <string-name><given-names>E.</given-names> <surname>Thelisson</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Sharma</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Salam</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Dignum</surname></string-name>,
          <article-title>The …</article-title>, in:
          <source>Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems</source>,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>[48]
          <string-name><given-names>J.</given-names> <surname>Greczek</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Short</surname></string-name>,
          <string-name><given-names>C. E.</given-names> <surname>Clabaugh</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Swift-Spong</surname></string-name>, …, in:
          <source>AAAI Fall Symposium Series</source>,
          <year>2014</year>, pp. <fpage>1</fpage>-<lpage>3</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>[49]
          <string-name><given-names>C.</given-names> <surname>Kasari</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sturm</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Shih</surname></string-name>,
          <article-title>SMARTer approach to personalizing intervention for children with autism spectrum disorder</article-title>,
          <source>Journal of Speech, Language, and Hearing Research</source> <volume>61</volume> (<year>2018</year>) <fpage>2629</fpage>-<lpage>2640</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>[50]
          <string-name><given-names>D. B.</given-names> <surname>Jayagopi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Sheikhi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Klotz</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wienke</surname></string-name>,
          <string-name><given-names>J.-M.</given-names> <surname>Odobez</surname></string-name>, et al., …,
          Technical Report, Bielefeld University,
          <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>[51]
          <string-name><given-names>P.</given-names> <surname>Wittenburg</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Brugman</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Russel</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Klassmann</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Sloetjes</surname></string-name>,
          <article-title>ELAN: a professional framework for multimodality research</article-title>, in:
          <source>Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC)</source>,
          <year>2006</year>, pp. <fpage>1556</fpage>-<lpage>1559</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>[52]
          <string-name><given-names>M. F.</given-names> <surname>Mason</surname></string-name>,
          <string-name><given-names>E. P.</given-names> <surname>Tatkow</surname></string-name>,
          <string-name><given-names>C. N.</given-names> <surname>Macrae</surname></string-name>,
          <article-title>The look of love: Gaze shifts and person perception</article-title>,
          <source>Psychological Science</source> <volume>16</volume> (<year>2005</year>) <fpage>236</fpage>-<lpage>239</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>[53]
          <string-name><given-names>C. L.</given-names> <surname>Sidner</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Lesh</surname></string-name>,
          <article-title>Engagement when looking: behaviors for robots when collaborating with people</article-title>, in:
          <source>Diabruck: Proceedings of the 7th Workshop on the Semantics and Pragmatics of Dialogue</source>, University of Saarland,
          <year>2003</year>, pp. <fpage>123</fpage>-<lpage>130</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>[54]
          <string-name><given-names>S.</given-names> <surname>Baron-Cohen</surname></string-name>,
          <article-title>Mindblindness: An essay on autism and theory of mind</article-title>, MIT press,
          <year>1997</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>