Distinguishing Engagement Facets: An Essential Component for AI-based Interactive Healthcare

Hanan Salam 1,2,*
1 New York University Abu Dhabi, PO Box 129188, Saadiyat Island, Abu Dhabi, United Arab Emirates
2 Social Machines & Robotics (SMART) Lab, Center of AI & Robotics (CAIR)
Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
* Corresponding author: hanan.salam@nyu.edu, https://wp.nyu.edu/smartlab/, ORCID 0000-0001-6971-5264

Abstract
Engagement in Human-Machine Interaction is the process by which entities participating in the interaction establish, maintain, and end their perceived connection. It is essential to monitor the engagement state of patients in various AI-based interactive healthcare paradigms. This includes medical conditions that alter social behavior, such as Autism Spectrum Disorder (ASD) or Attention-Deficit/Hyperactivity Disorder (ADHD). Engagement is a multi-faceted construct composed of behavioral, emotional, and mental components. Previous research has neglected this multi-faceted nature of engagement and focused on the detection of an engagement level or a binary engagement label. In this paper, a system is presented to distinguish these facets using contextual and relational features, which can facilitate further fine-grained analysis. Several machine learning classifiers, including traditional and deep learning models, are compared for this task. An F-score of 0.74 was obtained on a balanced dataset of 22242 instances with neural network-based classification. The proposed framework shall serve as a baseline for further research on engagement facets recognition and its integration in socially assistive robotic applications.

Keywords
Engagement Recognition, Interactive Healthcare, Affective Computing, Human-Robot Interaction

1. Introduction

During the last decade, researchers have demonstrated interest in enhancing the capabilities of robots to assist humans in their daily life. This requires incorporating social intelligence within the robots, which involves understanding different states of engagement.

Research in Human-Machine Interaction (HMI) has shown that engagement is a multi-faceted construct that consists of different components [1]. It is therefore important to be able to distinguish the facets before performing a deeper analysis. Corrigan et al. [2] demonstrated that engagement is mainly composed of cognitive and affective components, which are manifested by attention and enjoyment. According to O'Brien et al. [3], engagement is characterized by features such as challenge, positive affect, endurability, aesthetic and sensory appeal, attention, feedback, variety/novelty, interactivity, and perceived user control. In the context of youth engagement in activities, Ramey et al. [4] proposed a model of psychological engagement with three components: cognitive (e.g., thinking or concentrating), affective (e.g., enjoyment), and relational (e.g., connectedness to something). Salam et al. [5] showed that the mental and emotional states of the user related to engagement vary as a function of the current interaction context. These studies suggest that when attempting automatic inference of a user's engagement state, it is important to consider this multi-faceted nature.
Application areas of Assistive Robotics include elderly care [6], helping people with medical conditions that alter social behavior, such as children with Autism Spectrum Disorder (ASD) [7] or people with Attention-Deficit/Hyperactivity Disorder (ADHD) [8], and coaching and tutoring [9, 10]. Fasola and Mataric [11] presented a Socially Assistive Robot (SAR) system designed to engage elderly users in physical exercise. Different variants of the robot's verbal instructions were used to minimize the robot's perceived verbal repetitiveness and thus maintain the users' engagement. Previous engagement detection approaches revolve around a binary classification approach (engaged vs. not engaged) [12, 13] or a multi-class approach (engagement level) [14, 15]. However, the multi-faceted nature of engagement is seldom considered.

In this paper, a framework that takes into account the multi-faceted nature of engagement is proposed. Engagement is modeled in terms of a spectrum of engagement states: mental, behavioural, and emotional. This is the first engagement framework of its kind to propose such a classification of the facets of engagement. Such analysis allows informing the implementation of fine-grained strategies based on a deeper understanding of the user's states. We present a preliminary evaluation of this approach on an off-line multi-party HRI corpus. The corpus was chosen due to the relevance of its interaction scenario (educational followed by competitive context) to the use of AI-based interactive healthcare systems, for instance SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario can be adopted for the characterization of ADHD, since such a context would solicit attention cues which are normally impaired in ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform the initial validation of the framework.

2. Related Work

Engagement in Human-Robot Interaction is defined as the process by which two (or more) participants establish, maintain, and end their perceived connection [16, 17]. Andrist et al. [18] analyzed an HRI dataset in terms of interaction type, quality, problem types, and the system's failure points causing problems. Failure in the engagement component was found to be among the major identified causes of problems during the interaction. This confirms that a highly performing engagement model is essential for the success of any HRI scenario [18].

Bohus and Horvitz [19] pioneered research on engagement in multi-party interaction. They explored disparate engagement strategies to allow robots to engage simultaneously with multiple users. There are multifarious studies based on multi-party interactions. Oertel et al. [20] studied, at both the individual and the group level, the relationship between the participants' gaze and speech behavior. Leite et al. [13] experimented with the generalization capacity of an engagement model: it was trained on single-party and tested on multi-party scenarios, and the opposite setting was also considered. Salam et al. [21] conducted a study on engagement recognition in a triadic HRI scenario and showed that it is possible to infer a participant's engagement state based on the other participants' cues.

Most engagement inference approaches revolved around the identification of a person's intention to engage. There have also been studies to detect whether the person is engaged/disengaged. Benkaouar et al. [22] presented a system to detect disparate engagement phases, including intention to engage, engaged, and disengaged. Foster et al. [12] attempted to detect whether a person intends to engage, which is a bi-class problem. Leite et al. [13] attempted to identify disengagement in both group and individual interactions. Ben-Youssef et al. [23] also presented a system dedicated to a similar cause.
There are different works which focused on detecting different levels of engagement of a user. Michalowski et al. [24] distinguished different levels of engagement, namely present, interacting, engaged, and attending. A system to distinguish two classes of engagement, namely medium-high to high and medium-high to low engagement, was presented by [14]. Bednarik et al. [25] distinguished disparate states of conversational engagement, including no interest, following, responding, conversing, influencing, and managing. They also modelled a bi-class problem with low/high conversational engagement levels. Oertel et al. [20] distinguished 4 classes of group involvement, namely high, low, leader, steering the conversation, and group is forming itself. Two models were developed in [26], focusing on not-engaged/engaged and not-engaged/normally-engaged/very-engaged state distinction. Frank et al. [27] differentiated 6 different states of engagement, namely disengagement, involved engagement, relaxed engagement, intention to act, action, and involved action.

Recently, [28] stated that engagement in HRI should be treated as multi-faceted. Formulating a binary problem (engaged vs. not engaged) or a multi-class problem (engagement level) ignores this multi-faceted nature of engagement. Taking this multi-faceted nature into consideration is very important for the design of intelligent social agents. For instance, it can influence the engagement strategies implemented within the agent's architecture. Some studies attempted to implement different strategies related to task and social engagement. For instance, [29] implemented a task engagement strategy, which focuses on the task at hand and has users meta-cognitively reflect on the robot's performance, and a social engagement strategy, which focuses on their enjoyment and has them meta-cognitively reflect on their emotions with respect to the activity and the group interactions.

Different features have been used to distinguish engagement states, including contextual [30, 14, 21], attentional [31, 32], and affective [14, 12, 26, 33] features, to name a few. Salam et al. [34] used personality to detect both individual and group engagement. [35, 36] combined different aspects such as backchannels, eye gaze, and head-nodding-based features to detect engagement level. Ben-Youssef et al. [23] combined several attributes such as speech and facial expressions, gaze and head motion, and distance to the robot to identify disengagement. Masui et al. [33] worked with facial Action Units and physiological responses.

Recent approaches explored deep learning architectures for the detection of engagement. Dewan et al. [26] used person-independent edge features and Kernel Principal Component Analysis (KPCA) within a deep learning framework to detect online learners' engagement using facial expressions. [37] used CNN and LSTM networks to predict engagement level. [38] proposed adaptive deep architectures for different user groups for predicting engagement in robot-mediated collaborative learning.
Contextual information has been used in social signal processing for quite some time. Kapoor et al. [39] combined context features in the form of game state with facial and posture features in an online educative scenario. Martinez and Yannakakis [40] used sequence mining for the prediction of computer game players' affective states. Castellano et al. [14] explored task and social-based contextual features. In another instance, the authors [41] used the same contextual features for distinguishing interaction quality.

Relational features have proven to be useful in multifarious instances. Curhan et al. [42] used dyad-based cues for predicting negotiation outcomes. Jayagopi et al. [43] adhered to group-based cues to understand typical behavior in small groups. Nguyen et al. [44] extracted relational audio-visual cues to detect the suitability of an applicant in a job interview. The features included audio and visual back-channeling, nodding while speaking, and mutual short utterances and nods. Similarly, [45] used a "looking-while-speaking" feature to understand personality impressions from conversational vlogs extracted from YouTube.

So far, context has been insufficiently investigated in the avenue of affective and cognitive states. Devillers et al. [46] highlight the importance of context in the assessment of engagement. They identified paralinguistic, linguistic, non-verbal, interactional, and specific emotional and mental state-based features as very important for engagement prediction. In this work, we investigate relational and contextual features for the recognition of a spectrum of engagement states. The features are used in isolation as well as in combination to assess their engagement state distinction capability. These features have not been combined previously for detecting engagement facets. Compared to previous works, the proposed features model the interaction context, the robot's behavior, and the behavioral relation between the participant in question and the other entities of the interaction.

3. Need for Engagement Recognition in Interactive Healthcare

Technological advancements have propagated to every field, and there have always been efforts to automate tasks. Healthcare is one of the primal needs of society, and it has also been touched by technology [47]. Several interactive systems have been developed to aid automated healthcare and the well-being of people with medical conditions. In [8], an SAR is proposed whose aim is to help children with ADHD improve their educational outcome through social interaction with a robot. Another educational SAR was presented by [48]; it was targeted towards providing assistance in personalizing education in classrooms. Children with Autism Spectrum Disorder are a target population for such personalized teaching systems [49]. However, most existing systems do not include a user engagement analysis module. Such Socially Assistive systems can largely benefit from a fine-grained analysis of engagement, which would make them more human-like.
There has also been interest in automated screening and consultation to detect problems of the body and mind at an early stage. This can help to reduce the initial load on doctors. It is very important for patients to feel that they are interacting with peers rather than with a machine, and such systems need to process both audio and visual cues in order to properly understand patients. While the patients are interacting with the automated systems, several states of engagement need to be monitored simultaneously, including the level of concentration, different reactions, and spontaneity, to name a few. Such states of engagement portray useful information about a patient's health. These engagement states can be categorized into a broader spectrum of behavioral, mental, and emotional states. Distinguishing the engagement facet is important at the outset for a deeper analysis. This can pave the way for systems that better understand the condition of patients by reading their body language rather than merely matching spoken symptoms. This will be especially useful in treating and understanding mental conditions where body language is a vital aspect. In the case of psychological problems, patients are often engaged by doctors in conversations regarding disparate aspects, wherein the patient's body language serves as a vital pointer towards the mental condition.

4. Proposed Framework

The proposed framework is composed of 3 steps. First, a multi-party HRI corpus is annotated in terms of engagement facets. Then, different contextual and relational features are extracted. Finally, different standard classifiers are used to classify the different engagement facets. Fig. 1 presents an illustration of the proposed framework.

Figure 1: Illustration of the proposed framework with an Artificial Neural Network classifier (input, feature extraction with relational, contextual, and relational+contextual feature sets, and ANN-based classification into the behavioural, emotional, and mental facets).

4.1. Data Corpus

In this section, the data corpus along with the disparate engagement annotations is discussed.

4.2. Interaction Scenario & Modalities

We use 4 interactions of 8 participants from the conversational HRI data corpus 'Vernissage' [50]. It is a multi-party interaction between the humanoid robot NAO (https://www.softbankrobotics.com/emea/en/nao) and 2 participants. The interaction has different contexts which can mainly be differentiated into 2 parts. The 1st is where the robot describes several paintings hung on a wall (informative/educational context). In the 2nd, the robot performs a quiz with the volunteers related to art and culture (competitive context). This was done in order to encompass different variations of the engagement states.

This corpus was chosen since its interaction scenario is relevant to the use of SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario like the one in the first part of the Vernissage scenario can be adopted for the characterization of ADHD, since an educational/informative scenario would solicit attention cues which are normally impaired in ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform our initial validation of the framework.
The average length per interaction is nearly 11 minutes. NAO's internal camera was used to record the clips, providing the front view. 3 other cameras were also used to get the left, right, and rear views. Fig. 2 shows the organization of the recording room. The corpus has annotations for the non-verbal behaviors of the participants. It also contains the robot's speech and actions in the robot's log file.

Figure 2: Organisation of the recording room. NAO (orange), participants' typical positions (gray circles), cameras (HD: red, VICON: blue), wizard feedback (green), paintings (green lines), 2 windows (blue lines), VICON coordinate system (red), head pose calibration positions (P1 and P2).

4.3. Engagement Annotations

Engagement labels were assigned to 3 categories, namely mental, behavioral, and emotional. These were annotated when the participants manifested one of the following states: thinking, listening, positive/negative reaction, responding, waiting for feedback, concentrating, and listening to the other participant. The annotations were performed by 2 people with the aid of the ELAN annotation tool [51] (https://tla.mpi.nl/tools/tla-tools/elan/). They watched every video 2 times (once from the perspective of each participant). Discrete segments were annotated, and a segment was ended as soon as a change was observed. The mean inter-rater Cronbach's Alpha coefficient was 0.93, which points to the reliability of the annotations. The details of each category are as follows.

Mental states – A segment was assigned a mental state label when the participant manifested one of the following mental states:
• Listening (EL): The participant is listening to NAO;
• WaitingFeedback (EWF): The participant is waiting for NAO's feedback after he/she has answered a question;
• Thinking (ETh): The participant is thinking about the response to a question asked by NAO;
• Concentrating (EC): The participant is concentrating with NAO;
• ListeningPerson2 (ELP2): The participant is listening to the other participant who is answering NAO.

Behavioral states – A segment was assigned a behavioral state label when the participant manifested the following behavioral state:
• Responding (ER): The participant is responding to NAO.

Emotional states – A segment was assigned an emotional state label when the participant manifested one of the following emotional states:
• PositiveReaction (EPR): The participant shows a positive reaction to NAO;
• NegativeReaction (ENR): The participant shows a negative reaction to NAO.

The number of annotated instances for each class is presented in Table 1.

Table 1
Details on the number of annotated instances in each class
State         Number of instances
Behavioral    10331
Emotional     7414
Mental        80902
Total         98647
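To make the annotation scheme concrete, the following minimal Python sketch maps the fine-grained state labels listed above onto the three facet classes used as classification targets. The dictionary follows the categories defined in this section; the function name and the assumption that segments arrive as ELAN-exported labels are illustrative, not part of the original pipeline.

```python
# Hypothetical helper: map the fine-grained annotation labels listed above
# onto the three engagement facets used as classification targets.
# Segment labels are assumed to come from an ELAN export of the annotations.

FACET_OF_LABEL = {
    # mental states
    "EL": "mental", "EWF": "mental", "ETh": "mental", "EC": "mental", "ELP2": "mental",
    # behavioral state
    "ER": "behavioral",
    # emotional states
    "EPR": "emotional", "ENR": "emotional",
}

def facet_of(segment_label: str) -> str:
    """Return the engagement facet (mental/behavioral/emotional) of a fine-grained label."""
    return FACET_OF_LABEL[segment_label]

# Example: a segment annotated as WaitingFeedback (EWF) contributes a 'mental' instance.
assert facet_of("EWF") == "mental"
```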
4.4. Extracted Features

In this study, we used the annotated cues from the Vernissage corpus. Moreover, we extracted additional metrics computed from the existing ones. The features were categorized into two categories: 1) contextual and 2) relational. Contextual features deal either with the different entities of the interaction, such as the robot's utterance, addressee, and topic of speech, or with behavioral aspects of the participant that concern the interaction context, such as visual focus of attention and addressee. Relational features encode the behavioral relation between the participants and the robot. Fig. 3 illustrates the feature groups used in our study.

Figure 3: Features illustration: contextual features (Robot, Participant); relational features (Participant-Robot, Participant1-Participant2).

4.4.1. Contextual Features

Interaction between entities involves both the entities and their connection. When inferring the engagement state of an interacting person, we consider the behavior of that person as well as our own behavior. An automated engagement identification system should do the same. Consequently, we employ different contextual features that describe the participant's behavior with respect to the other entities. Moreover, for each dialogue of the robot, we extract the robot's utterance, addressee, and topic of speech.

Participant:
1) Visual Focus Of Attention (VFOA): Gaze in human-human social interactions is considered the primary cue of attention [52, 53]. We use the VFOA ground truth of every participant, which was annotated with 9 labels.
2) VFOA Shifts: Gaze shifts indicate people's engagement/disengagement with specific environmental stimuli [54]. We define a VFOA shift as the moment when a participant shifts attention to a different subject. This feature is binary and is computed from the VFOA labels.
3) Addressee: When addressing somebody, we are engaged with him/her. Similarly, in the context of HRI, when a participant addresses someone other than the robot, he/she is disengaged from the robot. Addressee annotations are taken from the corpus and are annotated into 6 classes: {NoLabel, Nao, Group, PRight, PLeft, Silence}.

Robot: Starting from the robot's conversation logs, the following were extracted.
1) Utterances: The labels {Speech, Silence} were assigned to frames depending on the robot's speech activity.
2) Addressee: The addressee of the robot was detected using predefined words from its speech. The following labels were assigned: {Person1, Person2, GroupExplicit, GroupPerson1, GroupPerson2, Person1Group, Person2Group, Group, Silence}. 'GroupExplicit' refers to segments where the robot was explicitly addressing both participants. 'GroupPersonX', X ∈ {1, 2}, corresponds to segments where the robot addresses the group and then 'PersonX', while 'PersonXGroup' represents the inverse.
3) Topic of Speech: This was identified using a keyword set related to the disparate paintings available in the scene: {manray, warhol, arp, paintings}. Frames were assigned labels based on these keywords.
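As an illustration of two of the contextual features described above, the sketch below derives the binary VFOA-shift feature from a frame-level VFOA label sequence and assigns a topic-of-speech label to a robot utterance using the painting keyword set. The frame-level list representation, function names, and the simple word-matching rule are assumptions made for the example; the paper does not specify the exact implementation.

```python
import numpy as np

def vfoa_shifts(vfoa_labels: list[str]) -> np.ndarray:
    """Binary VFOA-shift feature: 1 at frames where the attention target changes."""
    shifts = np.zeros(len(vfoa_labels), dtype=int)
    for t in range(1, len(vfoa_labels)):
        shifts[t] = int(vfoa_labels[t] != vfoa_labels[t - 1])
    return shifts

TOPIC_KEYWORDS = {"manray", "warhol", "arp", "paintings"}  # keyword set from the corpus

def topic_of_speech(utterance: str) -> str:
    """Label a robot utterance with the painting keyword it mentions, or 'none'."""
    words = utterance.lower().split()
    for kw in TOPIC_KEYWORDS:
        if kw in words:
            return kw
    return "none"

# Example usage (illustrative frame sequence and utterance):
print(vfoa_shifts(["Nao", "Nao", "PLeft", "PLeft", "Nao"]))  # [0 0 1 0 1]
print(topic_of_speech("This painting is by Warhol"))         # warhol
```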
4.4.2. Relational Features

We extract a set of relational features describing the synchrony and alignment of the robot's and participants' behaviors. These include, among others, mutual gaze and laughter. A logical AND operation was applied between the participants' and robot's feature time series to obtain mutual event occurrences. Fig. 4 shows an example of participants' mutual laughter extraction.

Figure 4: Example of relational cues extraction. This corresponds to participants' mutual laughter detection using a logical AND over the laughter time series.

Participant-Robot Features:
1) Gaze-Speech Alignment: We extracted events where a participant looks at objects corresponding to the robot's topic of speech. This indicates that the participant is listening to the robot and is interested in what it is saying.
2) P1 Talks to P2/Robot Speaks: This refers to events where the participants speak with each other during the robot's speech. This may signal a disengagement behavior.

Person1-Person2 Features:
1) Participants Mutual Looks: This refers to events where the participants look at each other. Though this may signal disengagement, it may also signal engagement, as it might be a reaction to the robot's speech.
2) Participants Mutual Laughter: This refers to events where the two participants laugh together. This represents a reaction to the robot's speech.
3) P1 Looks at P2/P2 Talks to Robot: This represents events where the passive participant looks at the active participant while he/she is talking to the robot. Though this may appear to be disengagement, analysis revealed the inverse.

The total number of features is 39: 34 contextual features and 5 relational features.
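The relational cues are described as co-occurrences obtained with a logical AND over binary behaviour time series (Fig. 4). Below is a minimal sketch of this operation, assuming frame-aligned binary arrays; the array names and example values are illustrative.

```python
import numpy as np

def mutual_event(series_a: np.ndarray, series_b: np.ndarray) -> np.ndarray:
    """Frame-wise co-occurrence of two binary behaviour time series (logical AND),
    e.g. participants' mutual laughter or a participant's gaze during robot speech."""
    return np.logical_and(series_a.astype(bool), series_b.astype(bool)).astype(int)

# Illustrative laughter time series for the two participants (1 = laughing at that frame).
laugh_p1 = np.array([0, 1, 1, 1, 0, 0, 1])
laugh_p2 = np.array([0, 0, 1, 1, 1, 0, 0])
print(mutual_event(laugh_p1, laugh_p2))  # [0 0 1 1 0 0 0]
```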
4.5. Engagement Facets Classification

As this is the first work that proposes the classification of engagement facets, namely behavioural, emotional, and mental, it is important to establish a classification baseline. We compare different classifiers for the defined engagement facets classification task. The classifiers include traditional machine learning classifiers such as Bayesian Network (Bayes Net), Naive Bayes, Linear Logistic Regression (LLR), Support Vector Machine (SVM), Radial Basis Function Network (RBF Net), and a simple Artificial Neural Network (ANN). A deep learning classifier, namely a Recurrent Neural Network (RNN), was also used for this classification task. This helps establish an initial understanding of whether traditional machine learning classifiers are sufficient for the task, or whether more sophisticated classification techniques such as deep learning methods are needed.

5. Results and Discussion

The proposed framework is evaluated using 5-fold cross-validation. As the data was highly imbalanced, a subset of the data was drawn with an equal number of instances per class, totalling 22242 instances. The combined features (contextual + relational) for this dataset were used to train the different classifiers presented in Section 4.5.
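A minimal sketch of the baseline protocol described in Sections 4.5 and 5 follows: balance the classes by subsampling and compare several of the listed classifiers with 5-fold cross-validation. The scikit-learn estimators are used here purely as stand-in implementations (the paper does not name a specific library), and X/y are assumed to already hold the 39 numerically encoded features and the facet labels; Bayes Net and RBF Net would require additional tooling and are omitted.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def balance_classes(X, y, seed=0):
    """Subsample every class down to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], n, replace=False) for c in classes])
    return X[idx], y[idx]

CLASSIFIERS = {
    "Naive Bayes": GaussianNB(),
    "LLR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "ANN": MLPClassifier(max_iter=500),
}

def compare(X, y):
    Xb, yb = balance_classes(X, y)
    for name, clf in CLASSIFIERS.items():
        acc = cross_val_score(clf, Xb, yb, cv=5).mean()  # 5-fold cross-validation accuracy
        print(f"{name}: {acc:.4f}")
```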
5.1. Comparative performance analysis of standard classifiers

Table 2 presents the results of training the different classifiers on the combined features. The best performing classifier was the ANN with an accuracy of 74.57%, followed by the Linear Logistic Regression model (70.35%). The performances of the SVM and the Bayes Net were very close (around 69.6%), followed by Naive Bayes (68.61%). Surprisingly, the lowest accuracy of 57.68% was obtained using the deep RNN. This might be due to the fact that the number of samples is not sufficient for training deep neural networks. Consequently, traditional machine learning approaches performed better. It might be worth investigating deep neural networks in future work using a higher number of instances.

Table 2
Comparative analysis of the performance of standard classifiers on the balanced dataset.
Classifier     Accuracy (%)
RNN            57.68
RBF Net        65.13
Naive Bayes    68.61
Bayes Net      69.61
SVM            69.68
LLR            70.35
ANN            74.57

5.2. Performance analysis on engagement facets

We analyse the performance of the best performing classifier (ANN) on the different engagement facets (behavioral, emotional, mental). The corresponding confusion matrix is presented in Table 3. The values of the different performance metrics (true positive rate, false positive rate, precision, recall, and F-score) for each of the classes are presented in Table 4.

Table 3
Confusion matrix for the balanced dataset (rows: predicted class, columns: actual class).
              Behavioral   Emotional   Mental
Behavioral    6347         1318        916
Emotional     384          4893        1153
Mental        683          1203        5345

Table 4
Class-wise values of the performance metrics on the balanced dataset. True Positive Rate (TPR), False Positive Rate (FPR).
Metric       Behavioral   Emotional   Mental
TPR          0.856        0.660       0.721
FPR          0.151        0.104       0.127
Precision    0.740        0.761       0.739
Recall       0.856        0.660       0.721
F-score      0.794        0.707       0.730
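The class-wise metrics in Table 4 can be recomputed directly from the confusion matrix in Table 3 (rows taken as predicted classes, columns as actual classes). The short sketch below does exactly that and reproduces the reported values, e.g. TPR ≈ 0.856 and F-score ≈ 0.794 for the behavioral class.

```python
import numpy as np

# Confusion matrix from Table 3 (rows: predicted class, columns: actual class),
# class order: behavioral, emotional, mental.
cm = np.array([[6347, 1318,  916],
               [ 384, 4893, 1153],
               [ 683, 1203, 5345]])

for i, name in enumerate(["Behavioral", "Emotional", "Mental"]):
    tp = cm[i, i]
    fp = cm[i].sum() - tp        # predicted as class i but actually another class
    fn = cm[:, i].sum() - tp     # actually class i but predicted as another class
    tn = cm.sum() - tp - fp - fn
    tpr = tp / (tp + fn)         # recall
    fpr = fp / (fp + tn)
    prec = tp / (tp + fp)
    f1 = 2 * prec * tpr / (prec + tpr)
    print(f"{name}: TPR={tpr:.3f} FPR={fpr:.3f} Precision={prec:.3f} F-score={f1:.3f}")
# Output matches Table 4 (Behavioral: 0.856 / 0.151 / 0.740 / 0.794, etc.).
```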
It can be noted that the best performance was obtained for the behavioral class, with an F-score of 0.794. This was followed by the mental class, for which an F-score of 0.730 was obtained. The lowest performance was obtained for the emotional class, with an F-score of 0.707. This lower performance for the emotional class might be explained by the fact that the current features might not be highly correlated with the emotional states, and more correlated with the other engagement facets. It might be worth investigating other relevant features in the future.

Looking at the confusion matrix, we can see that the most confused pair was behavioral-emotional: 1318 instances were predicted as behavioral when they actually belonged to the emotional class. This is followed by the mental-emotional pair, with 1203 instances misclassified as mental when their actual label was emotional. Similarly, 1153 mental instances were misclassified as emotional. The high confusion between the mental and emotional engagement states is expected, as these states might exhibit similar non-verbal cues. The confusion between the behavioral and emotional states is less evident. This confusion might be due to the used features, which are not sufficient to precisely predict the emotional states.

6. Conclusions and future work

In this paper, we proposed a system to detect different facets of engagement states: mental, emotional, and behavioral. This is the first engagement framework of its kind to propose such a classification of engagement facets, which is essential for a deeper analysis of the user's engagement by machines. In the context of AI-based healthcare systems such as socially assistive robots, such fine-grained analysis would improve performance and facilitate adaptive interventions. For instance, recognizing whether the user's engagement is emotional, behavioral, or mental might better inform AI-based healthcare systems, especially those that rely on interactive systems (e.g. SAR for ADHD or ASD). The proposed framework was validated on an HRI corpus exhibiting educational and competitive contexts, which are relevant to AI-based interactive systems. The preliminary results show that it is possible to classify engagement facets with a relatively acceptable accuracy. These results shall serve as a baseline for the development of more accurate systems. In the future, we plan to validate the framework on a larger dataset that exhibits an SAR scenario. We plan to work with individual features to improve the system's performance and to perform a fine-grained analysis of the different states. We will also explore deep learning-based and unsupervised approaches towards the detection of engagement state types and, thereafter, finer classification. Deep learning will be used not only for data classification but also for feature extraction.

Acknowledgments

This work is supported in part by the NYUAD Center for Artificial Intelligence and Robotics, funded by Tamkeen under the NYUAD Research Institute Award CG010.
References

[1] H. Salam, O. Celiktutan, H. Gunes, M. Chetouani, Automatic context-driven inference of engagement in HMI: A survey, arXiv preprint arXiv:2209.15370 (2022).
[2] L. J. Corrigan, C. Peters, D. Küster, G. Castellano, Toward Robotic Socially Believable Behaving Systems - Volume I: Modeling Emotions, Springer International Publishing, 2016, pp. 29–51.
[3] H. L. O'Brien, E. G. Toms, What is user engagement? A conceptual framework for defining user engagement with technology, Journal of the American Society for Information Science and Technology 59 (2008) 938–955.
[4] H. L. Ramey, L. Rose-Krasnor, M. A. Busseri, S. Gadbois, A. Bowker, L. Findlay, Measuring psychological engagement in youth activity involvement, Journal of Adolescence 45 (2015) 237–249.
[5] H. Salam, M. Chetouani, A multi-level context-based modeling of engagement in human-robot interaction, in: 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 3, IEEE, 2015, pp. 1–6.
[6] E. Broadbent, C. Jayawardena, N. Kerse, R. Stafford, B. MacDonald, Human-robot interaction research to improve quality of life in elder care – an approach and issues, in: 25th Conference on Artificial Intelligence (AAAI), Workshop on Human-Robot Interaction in Elder Care, 2011, pp. 13–19.
[7] D. Feil-Seifer, U. Viterbi, et al., Development of socially assistive robots for children with autism spectrum disorders, Technical Report, Center for Robotics and Embedded Systems, 2009.
[8] M. Fridin, Y. Yaakobi, Educational robot for children with ADHD/ADD, in: Architectural Design, International Conference on Computational Vision and Robotics, Bhubaneswar, India, 2011, pp. 1–7.
[9] J. Greczek, M. Matarić, Expanding the computational model of graded cueing: Robots encouraging health behavior change, in: 29th AAAI Conference on Artificial Intelligence, 2014, pp. 1–2.
[10] X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2879–2886.
[11] J. Fasola, M. Mataric, A socially assistive robot exercise coach for the elderly, Journal of Human-Robot Interaction 2 (2013) 3–32.
[12] M. E. Foster, A. Gaschler, M. Giuliani, How can I help you? Comparing engagement classification strategies for a robot bartender, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ACM, 2013, pp. 255–262.
[13] I. Leite, M. McCoy, D. Ullman, N. Salomons, B. Scassellati, Comparing models of disengagement in individual and group interactions, in: Proceedings of the 10th Annual ACM/IEEE International Conference on Human-Robot Interaction, ACM, 2015, pp. 99–105.
[14] G. Castellano, I. Leite, A. Pereira, C. Martinho, A. Paiva, P. W. McOwan, Detecting engagement in HRI: An exploration of social and task-based context, in: International Conference on Privacy, Security, Risk and Trust and International Conference on Social Computing, IEEE, 2012, pp. 421–428.
[15] C. Peters, S. Asteriadis, K. Karpouzis, Investigating shared attention with a virtual agent using a gaze-based interface, Journal on Multimodal User Interfaces 3 (2010) 119–130.
[16] C. L. Sidner, C. D. Kidd, C. Lee, N. Lesh, Where to look: a study of human-robot engagement, in: 9th International Conference on Intelligent User Interfaces, ACM, 2004, pp. 78–84.
[17] C. L. Sidner, C. Lee, C. D. Kidd, N. Lesh, C. Rich, Explorations in engagement for humans and robots, Artificial Intelligence 166 (2005) 140–164.
[18] S. Andrist, D. Bohus, E. Kamar, E. Horvitz, What went wrong and why? Diagnosing situated interaction failures in the wild, in: International Conference on Social Robotics, Springer, 2017, pp. 293–303.
[19] D. Bohus, E. Horvitz, Models for multiparty engagement in open-world dialog, in: Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, 2009, pp. 225–234.
[20] C. Oertel, G. Salvi, A gaze-based method for relating group involvement to individual engagement in multimodal multiparty dialogue, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ACM, 2013, pp. 99–106.
[21] H. Salam, M. Chetouani, Engagement detection based on multi-party cues for human-robot interaction, in: International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015, pp. 341–347.
[22] W. Benkaouar, D. Vaufreydaz, Multi-sensors engagement detection with a robot companion in a home environment, in: Workshop on Assistance and Service Robotics in a Human Environment at the IEEE International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 45–52.
[23] A. Ben-Youssef, G. Varni, S. Essid, C. Clavel, On-the-fly detection of user engagement decrease in spontaneous human-robot interaction using recurrent and deep neural networks, International Journal of Social Robotics 11 (2019) 815–828.
[24] M. P. Michalowski, S. Sabanovic, R. Simmons, A spatial model of engagement for a social robot, in: 9th IEEE International Workshop on Advanced Motion Control, IEEE, 2006, pp. 762–767.
[25] R. Bednarik, S. Eivazi, M. Hradis, Gaze and conversational engagement in multiparty video conversation: An annotation scheme and classification of high and low levels of engagement, in: Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, ACM, 2012, pp. 1–6.
[26] M. A. A. Dewan, F. Lin, D. Wen, M. Murshed, Z. Uddin, A deep learning approach to detecting engagement of online learners, in: IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, IEEE, 2018, pp. 1895–1902.
[27] M. Frank, G. Tofighi, H. Gu, R. Fruchter, Engagement detection in meetings, arXiv preprint arXiv:1608.08711 (2016).
[28] L. Devillers, S. Rosset, G. D. Duplessis, L. Bechade, Y. Yemez, B. B. Turker, M. Sezgin, E. Erzin, K. El Haddad, S. Dupont, et al., Multifaceted engagement in social interaction with a machine: The JOKER project, in: 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), IEEE, 2018, pp. 697–701.
[29] L. El Hamamsy, W. Johal, T. Asselborn, J. Nasir, P. Dillenbourg, Learning by collaborative teaching: An engaging multi-party cowriter activity, in: 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), IEEE, 2019, pp. 1–8.
[30] A. Kapoor, R. W. Picard, Y. Ivanov, Probabilistic combination of multiple modalities to detect interest, in: Proceedings of the 17th International Conference on Pattern Recognition, volume 3, IEEE, 2004, pp. 969–972.
[31] S.-S. Yun, M.-T. Choi, M. Kim, J.-B. Song, Intention reading from a fuzzy-based human engagement model and behavioural features, International Journal of Advanced Robotic Systems (2012).
[32] F. Papadopoulos, L. J. Corrigan, A. Jones, G. Castellano, Learner modelling and automatic engagement recognition with robotic tutors, in: Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013, pp. 740–744.
[33] K. Masui, G. Okada, N. Tsumura, Measurement of advertisement effect based on multimodal emotional responses considering personality, ITE Transactions on Media Technology and Applications 8 (2020) 49–59.
[34] H. Salam, O. Celiktutan, I. Hupont, H. Gunes, M. Chetouani, Fully automatic analysis of engagement and its relationship to personality in human-robot interactions, IEEE Access 5 (2016) 705–721.
[35] K. Inoue, D. Lala, K. Takanashi, T. Kawahara, Engagement recognition in spoken dialogue via neural network by aggregating different annotators' models, in: Interspeech, 2018, pp. 616–620.
[36] K. Inoue, D. Lala, K. Takanashi, T. Kawahara, Latent character model for engagement recognition based on multimodal behaviors, in: 9th International Workshop on Spoken Dialogue System Technology, Springer, 2019, pp. 119–130.
[37] F. Del Duchetto, P. Baxter, M. Hanheide, Are you still with me? Continuous engagement assessment from a robot's point of view, arXiv preprint arXiv:2001.03515 (2020).
[38] V. V. Chithrra Raghuram, H. Salam, J. Nasir, B. Bruno, O. Celiktutan, Personalized productive engagement recognition in robot-mediated collaborative learning, in: Proceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 632–641.
[39] A. Kapoor, R. W. Picard, Multimodal affect recognition in learning environments, in: 13th Annual ACM International Conference on Multimedia, ACM, 2005, pp. 677–682.
[40] H. P. Martínez, G. N. Yannakakis, Mining multimodal sequential patterns: a case study on affect detection, in: Proceedings of the 13th International Conference on Multimodal Interfaces, ACM, 2011, pp. 3–10.
[41] G. Castellano, I. Leite, A. Paiva, Detecting perceived quality of interaction with a robot using contextual features, Autonomous Robots (2016) 1–17.
[42] J. R. Curhan, A. Pentland, Thin slices of negotiation: predicting outcomes from conversational dynamics within the first 5 minutes, Journal of Applied Psychology 92 (2007) 802.
[43] D. Jayagopi, D. Sanchez-Cortes, K. Otsuka, J. Yamato, D. Gatica-Perez, Linking speaking and looking behavior patterns with group composition, perception, and performance, in: Proceedings of the 14th ACM International Conference on Multimodal Interaction, ACM, 2012, pp. 433–440.
[44] L. S. Nguyen, D. Frauendorfer, M. S. Mast, D. Gatica-Perez, Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior, IEEE Transactions on Multimedia 6 (2013) 1018–1031.
[45] J. Biel, D. Gatica-Perez, The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs, IEEE Transactions on Multimedia 15 (2013) 41–55.
[46] L. Devillers, G. D. Duplessis, Toward a context-based approach to assess engagement in human-robot social interaction, in: Dialogues with Social Robots, Springer, 2017, pp. 293–301.
[47] E. Thelisson, K. Sharma, H. Salam, V. Dignum, The General Data Protection Regulation: An opportunity for the HCI community?, in: Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–8.
[48] J. Greczek, E. Short, C. E. Clabaugh, K. Swift-Spong, M. Mataric, Socially assistive robotics for personalized education for children, in: AAAI Fall Symposium Series, 2014, pp. 1–3.
[49] C. Kasari, A. Sturm, W. Shih, Smarter approach to personalizing intervention for children with autism spectrum disorder, Journal of Speech, Language, and Hearing Research 61 (2018) 2629–2640.
[50] D. B. Jayagopi, S. Sheikhi, D. Klotz, J. Wienke, J.-M. Odobez, S. Wrede, V. Khalidov, L. Nguyen, B. Wrede, D. Gatica-Perez, The Vernissage corpus: A multimodal human-robot-interaction dataset, Technical Report, Bielefeld University, 2012.
[51] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, H. Sloetjes, ELAN: a professional framework for multimodality research, in: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006, pp. 1556–1559.
[52] M. F. Mason, E. P. Tatkow, C. N. Macrae, The look of love: gaze shifts and person perception, Psychological Science 16 (2005) 236–239.
[53] C. L. Sidner, C. Lee, N. Lesh, Engagement when looking: behaviors for robots when collaborating with people, in: Diabruck: Proceedings of the 7th Workshop on the Semantics and Pragmatics of Dialogue, University of Saarland, 2003, pp. 123–130.
[54] S. Baron-Cohen, Mindblindness: An essay on autism and theory of mind, MIT Press, 1997.