<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distinguishing Engagement Facets: An Essential Component for AI-based Interactive Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanan Salam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University Abu Dhabi</institution>
          ,
          <addr-line>PO Box 129188, Saadiyat Island, Abu Dhabi</addr-line>
          ,
          <country country="AE">United Arab Emirates</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Social Machines &amp; Robotics (SMART) Lab, Center of AI &amp; Robotics</institution>
          ,
          <addr-line>CAIR</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Engagement in Human-Machine Interaction is the process by which entities participating in the interaction establish, maintain, and end their perceived connection. It is essential to monitor the engagement state of patients in various AI-based interactive healthcare paradigms. This includes medical conditions that alter social behavior, such as Autism Spectrum Disorder (ASD) or Attention-Deficit/Hyperactivity Disorder (ADHD). Engagement is a multi-faceted construct composed of behavioral, emotional, and mental components. Previous research has neglected this multi-faceted nature of engagement and focused on the detection of an engagement level or a binary engagement label. In this paper, a system is presented to distinguish these facets using contextual and relational features. This can facilitate further fine-grained analysis. Several machine learning classifiers, including traditional and deep learning models, are compared for this task. An F-Score of 0.74 was obtained on a balanced dataset of 22242 instances with neural network-based classification. The proposed framework shall serve as a baseline for further research on engagement facet recognition and its integration in socially assistive robotic applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Engagement Recognition</kwd>
        <kwd>Interactive Healthcare</kwd>
        <kwd>Affective Computing</kwd>
        <kwd>Human-Robot Interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia. * Corresponding author: hanan.salam@nyu.edu, https://wp.nyu.edu/smartlab/, ORCID 0000-0001-6971-5264 (H. Salam). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>The corpus was chosen due to the relevance of its interaction scenario (educational followed by competitive context) to the use of AI-based interactive healthcare systems, for instance SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario can be adopted for the characterization of ADHD, since such a context would solicit attention cues which are normally impaired in the case of ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform the initial validation of the framework.</p>
      <p>2. Related Work</p>
      <p>Engagement in Human-Robot Interaction is defined as the process by which two (or more) participants establish, maintain, and end their perceived connection [16, 17]. Andrist et al. [18] analyzed an HRI dataset in terms of interaction type, quality, problem types, and the system's failure points causing problems. Failure in the engagement component was found to be among the major identified problem causes during the interaction. This confirms that a highly performing engagement model is essential for the success of any HRI scenario [18].</p>
      <p>Bohus &amp; Horvitz [19] pioneered research on engagement in multi-party interaction. They explored disparate engagement strategies to allow robots to engage simultaneously with multiple users. There are multifarious studies based on multi-party interactions. Oertel et al. [20] studied, at both the individual and the group level, the relationship between the participants' gaze and speech behavior. Leite et al. [<xref ref-type="bibr" rid="ref16">13</xref>] experimented with the generalization capacity of an engagement model: it was trained and tested on single-party and multi-party scenarios respectively, and the opposite scenario was also considered. Salam et al. [21] conducted a study on engagement recognition in a triadic HRI scenario and showed that it is possible to infer a participant's engagement state based on the other participants' cues.</p>
      <p>Most engagement inference approaches revolved around the identification of a person's intention to engage. There have also been studies to detect whether the person is engaged/disengaged. Benkaouar et al. [22] presented a system to detect disparate engagement phases, including intention to engage, engaged, and disengaged. Foster et al. [12] attempted to detect whether a person intends to engage, which is a bi-class problem. Leite et al. [<xref ref-type="bibr" rid="ref16">13</xref>] attempted to identify disengagement in both group and individual interactions. Ben et al. [23] also presented a system dedicated to a similar cause.</p>
      <p>Different works focused on detecting different levels of engagement of a user. Michalowski et al. [24] distinguished different levels of engagement, namely present, interacting, engaged, and just attending. A system to distinguish two classes of engagement, namely medium-high to high and medium-high to low engagement, was presented by [14]. Bednarik et al. [25] distinguished disparate states of conversational engagement, including no interest, following, responding, conversing, influencing, and managing; they also modelled a bi-class problem with low/high conversational level. Oertel et al. [20] distinguished 4 classes for group involvement, namely high, low, leader steering the conversation, and group forming itself. Two models were developed in [26], focusing on not-engaged/engaged and not-engaged/normally-engaged/very-engaged state distinction. Frank et al. [27] differentiated 6 different states of engagement, namely disengagement, involved engagement, relaxed engagement, intention to act, action, and involved action.</p>
      <p>Recently, [28] stated that engagement in HRI should be considered multi-faceted. Formulating a binary problem (engaged vs. not engaged) or a multi-class problem (engagement level) ignores this multi-faceted nature of engagement. Taking the multi-faceted nature into consideration is very important for the design of intelligent social agents; for instance, it can influence the engagement strategies implemented within the agent's architecture. Some studies attempted to implement different strategies related to task and social engagement. For instance, [29] implemented a task engagement strategy, which focuses on the task at hand and has users meta-cognitively reflect on the robot's performance, and a social engagement strategy, which focuses on users' enjoyment and has them meta-cognitively reflect on their emotions with respect to the activity and the group interactions.</p>
      <p>Different features have been used to distinguish engagement states, including contextual [30, 14, 21], attentional [31, 32], and affective [14, 12, 26, 33] features, to name a few. Salam et al. [34] used personality to detect both individual and group engagement. [35, 36] combined different aspects like backchannels, eye gaze, and head nodding-based features to detect engagement level. Ben et al. [23] combined several attributes, like speech and facial expressions, gaze and head motion, and distance to the robot, to identify disengagement. Masui et al. [33] worked with facial Action Units and physiological responses. Recent approaches explored deep learning architectures for the detection of engagement. Dewan et al. [26] used person-independent edge features and Kernel Principal Component Analysis (KPCA) within a deep learning framework to detect online learners' engagement using facial expressions. [37] used CNN and LSTM networks to predict engagement level. [38] proposed adaptive deep architectures for different user groups for predicting engagement in robot-mediated collaborative learning.</p>
      <p>Contextual information has been used in social signal processing for quite some time. Kapoor et al. [39] combined context features in the form of game state with facial and posture features in an online educative scenario. Martinez and Yannakakis [40] used sequence mining for the prediction of computer game players' affective states. Castellano et al. [14] explored task and social-based contextual features. In another instance, the authors [41] used the same contextual features for distinguishing interaction quality.</p>
      <p>Relational features have proven to be useful in multifarious instances. Curhan et al. [42] used dyad-based cues for predicting negotiation outcomes. Jayagopi et al. [43] adhered to group-based cues to understand typical behavior in small groups. Nguyen et al. [44] extracted relational audio-visual cues to detect the suitability of an applicant in a job interview; the features included audio and visual back-channeling, nodding while speaking, and mutual short utterances and nods. In [45], a “looking-while-speaking” feature was used to understand personality impressions from conversational logs extracted from YouTube.</p>
      <p>So far, context has been insufficiently investigated in the avenue of affective and cognitive states. Devillers et al. [46] highlight the importance of context in the assessment of engagement. They identified paralinguistic, linguistic, non-verbal, interactional, and specific emotional and mental state-based features as very important for engagement prediction. In this work, we investigate relational and contextual features for the recognition of a spectrum of engagement states. The features have been used in isolation as well as in combination to assess their engagement state distinction capability. These features have not been combined previously for detecting engagement facets. Compared to previous works, the proposed features model the interaction context, the robot's behavior, and the behavioral relation between the participant in question and the other entities of the interaction.</p>
      <p>Children with Autism Spectrum Disorder are a target population for personalized teaching systems [49]. However, most existing systems do not include a user engagement analysis module. Such Socially Assistive systems can largely benefit from a fine-grained analysis of engagement, which would make them more human-like. There has also been interest in automated screening and consultation to detect problems of the body and mind at an early stage, which can reduce the initial load on doctors. It is very important for the patients to feel that they are interacting with their peers rather than with a machine. Such systems need to process both audio and visual cues in order to properly understand patients. While the patients are interacting with the automated systems, several states of engagement need to be monitored simultaneously, including level of concentration, different reactions, and spontaneity, to name a few. Such states of engagement portray useful information about a patient's health. These engagement states can be categorized into a broader spectrum of behavioral, mental, and emotional states. Distinguishing the engagement facet is important at the outset for a deeper analysis. This can pave the way for systems which better understand the condition of patients by reading their body language rather than merely matching spoken symptoms. This will especially be useful in treating and understanding mental conditions where body language is a vital aspect. In the case of psychological problems, patients are often engaged in conversations regarding disparate aspects by doctors, wherein the patient's body language serves as a vital pointer towards the mental condition.</p>
      <p>4. Proposed Framework</p>
      <sec id="sec-1-1">
        <title>The proposed framework is composed of 3 steps.</title>
        <p>First, a multi-party HRI corpus is annotated in terms of engagement facets. Then, different contextual and relational features are extracted. Finally, different standard classifiers are used to classify the different engagement facets. Fig. 1 presents an illustration of the proposed framework.</p>
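        <p>The three steps above can be sketched end-to-end as follows. This is a minimal illustration, not the authors' implementation: the feature values, hyperparameters, and helper names are assumptions.</p>
        <preformat>
```python
# Sketch of the 3-step pipeline: annotated corpus -> feature
# extraction -> engagement facet classification.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FACETS = ["behavioral", "emotional", "mental"]

def classify_facets(X_train, y_train, X_test):
    """Train an ANN on extracted features, then predict facets."""
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                      random_state=0),
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Toy usage with random 39-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 39))
y = np.array(FACETS * 20)
preds = classify_facets(X, y, X[:5])
```
        </preformat>
        <p>In practice, X would hold the contextual and relational features of Section 4.4 and y the annotated facet labels of Section 4.3.</p>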
        <sec id="sec-1-1-1">
          <title>4.1. Data Corpus</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Need for Engagement Recognition in Interactive Healthcare</title>
      <p>Technological advancements have propagated to every field. There have always been efforts to automate tasks. Healthcare is one of the primal needs of society, and it too has been touched by technology [47]. Several interactive systems have come up to aid automated healthcare and the well-being of people with medical conditions. In [8], an SAR is proposed whose aim is to help children with ADHD to improve their educational outcome through social interaction with a robot. Another educational SAR was presented by [48]; it was targeted towards providing assistance in personalizing education in classrooms.</p>
      <p>In this section, the data corpus along with the disparate engagement annotations is discussed.</p>
      <p>4.2. Interaction Scenario &amp; Modalities</p>
      <p>We use 4 interactions of 8 participants from the conversational HRI data corpus ‘Vernissage’ [50]. It is a multi-party interaction amidst the humanoid robot NAO and 2 participants. The interaction has different contexts which can mainly be differentiated into 2 parts. The 1st is where the robot describes several paintings hung on a wall (informative/educational context). In the 2nd, the robot performs a quiz with the volunteers related to art and culture (competitive context). This was done in order to encompass different variations of the engagement states.</p>
      <p>[Fig. 1: overview of the proposed framework: input feature sets (relational, contextual, relational+contextual), feature extraction, and artificial neural network-based classification of the engagement facets.]</p>
      <p>This corpus was chosen since its interaction scenario is relevant to the use of SAR for neuro-developmental disorders such as ADHD and ASD, which might benefit from a multi-faceted engagement model. For instance, an educational scenario like the one in the first part of the Vernissage corpus can be adopted for the characterization of ADHD, since an educational/informative scenario would solicit attention cues which are normally impaired in the case of ADHD individuals. We are aware that for the study to be complete, it should be validated in the context of an SAR scenario. However, the lack of such a dataset has led us to choose a proxy dataset to perform our initial validation of the framework. NAO's internal camera was used to record the clips.</p>
      <p>This provided the front view. 3 other cameras were also used to get the left, right, and rear feeds. Fig. 2 shows the organization of the recording room.</p>
      <p>The corpus has annotations for the non-verbal behaviors of the participants. It also contains the robot's speech and actions in the robot's log file.</p>
      <sec id="sec-2-1">
        <title>4.3. Engagement Annotations</title>
        <p>Engagement labels were assigned to 3 categories, namely mental, behavioral, and emotional. These were annotated when the participants manifested one of the following states: thinking, listening, positive/negative reaction, responding, waiting for feedback, concentrating, and listening to the other participant. The annotations were performed by 2 people with the aid of the Elan annotation tool (https://tla.mpi.nl/tools/tla-tools/elan/) [51]. They watched every video 2 times, once from the perspective of each participant. Discrete segments were annotated; a segment was ended as soon as a change of state was observed. The mean inter-rater Cronbach's Alpha coefficient was 0.93. This points to the reliability of the annotations. The details of each category are as follows.</p>
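        <p>The reported reliability figure can be reproduced with a standard Cronbach's Alpha computation. A minimal sketch, assuming the two annotators' per-segment ratings are available as numeric lists (the rating values below are illustrative):</p>
        <preformat>
```python
# Inter-rater reliability via Cronbach's Alpha:
# alpha = k/(k-1) * (1 - sum(rater variances) / variance of totals)
from statistics import pvariance

def cronbach_alpha(ratings):
    """ratings: one list of per-segment ratings per rater."""
    k = len(ratings)                              # number of raters
    item_vars = sum(pvariance(r) for r in ratings)
    totals = [sum(seg) for seg in zip(*ratings)]  # per-segment sums
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Two raters in perfect agreement yield alpha = 1.0.
rater1 = [1, 2, 3, 2, 1, 3, 2]
rater2 = [1, 2, 3, 2, 1, 3, 2]
alpha = cronbach_alpha([rater1, rater2])
```
        </preformat>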
        <sec id="sec-2-1-1">
          <title>Mental states – A segment was assigned a mental state label when the participant manifested one of the following mental states:</title>
          <p>• Listening (EL): The participant is listening to NAO;</p>
          <p>• WaitingFeedback (EWF): The participant is waiting for NAO's feedback after he/she had answered a question;</p>
          <p>• Thinking (ETh): The participant is thinking about the response to a question asked by NAO;</p>
          <p>• Concentrating (EC): The participant is concentrating with NAO.</p>
          <p>The average length per interaction is nearly 11 minutes.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.4. Extracted Features</title>
        <sec id="sec-2-2-1">
          <title>In this study, we used the annotated cues from the Vernissage corpus.</title>
          <p>Moreover, we extracted additional metrics computed from the existing cues. They were categorized into two categories: 1) contextual and 2) relational.</p>
          <p>Contextual features deal either with the different entities of an interaction, like the robot's utterance, addressee, and topic of speech, or with behavioral aspects of the participant that concern the interaction context, like the visual focus of attention and the addressee.</p>
          <p>Relational features encode the behavioral relation between the participants and the robot. Fig. 3 illustrates the feature groups used in our study.</p>
          <p>4.4.1. Contextual Features</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Interaction amidst entities involves both the entities and their connection.</title>
          <p>While inferring the engagement state of an interacting person, we consider the behavior of the person as well as our own behavior. Thus, an automated engagement identification system should also consider the same.</p>
          <p>Consequently, we employ different contextual features that describe the participant's behavior with respect to the other entities. Moreover, for each dialogue of the robot, we extract the robot's utterance, addressee, and topic of speech.</p>
          <p>Participant:</p>
          <p>1) Visual Focus Of Attention (VFOA): Gaze in human-human social interactions is considered the primary cue of attention [52, 53]. We use the VFOA ground truth of every participant, which was annotated with 9 labels.</p>
          <p>2) VFOA Shifts: Gaze shifts indicate people's engagement/disengagement with specific environmental stimuli [54]. We define a VFOA shift as the moment when a participant shifts attention to a different subject. This feature is binary and is computed from the VFOA labels.</p>
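          <p>The shift feature described above can be sketched directly from a label sequence. The label strings below are illustrative, not the corpus' actual 9 VFOA labels:</p>
          <preformat>
```python
# Binary VFOA-shift indicator: 1 whenever the annotated focus
# label differs from the previous frame's label, 0 otherwise.
def vfoa_shifts(labels):
    """Compute the per-frame shift flags from a VFOA label sequence."""
    return [0] + [int(cur != prev)
                  for prev, cur in zip(labels, labels[1:])]

# Hypothetical frame labels for one participant.
frames = ["nao", "nao", "painting1", "painting1", "right_participant"]
shifts = vfoa_shifts(frames)  # -> [0, 0, 1, 0, 1]
```
          </preformat>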
        </sec>
        <sec id="sec-2-2-3">
          <title>3) Addressee</title>
          <p>When addressing somebody, we are engaged with him/her. Similarly, in the context of HRI, when a participant addresses someone other than the robot, he/she is disengaged from the robot. Addressee annotations are used from the corpus and are annotated into 6 classes: {NoLabel, Nao, Group, PRight, PLeft, Silence}.</p>
          <p>Robot: Starting from the robot’s conversation logs, the
following were extracted.</p>
        </sec>
        <sec id="sec-2-2-4">
          <title>1) Utterances</title>
          <p>The labels {Speech, Silence} were assigned to frames depending on the robot's speech activity.</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>2) Addressee</title>
          <p>The addressee of the robot was detected using predefined words from its speech. The following labels were assigned: {Person1, Person2, GroupExplicit, GroupPerson1, GroupPerson2, Person1Group, Person2Group, Group, Silence}. The ‘GroupExplicit’ label refers to segments where the robot was explicitly addressing both participants. ‘GroupPersonX’, X ∈ {1, 2}, corresponds to segments where the robot addresses the group and then ‘PersonX’, while ‘PersonXGroup’ represents the inverse.</p>
          <p>3) Topic of Speech: This was identified using a keyword set related to the disparate paintings available in the scene: {manray, warhol, arp, paintings}. Frames were allotted labels based on these keywords.</p>
          <p>3) P1 Looks at P2 / P2 Talks to Robot: This represents events where the passive participant looks at the active participant while he/she is talking to the robot. Though this may appear to be disengagement, analysis revealed the inverse.</p>
        </sec>
        <sec id="sec-2-2-6">
          <title>The total number of features is 39.</title>
          <p>There were 34 contextual features and 5 relational features.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.5. Engagement Facets Classification</title>
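        <p>Section 5 compares an ANN, Linear Logistic Regression, SVM, Bayes Net, Naive Bayes, and a deep RNN. A minimal scikit-learn sketch of such a comparison (the hyperparameters are assumptions, and the Bayes Net and RNN models are omitted here):</p>
        <preformat>
```python
# Sketch of a classifier comparison like the one in Table 2.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "ANN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                         random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
}

def compare(X, y, folds=5):
    """Mean cross-validated accuracy per classifier."""
    return {name: cross_val_score(clf, X, y, cv=folds).mean()
            for name, clf in CLASSIFIERS.items()}

# Toy usage on synthetic feature vectors and facet labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = np.where(X[:, 0] > 0, "behavioral", "emotional")
scores = compare(X, y)
```
        </preformat>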
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results and Discussion</title>
      <p>The proposed framework is evaluated using 5-fold cross-validation. As the data was highly imbalanced, a subset of the data was drawn having an equal number of instances per class, totalling 22242 instances. The combined features (contextual+relational) for this dataset were used to train the different classifiers presented in section 4.5.</p>
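      <p>The class-balancing step described above can be sketched as downsampling every facet to the size of the rarest one. The sampling strategy and seed are assumptions:</p>
      <preformat>
```python
# Draw an equal number of instances per class before cross-validation.
import random
from collections import defaultdict

def balance(instances, labels, seed=0):
    """Downsample every class to the size of the rarest class."""
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    n = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    bal_x, bal_y = [], []
    for y, xs in sorted(by_class.items()):
        for x in rng.sample(xs, n):
            bal_x.append(x)
            bal_y.append(y)
    return bal_x, bal_y

# Toy usage: 60/25/15 instances become 15 per class (45 total).
X = list(range(100))
y = ["behavioral"] * 60 + ["emotional"] * 25 + ["mental"] * 15
Xb, yb = balance(X, y)
```
      </preformat>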
      <sec id="sec-3-1">
        <title>5.1. Comparative performance analysis of standard classifiers</title>
        <p>Table 2 presents the results of training different classifiers on the combined features. From the table, we can state that the best performing classifier was the ANN, with an accuracy of 74.57%, followed by the Linear Logistic Regression model (70.35%). The performances of the SVM and the Bayes Net were very close (69.6%), followed by Naive Bayes (68.61%). Surprisingly, the lowest accuracy of 57.68% was obtained using the deep RNN. This might be due to the fact that the number of samples is not sufficient.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions and future work</title>
      <p>We analyse the performance of the best performing classifier (ANN) on the different engagement facets (behavioral, emotional, mental). The corresponding confusion matrix is presented in Table 3. The values of different performance metrics (true positive rate, false positive rate, precision, recall, and F-score) for each of the classes are also presented in Table 4.</p>
      <p>It is noted that the best performance was obtained for the behavioral class, with an F-score of 0.794. This was followed by the mental class, where an F-score of 0.730 was obtained. The lowest performance was obtained for the emotional class, with an F-score of 0.707. This lower performance for the emotional class might be explained by the fact that the current features might not be highly correlated with the emotional states, and more correlated with the other engagement facets. It might be worth investigating other relevant features in the future.</p>
      <p>Looking at the confusion matrix, we can see that the most confused pair was behavioral-emotional, where 1318 instances were predicted as behavioral when they actually belonged to the emotional class. This is followed by the mental-emotional pair, with 1203 instances mis-classified as mental when their actual label was emotional. Similarly, 1153 mental instances were mis-classified as emotional. The high confusion between the mental and emotional engagement states is expected, as these states might exhibit similar non-verbal cues. The confusion between the behavioral and emotional states is less evident; it might be due to the used features, which are not sufficient to precisely predict the emotional states.</p>
      <p>In this paper, we proposed a system to detect different facets of engagement states: mental, emotional, and behavioral. This is the first engagement framework of its kind to propose such a classification of engagement. This is essential for a deeper analysis of the user's engagement by machines. In the context of AI-based healthcare systems such as socially assistive robots, such fine-grained analysis would improve performance and facilitate adaptive interventions. For instance, recognizing whether the user's engagement is emotional, behavioral, or mental might better inform AI-based healthcare systems, especially those that rely on interactive systems (e.g. SAR for ADHD or ASD). The proposed framework was validated on an HRI corpus exhibiting educational and competitive contexts, which are relevant to AI-based interactive systems. The preliminary results show that it is possible to classify engagement facets with a relatively acceptable accuracy. These results shall serve as a baseline for the development of more accurate systems. In the future, we plan to validate the framework on a larger dataset that exhibits an SAR scenario. We plan to work with individual features to improve the system's performance and perform an in-depth analysis of the different states. We will also explore deep learning-based and unsupervised approaches towards the detection of engagement state types and thereafter finer classification. Deep learning will be used not only for data classification but also for feature extraction.</p>
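      <p>The per-class metrics in Tables 3 and 4 can be derived from a confusion matrix as follows; the counts below are illustrative only, not the paper's Table 3:</p>
      <preformat>
```python
# Per-class precision, recall and F-score from a confusion matrix
# whose rows are true labels and whose columns are predictions.
def per_class_metrics(conf, classes):
    metrics = {}
    for i, name in enumerate(classes):
        tp = conf[i][i]
        fp = sum(conf[r][i] for r in range(len(classes))) - tp
        fn = sum(conf[i]) - tp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        metrics[name] = (precision, recall, f1)
    return metrics

classes = ["behavioral", "emotional", "mental"]
conf = [[80, 10, 10],   # illustrative counts only
        [15, 70, 15],
        [10, 20, 70]]
metrics = per_class_metrics(conf, classes)
```
      </preformat>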
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported in part by the NYUAD Center for Artificial Intelligence and Robotics, funded by Tamkeen under the NYUAD Research Institute Award CG010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          9th IEEE International Workshop on Advanced Mo- robot interactions,
          <source>IEEE Access 5</source>
          (
          <year>2016</year>
          )
          <fpage>705</fpage>
          -
          <lpage>721</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
            <given-names>Control</given-names>
          </string-name>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>762</fpage>
          -
          <lpage>767</lpage>
          . [35]
          <string-name>
            <given-names>K.</given-names>
            <surname>Inoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takanashi</surname>
          </string-name>
          , T. Kawahara, En[25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bednarik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eivazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hradis</surname>
          </string-name>
          ,
          <article-title>Gaze and conver- gagement recognition in spoken dialogue via neural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>tion: An annotation scheme and classification of els</article-title>
          ., in: Interspeech,
          <year>2018</year>
          , pp.
          <fpage>616</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>high and low levels of engagement</article-title>
          , in: Proceed- [36]
          <string-name>
            <given-names>K.</given-names>
            <surname>Inoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takanashi</surname>
          </string-name>
          , T. Kawahara, Latent
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>ings of the 4th Workshop on Eye Gaze in Intelligent character model for engagement recognition based</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Human</given-names>
            <surname>Machine</surname>
          </string-name>
          <string-name>
            <surname>Interaction</surname>
          </string-name>
          , ACM,
          <year>2012</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . on multimodal behaviors, in: 9th International [26]
          <string-name>
            <given-names>M. A. A.</given-names>
            <surname>Dewan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Wen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Murshed</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Uddin</surname></string-name>,
          <article-title>A deep learning approach to detecting engagement of online learners</article-title>
          , in:
          <source>IEEE SmartWorld, Ubiquitous Intelligence &amp; Computing, Advanced &amp; Trusted Computing, Scalable Computing &amp; Communications, Cloud &amp; Big Data Computing, Internet of People and Smart City Innovation</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1895</fpage>
          -
          <lpage>1902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27]
          <string-name><given-names>M.</given-names> <surname>Frank</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Tofighi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gu</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Fruchter</surname></string-name>,
          <article-title>Engagement detection in meetings</article-title>, arXiv preprint
          <source>arXiv:1608.08711</source> (<year>2016</year>).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28]
          <string-name><given-names>L.</given-names> <surname>Devillers</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Rosset</surname></string-name>,
          <string-name><given-names>G. D.</given-names> <surname>Duplessis</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bechade</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yemez</surname></string-name>,
          <string-name><given-names>B. B.</given-names> <surname>Turker</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sezgin</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Erzin</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>El Haddad</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Dupont</surname></string-name>, et al.,
          <article-title>Multifaceted engagement in social interaction with a machine: The JOKER project</article-title>, in:
          <source>13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG)</source>, IEEE,
          <year>2018</year>, pp. <fpage>697</fpage>-<lpage>701</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29]
          <string-name><given-names>L.</given-names> <surname>El Hamamsy</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Johal</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Asselborn</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Nasir</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dillenbourg</surname></string-name>,
          <article-title>Learning by collaborative teaching: An engaging multi-party cowriter activity</article-title>, in:
          <source>28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)</source>, IEEE,
          <year>2019</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30]
          <string-name><given-names>A.</given-names> <surname>Kapoor</surname></string-name>,
          <string-name><given-names>R. W.</given-names> <surname>Picard</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Ivanov</surname></string-name>,
          <article-title>Probabilistic combination of multiple modalities to detect interest</article-title>, in:
          <source>Proceedings of the 17th International Conference on Pattern Recognition</source>, volume <volume>3</volume>, IEEE,
          <year>2004</year>, pp. <fpage>969</fpage>-<lpage>972</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31]
          <string-name><given-names>S.-S.</given-names> <surname>Yun</surname></string-name>,
          <string-name><given-names>M.-T.</given-names> <surname>Choi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>J.-B.</given-names> <surname>Song</surname></string-name>,
          <article-title>Intention reading from a fuzzy-based human engagement model and behavioural features</article-title>,
          <source>International Journal of Advanced Robotic Systems</source> (<year>2012</year>).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32]
          <string-name><given-names>F.</given-names> <surname>Papadopoulos</surname></string-name>,
          <string-name><given-names>L. J.</given-names> <surname>Corrigan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Castellano</surname></string-name>,
          <article-title>Learner modelling and automatic engagement recognition with robotic tutors</article-title>, in:
          <source>2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII 2013)</source>, IEEE,
          <year>2013</year>, pp. <fpage>740</fpage>-<lpage>744</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33]
          <string-name><given-names>K.</given-names> <surname>Masui</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Okada</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Tsumura</surname></string-name>,
          <article-title>Measurement …</article-title>,
          <source>ITE Transactions on Media Technology and Applications</source> <volume>8</volume> (<year>2020</year>) <fpage>49</fpage>-<lpage>59</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34]
          <string-name><given-names>H.</given-names> <surname>Salam</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Celiktutan</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Hupont</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gunes</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chetouani</surname></string-name>,
          <article-title>Fully automatic analysis of engagement and its relationship to personality in human-robot interactions</article-title>,
          <source>IEEE Access</source> <volume>5</volume> (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] …, in:
          <source>Workshop on Spoken Dialogue System Technology</source>, Springer,
          <year>2019</year>, pp. <fpage>119</fpage>-<lpage>130</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37]
          <string-name><given-names>F.</given-names> <surname>Del Duchetto</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Baxter</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hanheide</surname></string-name>,
          <article-title>Are you still with me? Continuous engagement assessment from a robot's point of view</article-title>,
          <source>arXiv:2001.03515</source> (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38]
          <string-name><given-names>V. V.</given-names> <surname>Chithrra Raghuram</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Salam</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Nasir</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Bruno</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Celiktutan</surname></string-name>,
          <article-title>Personalized productive engagement recognition in robot-mediated collaborative learning</article-title>, in:
          <source>Proceedings of the 2022 International Conference on Multimodal Interaction</source>,
          <year>2022</year>, pp. <fpage>632</fpage>-<lpage>641</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39]
          <string-name><given-names>A.</given-names> <surname>Kapoor</surname></string-name>,
          <string-name><given-names>R. W.</given-names> <surname>Picard</surname></string-name>,
          <article-title>Multimodal affect recognition in learning environments</article-title>, in:
          <source>13th annual ACM international conference on Multimedia</source>, ACM,
          <year>2005</year>, pp. <fpage>677</fpage>-<lpage>682</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40]
          <string-name><given-names>H. P.</given-names> <surname>Martínez</surname></string-name>,
          <string-name><given-names>G. N.</given-names> <surname>Yannakakis</surname></string-name>,
          <article-title>Mining multimodal sequential patterns: a case study on affect detection</article-title>, in:
          <source>Proceedings of the 13th international conference on multimodal interfaces</source>, ACM,
          <year>2011</year>, pp. <fpage>3</fpage>-<lpage>10</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41]
          <string-name><given-names>G.</given-names> <surname>Castellano</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Leite</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Paiva</surname></string-name>,
          <article-title>Detecting perceived quality of interaction with a robot using contextual features</article-title>,
          <source>Autonomous Robots</source> (<year>2016</year>) <fpage>1</fpage>-<lpage>17</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42]
          <string-name><given-names>J. R.</given-names> <surname>Curhan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pentland</surname></string-name>,
          <article-title>Thin slices of negotiation: Predicting outcomes from conversational dynamics within the first 5 minutes</article-title>,
          <source>Journal of Applied Psychology</source> <volume>92</volume> (<year>2007</year>) <fpage>802</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43]
          <string-name><given-names>D.</given-names> <surname>Jayagopi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Sanchez-Cortes</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Otsuka</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yamato</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Gatica-Perez</surname></string-name>,
          <article-title>Linking speaking and looking behavior patterns with group composition, perception, and performance</article-title>, in:
          <source>Proceedings of the 14th ACM international conference on Multimodal interaction</source>, ACM,
          <year>2012</year>, pp. <fpage>433</fpage>-<lpage>440</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>[44]
          <string-name><given-names>L. S.</given-names> <surname>Nguyen</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Frauendorfer</surname></string-name>,
          <string-name><given-names>M. S.</given-names> <surname>Mast</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Gatica-Perez</surname></string-name>,
          <article-title>Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior</article-title>,
          <source>IEEE Transactions on Multimedia</source> <volume>16</volume> (<year>2014</year>) <fpage>1018</fpage>-<lpage>1031</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>[45]
          <string-name><given-names>J.</given-names> <surname>Biel</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Gatica-Perez</surname></string-name>,
          <article-title>The youtube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs</article-title>,
          <source>IEEE Transactions on Multimedia</source> <volume>15</volume> (<year>2013</year>) <fpage>41</fpage>-<lpage>55</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>[46]
          <string-name><given-names>L.</given-names> <surname>Devillers</surname></string-name>,
          <string-name><given-names>G. D.</given-names> <surname>Duplessis</surname></string-name>,
          <article-title>Toward a context-based approach to assess engagement in human-robot interaction</article-title>, in:
          <source>Dialogues with Social Robots</source>, Springer,
          <year>2017</year>, pp. <fpage>293</fpage>-<lpage>301</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>[47]
          <string-name><given-names>E.</given-names> <surname>Thelisson</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Sharma</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Salam</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Dignum</surname></string-name>,
          <article-title>The …</article-title>, in:
          <source>Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems</source>,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>[48]
          <string-name><given-names>J.</given-names> <surname>Greczek</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Short</surname></string-name>,
          <string-name><given-names>C. E.</given-names> <surname>Clabaugh</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Swift-Spong</surname></string-name>, …, in:
          <source>AAAI Fall Symposium Series</source>,
          <year>2014</year>, pp. <fpage>1</fpage>-<lpage>3</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>[49]
          <string-name><given-names>C.</given-names> <surname>Kasari</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sturm</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Shih</surname></string-name>,
          <article-title>SMARTer approach to personalizing intervention for children with autism spectrum disorder</article-title>,
          <source>Journal of Speech, Language, and Hearing Research</source> <volume>61</volume> (<year>2018</year>) <fpage>2629</fpage>-<lpage>2640</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>[50]
          <string-name><given-names>D. B.</given-names> <surname>Jayagopi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Sheikhi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Klotz</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wienke</surname></string-name>,
          <string-name><given-names>J.-M.</given-names> <surname>Odobez</surname></string-name>, et al., …,
          Technical Report, Bielefeld University,
          <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>[51]
          <string-name><given-names>P.</given-names> <surname>Wittenburg</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Brugman</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Russel</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Klassmann</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Sloetjes</surname></string-name>,
          <article-title>ELAN: a professional framework for multimodality research</article-title>, in:
          <source>Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC)</source>,
          <year>2006</year>, pp. <fpage>1556</fpage>-<lpage>1559</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>[52]
          <string-name><given-names>M. F.</given-names> <surname>Mason</surname></string-name>,
          <string-name><given-names>E. P.</given-names> <surname>Tatkow</surname></string-name>,
          <string-name><given-names>C. N.</given-names> <surname>Macrae</surname></string-name>,
          <article-title>The look of love: Gaze shifts and person perception</article-title>,
          <source>Psychological Science</source> <volume>16</volume> (<year>2005</year>) <fpage>236</fpage>-<lpage>239</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>[53]
          <string-name><given-names>C. L.</given-names> <surname>Sidner</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Lesh</surname></string-name>,
          <article-title>Engagement when looking: behaviors for robots when collaborating with people</article-title>, in:
          <source>Diabruck: Proceedings of the 7th Workshop on the Semantics and Pragmatics of Dialogue</source>, University of Saarland,
          <year>2003</year>, pp. <fpage>123</fpage>-<lpage>130</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>[54]
          <string-name><given-names>S.</given-names> <surname>Baron-Cohen</surname></string-name>,
          <article-title>Mindblindness: An essay on autism and theory of mind</article-title>, MIT press,
          <year>1997</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>