=Paper=
{{Paper
|id=Vol-3649/paper16
|storemode=property
|title=Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues
|pdfUrl=https://ceur-ws.org/Vol-3649/Paper16.pdf
|volume=Vol-3649
|authors=Maneesh Bilalpur,Mert Inan,Dorsa Zeinali,Jeffrey F. Cohn,Malihe Alikhani
|dblpUrl=https://dblp.org/rec/conf/aaai/BilalpurIZCA24
}}
==Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues==
Maneesh Bilalpur¹·*, Mert Inan², Dorsa Zeinali², Jeffrey F. Cohn¹ and Malihe Alikhani²
¹ University of Pittsburgh, Pittsburgh, Pennsylvania, USA
² Northeastern University, Boston, Massachusetts, USA
Abstract
Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a
significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility
and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and
cost-effective supplement to traditional caregiving methods. Crucial to these agents’ effectiveness is their ability to simulate
non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but
remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations over topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Using cues from speech prosody and language, along with the demographics of the speaker and listener, we identified significant predictors of the intensity of backchannel smiles. Based on our findings, we introduce backchannel smile production in embodied agents as a generation problem. Our attention-based generative model suggests that listener information offers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. Our user study, in which generated smiles were transferred to an embodied agent, suggests that an agent with backchannel smiles is perceived as more human-like and is an attractive alternative to an agent without backchannel smiles for non-personal conversations.
1. Introduction
Fewer than a third of the US population has sufficient access to mental health professionals [1]. This highlights the need for additional resources to help mental health professionals meet the community's demands. Problems like symptom detection and evaluating treatment efficacy have made great strides with AI [2, 3, 4], and the mental health community can greatly benefit from this AI intervention. Embodied agent-based systems, due to their multimodal behavioral capabilities, are a promising solution to support such mental health needs. However, the development of such systems presents numerous challenges.
These include the scarcity of mental health-related datasets, limited access to domain experts for designing reliable and robust systems, and the ethical considerations crucial to their design and adaptation. Among such challenges, one aspect that stands out is the agent's ability to establish a common ground with users. Addressing this is particularly crucial when the agent functions as a listener. Effective grounding in such scenarios relies heavily on multimodal non-verbal behaviors like backchannels. These subtle yet impactful cues are pivotal in building rapport and understanding between the user and the agent. Hence, understanding and incorporating these behaviors into embodied agents is not only challenging but also essential for creating a supportive and empathetic environment for individuals seeking mental health support. Addressing these challenges can pave the way for more effective, accessible, and empathetic digital mental health interventions.

Figure 1: Overview of steps for backchannel smile generation in an embodied agent in a human-agent interaction: Speaker and listener (agent) turns are used to generate the listener's response facial expression as landmarks. The landmarks are then integrated with the embodied agent and added to the conversation flow represented as a dotted arrow.

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada.
∗ Corresponding author.
mab623@pitt.edu (M. Bilalpur); inan.m@northeastern.edu (M. Inan); zeinali.d@northeastern.edu (D. Zeinali); jeffcohn@pitt.edu (J. F. Cohn); m.alikhani@northeastern.edu (M. Alikhani)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
In dyadic conversations, at any given time one person may have the floor (i.e., is speaking) while the other is listening. Backchannels (BC) refer to behaviors of the listener that do not interrupt the speaker. BCs signal attention, agreement, and emotional response to what is said. Inappropriate BC smiles, such as ones that appear too short or too long or for which the timing appears "off", can disrupt the conversational rapport and result in unsuccessful or disrupted conversations. Our objective is to understand appropriate BC smiles from dyadic conversations and how an embodied agent can employ them when interacting with a human.

Conversational agents typically realize BC smiles using rule-based systems, discriminative approaches, or sometimes simply mimicking the smiles of the speaker. Mimicking, however, fails to generalize to situations that require a contextually relevant smile, and rule-based and discriminative approaches offer limited coverage due to the diversity of smiles [5].

We present a generative approach for BC smiles in listeners to address these limitations and enable contextually relevant BC smiles in embodied agents. An overview of the approach is presented in Figure 1. Unlike existing works that solely depend on speaker behavior for BC production (see related work section), we use both speaker and listener behaviors to study how they affect the intensity and duration of the BC smile. We use cues from prosody, language, and the demographics of dyads to identify statistically significant predictors (referred to as a conditioning vector) of smiles. In addition to the audio features from both interaction participants, we leverage the conditioning vector in generating the BC smiles. In this paper, we:

1. Annotate backchannel smiles in a face-to-face interaction dataset¹ of dyads that differ in their composition of biological sex and type of relationship.
2. Present our statistical analysis to identify various speaker and listener-specific cues that significantly predict the duration and intensity of backchannel smiles.
3. Generate backchannel smiles using an attention-based generative model that uses the listener and speaker turn features with the identified significant predictors.
4. Bridge the gap between the model-based generation of non-verbal behaviors (as facial landmarks) and their physical realization by emulating the generated behavior with an embodied agent.
5. Show that our BC smile generation yields appropriate and natural-looking smiles through a user study involving the embodied agent.

¹ Data and code: https://github.com/bmaneesh/Generating-Context-Sensitive-Backchannel-Smiles/

Results suggest speaker sex, their use of negations, loudness, word count in the listener's turn, their usage of comparisons, and mean pitch are significant predictors of BC smile intensity. Our generative approach shows that taking listeners' behavior into account improves performance, and adding the conditioning vector offers significant improvements in terms of empirical metrics such as Average Pose Error (APE) and Probability of Correct Keypoints (PCK).

2. Related Work

Existing works have validated the efficacy of an agent-driven conversation in mental health dialogue and counseling situations. DeVault et al. [6], through their agent-based interviews for distress and trauma symptoms, found that participants were comfortable interacting with the agent as well as sharing intimate information. Utami and Bickmore [7] used embodied agents for couples counseling. Participants reported significantly improved affect and intimacy with their partner and generally enjoyed the agent-driven counseling session. Our work builds on this line of research to improve the BC capabilities of agents.

Backchannel behaviors were traditionally produced using a set of predefined rules based on prosodic or linguistic cues of the speaker. Both Ward and Tsukahara [8] and Benus et al. [9] have found prosodic cues (particularly pitch and its changes) to be reliable predictors for vocal BC occurrence. In contrast, we use prosody and linguistic cues from both speaker and listener to identify significant predictors of BC smiles.

In the multimodal context, Bertrand et al. [10] studied prosodic, morphological, and discourse markers for their effect on vocal and gestural backchannels (hand gestures, smiles, eyebrows), and Truong et al. [11] explored visual BCs by often limiting them to head nods and, at times, grouping different BCs into the same category [12] without accounting for their intrinsic differences. They depended on the speaker's behavior to identify the occurrence and ignored the listener. In addition to leveraging the listener behavior, we specifically study smiles because of their diversity and include both unimodal (visual) and bimodal (visual together with vocal activity) BC smiles.

Wang et al. [13] introduced diversity in generated smiles by conditioning on a specific class and sampling using a variational autoencoder. Learn2Smile [14] used the facial landmarks of the speaker to generate complete listener behavior by separately predicting the low-frequency (nods) and high-frequency (blinks) components of facial motion. Ng et al. [15] leverage the speaker and listener's motion and speech features to predict the listener's future motion information. Unlike earlier works that have been limited to facial expression generation using landmarks, their usage of 3D Morphable Models to define facial expressions offers a flexible solution to generate realistic facial expressions in the presence of diverse head orientations. These solutions focus on the entire listener's behavior and offer no insights about specific BC behaviors. Their integrations are also limited to 3D Morphable Models.

Figure 2: Distribution of speaker and listener sex across different interpersonal relationships in the annotated RealTalk dataset. Relationships are color-coded: siblings (pink), friends (orange), paternal (green), and romantic couple (grey).
The BC smiles produced in this work not only leverage the speaker and listener activity but also condition the generation on salient factors that were found to be significant predictors of smile attributes: duration (the time elapsed between the onset of a smile and its offset) and intensity (maximum amplitude of a smile). Using an embodied agent, we also bridge the gap between generated landmarks and their physical realization.

3. Dataset

One of the primary challenges in studying non-verbal behavior in mental health interactions is access to an appropriate dataset. Patient-therapist interactions or interactions with mental health professionals are access-restricted to protect the identifiable information of the individuals. As a result, we use a YouTube-based large-scale dataset of face-to-face dyadic interactions, RealTalk [16]. The RealTalk dataset consists of individuals taking turns asking predefined, intimate questions about family, dreams, relationships, illness, and mental health². We believe intimate conversations are among the closest accessible alternatives to studying BC behaviors for mental health applications. In this section, we elaborate on our contributions in terms of the annotations for BC smiles and discuss how they differ by the demographics of the dyads and features from the speaker and listener turn preceding them.

² The original videos can be accessed from https://www.youtube.com/c/TheSkinDeep

3.1. Annotating Backchannel Smiles

We manually annotated 191 BC smiles from 48 (out of 692) dyadic interactions in the RealTalk dataset. The dyads comprised male and female participants from different ethnicities, and social relationships such as siblings, paternal, romantic, and fraternal. The smiles were nearly balanced across the different interpersonal relationships (see Figure 2). An automated facial expression prediction framework [17] was used to evaluate the reliability of the manual annotations. About 83% (i.e., 158 smiles) of the 191 annotated smiles had an A-level or higher intensity. One outlier smile was dropped because of its extremely long duration. The resultant 157 smiles, along with their predicted intensity, were used in this work. In addition to the video recordings at 25 fps and 720p resolution, the dataset also contains speaker-identified turn-level text obtained through automatic transcription [18]. The individuals in the dyadic interaction occupied fixed positions (left and right) in the videos. In this work, the biological sex of the participants was inferred from the videos. Videos where sex could not be established with confidence were discarded.

3.2. Effect of Sex and Relationship on Smile Attributes

Given the various interpersonal relationships in the dataset of individuals of both sexes, we compared the mean duration of backchannel smiles across the factors using ANOVA (Table 1) with type-III sums of squares to account for the imbalance between males and females. Two-way interactions between the sexes, and between sex and relationship, were also included. The ANOVA analysis suggests that the duration of backchannel smiles differs significantly by listener sex and by the interaction of listener sex and relationship. A post hoc Tukey test revealed that male listeners, when interacting with their siblings (regardless of speaker sex), express longer BC smiles (p<0.05).

Similarly, the intensity of smiles marginally differed by the speaker's sex. The post hoc Tukey test revealed that smiles in response to a male speaker are less intense than those in response to a female speaker (p<0.1). The ANOVA analysis is presented in the appendix as Table 4.

Table 1
ANOVA of listener sex, speaker sex, and relationship on duration of smile. '*' indicates p<0.05 and '**' indicates p<0.01.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)
sex_listener                   1    12.36     12.36      4.59   0.0339 *
sex_speaker                    1     1.29      1.29      0.48   0.4907
relationship                   3     4.18      1.39      0.52   0.6709
sex_listener * relationship    3    42.80     14.27      5.29   0.0017 **
sex_listener * sex_speaker     1     0.90      0.90      0.33   0.5652
sex_speaker * relationship     3     9.70      3.23      1.20   0.3123
Residuals                    144   388.03      2.69
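As a rough illustration of this analysis, the sketch below runs a type-III ANOVA with the same two-way interactions and a Tukey post hoc test using statsmodels. The file name and the column names (`duration`, `sex_listener`, `sex_speaker`, `relationship`) are hypothetical placeholders, not the released annotation format.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

smiles = pd.read_csv("bc_smiles.csv")   # hypothetical file: one row per annotated BC smile

# Type-III ANOVA with the two-way interactions of Table 1; sum-to-zero contrasts are
# needed for type-III sums of squares to be interpretable.
model = smf.ols(
    "duration ~ C(sex_listener, Sum) * C(relationship, Sum)"
    " + C(sex_listener, Sum) * C(sex_speaker, Sum)"
    " + C(sex_speaker, Sum) * C(relationship, Sum)",
    data=smiles,
).fit()
print(sm.stats.anova_lm(model, typ=3))

# Post hoc Tukey HSD over listener-sex x relationship cells.
cells = smiles["sex_listener"].astype(str) + ":" + smiles["relationship"].astype(str)
print(pairwise_tukeyhsd(smiles["duration"], cells))
```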
3.3. Effect of Context Cues

Our contextual cues were extracted from prosody and speech features independently derived from the turns of both the speaker and the listener just before the smile onset. Since the speaker's turn continues while the listener backchannels, speaker activity up to the onset of the smile was considered in this study. The audio was trimmed to the onset to obtain the corresponding contextual cues, and the Montreal Forced Aligner (MFA) [19] was used to extract the corresponding transcription information.

Figure 3: Regression slopes showing the effect of context cues on the intensity of BC smiles. A positive slope indicates the smile intensity increases with a given feature (and vice-versa for a negative slope). '*' indicates the slope is significant at p<0.05 and '⋅' indicates marginal significance at p<0.1.

Prosody cues: Our prosodic features consisted of some of the fundamental characteristics of speech, such as mean pitch during the turn, range of the pitch, and Root Mean Square (RMS) energy of the audio signal. These features were chosen because of their relevance (see related work) in BC behavior and also due to their ease of interpretation as well as their ability to convey various behavioral traits. For example, RMS energy conveys traits such as confidence, doubtfulness, and enthusiasm [20]. The prosodic features were obtained using the OpenSMILE [21] software.
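A minimal sketch of extracting such turn-level prosodic functionals with the opensmile Python wrapper is shown below; the choice of the eGeMAPS feature set and the clip file name are our assumptions, since the paper only states that the OpenSMILE software was used.

```python
import opensmile

# eGeMAPS functionals include pitch (F0) and energy/loudness statistics from which
# mean pitch, pitch range, and RMS-energy-like measures can be read; exact column
# names should be checked against the opensmile documentation.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
turn_features = smile.process_file("speaker_turn_trimmed_to_onset.wav")  # hypothetical clip
pitch_columns = [c for c in turn_features.columns if "F0" in c]
print(turn_features[pitch_columns].T)
```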
Speech cues: The spoken content of speaker and listener turns was also accounted for through variables from the Linguistic Inquiry and Word Count (LIWC) [22] framework. These variables were word count, usage of negations (no, not, never), comparisons (greater, best, after), interrogative words (how, when, what), valence of the turns (positive or negative emotion), and focus on events in the past, present and future.

A generalized linear model predicted the smile intensity from context cues and dyad demographics. Results using an inverse link function (model explained variance R² = 0.243) with the prosody and speech cues from the audio signal are presented in Figure 3. Note that the speakers' and listeners' context cues were Z-score normalized. Speaker characteristics such as sex and negations were found to be significant predictors of intensity. Female speakers elicited significantly narrower smiles from their listeners, but the speaker's usage of negations resulted in wider smiles. The speaker's loudness (RMS energy) had a marginally significant negative correlation with the smile intensity. Listener behavior also significantly impacted their BC smiles. The listener's usage of comparative words and their mean pitch in their preceding turn resulted in significantly narrower smiles. In contrast, their word count had a marginally significant positive correlation with intensity. A similar analysis for duration did not reveal any significant correlations.
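The sketch below illustrates one way to fit such a model with statsmodels, assuming a hypothetical DataFrame of Z-scored cues. The Gaussian family and the specific predictor column names are our assumptions; the paper only specifies that an inverse link function was used.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cues = pd.read_csv("context_cues.csv")  # hypothetical file: one row per BC smile, Z-scored cues

glm = smf.glm(
    "intensity ~ sex_speaker + negations_speaker + rms_energy_speaker"
    " + word_count_listener + comparisons_listener + mean_pitch_listener",
    data=cues,
    # Inverse link as stated in the text; the Gaussian family is an assumption.
    family=sm.families.Gaussian(link=sm.families.links.InversePower()),
).fit()
print(glm.summary())   # fitted slopes correspond to the bars in Figure 3
```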
4. Modeling Smiles

To automatically generate BC smile and non-smile activity in listeners, we use the audio from the speaker's current turn and the listener's last turn as input. 15 smiles were dropped due to difficulties in the preprocessing steps with MFA. The remaining 142 annotated smile instances were augmented with an equal number of non-smile instances. The non-smile instances were identified so that they were at least two seconds away from the onset of the closest smile instance, a strategy adopted from [23] for turn-taking prediction. The mean duration of smiling and non-smiling instances was ensured to be the same.
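A minimal sketch of this negative-instance sampling is given below; the function and variable names are illustrative, and uniform random sampling of candidate start times is our assumption beyond the stated two-second constraint.

```python
import numpy as np

def sample_non_smiles(smile_onsets, video_duration, n_samples,
                      clip_duration, min_gap=2.0, seed=0):
    """Return start times for non-smile clips at least `min_gap` s from any smile onset."""
    rng = np.random.default_rng(seed)
    onsets = np.asarray(smile_onsets, dtype=float)
    starts = []
    while len(starts) < n_samples:
        t = rng.uniform(0.0, video_duration - clip_duration)
        if np.all(np.abs(onsets - t) >= min_gap):   # keep only well-separated windows
            starts.append(t)
    return np.array(starts)

# e.g. one negative instance per annotated smile in a 10-minute video
negatives = sample_non_smiles(smile_onsets=[12.3, 80.1, 300.4],
                              video_duration=600.0, n_samples=3, clip_duration=3.2)
```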
Attention-based generative model: The generative model (Figure 4) for facial landmark prediction primarily consisted of an encoder and a decoder with a one-layer GRU each. Inputs to the model were embeddings from speaker and listener turns extracted using the pretrained vggish model [24]. We limited the input context length to turn durations of 60 seconds. The output context was limited to predicting one second of facial activity. The speaker vggish embeddings were used as input to the encoder. The hidden state of the GRU was initialized as the mean of the listener's turn embeddings.
Figure 4: Generative model architecture. Encoder input contains speech embeddings of listener and speaker from the
pretrained vggish model. The encoder’s final hidden state is concatenated with the conditioning vector and then used to
initialize the decoder’s hidden state. Decoder output landmarks are sequentially fed (dotted curves) to generate the next
landmarks in the output sequence.
The final hidden state of the encoder was concatenated with the conditioning vector, and a linear layer with ReLU activation was used to match the dimensionality of the decoder's hidden state. At each decoding step, attention [25] was applied between the encoder output and the decoder's last hidden state (Equation 1) to use as the input to the next step.

a(s_{t-1}, h_i) = v^T \tanh(W_a h_i + W_b s_{t-1})    (1)

where a(s_{t-1}, h_i) is the attention between the decoder's last hidden state (s_{t-1}) and the encoder output (h_i). The W matrices and v are linear layers.
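The sketch below is one plausible PyTorch realization of this encoder-decoder with the additive attention of Equation 1. The 128-dimensional vggish embeddings follow [24]; the hidden size, the 6-dimensional conditioning vector, the 136-dimensional landmark output, and the choice to feed the attended context concatenated with the previous output into the decoder are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BCSmileGenerator(nn.Module):
    """Encoder-decoder GRU with the additive attention of Eq. (1)."""
    def __init__(self, audio_dim=128, hidden=128, cond_dim=6, out_dim=136):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.bridge = nn.Sequential(nn.Linear(hidden + cond_dim, hidden), nn.ReLU())
        self.W_a = nn.Linear(hidden, hidden, bias=False)   # applied to encoder outputs h_i
        self.W_b = nn.Linear(hidden, hidden, bias=False)   # applied to decoder state s_{t-1}
        self.v = nn.Linear(hidden, 1, bias=False)
        self.decoder = nn.GRUCell(hidden + out_dim, hidden)
        self.head = nn.Linear(hidden, out_dim)             # landmark displacements per frame

    def forward(self, speaker_emb, listener_emb, cond, steps, targets=None, tf_prob=0.0):
        # Encoder over the speaker's vggish sequence; hidden state initialised with the
        # mean of the listener's turn embeddings (requires hidden == audio_dim here).
        h0 = listener_emb.mean(dim=1).unsqueeze(0)                   # (1, B, H)
        enc_out, h_n = self.encoder(speaker_emb, h0)
        # Final encoder state concatenated with the conditioning vector -> decoder init.
        s = self.bridge(torch.cat([h_n[-1], cond], dim=-1))
        prev = torch.zeros(speaker_emb.size(0), self.head.out_features,
                           device=speaker_emb.device)
        outputs = []
        for t in range(steps):
            # Eq. (1): a(s_{t-1}, h_i) = v^T tanh(W_a h_i + W_b s_{t-1})
            scores = self.v(torch.tanh(self.W_a(enc_out) + self.W_b(s).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)                     # attention over time
            context = (alpha * enc_out).sum(dim=1)                   # attended encoder input
            s = self.decoder(torch.cat([context, prev], dim=-1), s)
            out = self.head(s)
            outputs.append(out)
            # Previous output (or ground truth under teacher forcing) feeds the next step.
            use_gt = targets is not None and float(torch.rand(1)) < tf_prob
            prev = targets[:, t] if use_gt else out.detach()
        return torch.stack(outputs, dim=1)                           # (B, steps, out_dim)
```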
4.1. Implementation details

The videos were split into two vertical halves, one corresponding to each individual in the dyadic interaction. These were used for facial landmark extraction using the AFAR toolbox [17]. To account for various facial shapes, we normalized landmarks to the mean face of the dataset using the approach described in [26]. Because of the high degree of correlation between successive frames, frames were downsampled by a factor of three, to use every third frame. Displacement was then calculated as the difference between the landmarks from successive frames. These were further subjected to a min-max normalization to allow for individual differences in smiling dynamics. The normalized displacements were predicted using the attention-based generative model. The predicted frame-level displacements were incorporated into the last known listener facial expression to generate the sequence of facial landmarks recursively.
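A small sketch of this preprocessing and of the recursive reconstruction of landmark sequences from predicted displacements, assuming landmarks are stored as a (frames, points, 2) array (the array layout is our assumption):

```python
import numpy as np

def preprocess(landmarks):                      # landmarks: (frames, n_points, 2)
    lm = landmarks[::3]                         # downsample by a factor of three
    disp = np.diff(lm, axis=0)                  # displacement between successive frames
    lo, hi = disp.min(), disp.max()
    disp_norm = (disp - lo) / (hi - lo + 1e-8)  # per-clip min-max normalization
    return disp_norm, (lo, hi)

def reconstruct(last_face, disp_norm, lo, hi):
    """Recursively add de-normalized displacements to the last known listener face."""
    disp = disp_norm * (hi - lo) + lo
    frames = [last_face]
    for d in disp:
        frames.append(frames[-1] + d)
    return np.stack(frames)
```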
We enforced teacher forcing with simulated annealing during training and linearly decreased the likelihood of using the ground truth every 20 epochs. Stochastic Gradient Descent with a learning rate initialized at 1e-4, weight decay, and 0.99 momentum was used to minimize the Mean Squared Error (MSE) between predictions and the ground truth. The learning rate was halved when the validation loss plateaued for 20 consecutive epochs. Data was partitioned into a 75 (train), 15 (validation), and 15 (test) split in terms of the number of dyads. Models were trained for 250 epochs, and validation loss was used to determine the best model for testing. This was repeated 10 times to evaluate the statistical significance of differences against the baseline speaker-based BC generation setting.
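The optimization schedule could be sketched as below; the numbers follow the text where given, while the weight-decay value, the linear teacher-forcing decay of 0.1 every 20 epochs, and the data-loader and validation helpers are assumptions.

```python
import torch
from statistics import mean

model = BCSmileGenerator()                     # from the sketch after Eq. (1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.99, weight_decay=1e-4)  # weight-decay value assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=20)
criterion = torch.nn.MSELoss()

tf_prob = 1.0
for epoch in range(250):
    model.train()
    for speaker, listener, cond, target in train_loader:       # hypothetical DataLoader
        pred = model(speaker, listener, cond, steps=target.size(1),
                     targets=target, tf_prob=tf_prob)
        loss = criterion(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 20 == 0:
        tf_prob = max(0.0, tf_prob - 0.1)                       # anneal teacher forcing
    model.eval()
    with torch.no_grad():
        val_loss = mean(criterion(model(s, l, c, steps=t.size(1)), t).item()
                        for s, l, c, t in val_loader)           # hypothetical DataLoader
    scheduler.step(val_loss)                                    # halve LR on plateau
```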
Metrics: Objective measures of performance from gesture generation approaches, including Average Pose Error (APE) and Probability of Correct Keypoints (PCK), were adopted to quantify the generated landmarks against the ground truth from the AFAR toolbox. APE (Equation 2) is equivalent to the mean squared error between the predicted facial expression and the ground truth facial expression. PCK (Equation 3) is a proximity-based metric that considers a landmark to be correctly predicted if the difference with the ground truth falls below a margin. We report mean PCK for σ = 0.1 and 0.2.

APE = \frac{1}{k} \sum_{p=1}^{k} \lVert \hat{y}(p) - y(p) \rVert^2    (2)

where k is the number of landmarks, ŷ(p) is the prediction and y(p) is the ground truth.

PCK_\sigma = \frac{1}{k} \sum_{p=1}^{k} \delta(\lVert \hat{y}(p) - y(p) \rVert_2 \le \sigma)    (3)

where δ is an indicator function and σ is the margin.
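Equations 2 and 3 translate directly into a few lines of numpy; here `pred` and `gt` are assumed to hold the k predicted and ground-truth landmark coordinates of one frame as (k, 2) arrays.

```python
import numpy as np

def ape(pred, gt):
    """Average Pose Error, Eq. (2): mean squared landmark error."""
    return np.mean(np.sum((pred - gt) ** 2, axis=-1))

def pck(pred, gt, sigma=0.1):
    """Probability of Correct Keypoints, Eq. (3): fraction of landmarks within sigma."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return np.mean(dists <= sigma)
```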
4.2. Results

Table 2
Average Pose Error (APE) and Probability of Correct Keypoints (PCK) metrics for generated facial expressions under various experimental settings. A downward-facing arrow indicates that a lower value implies better generation. '*' indicates significance with p<0.05 and '⋅' indicates marginal significance with p<0.1.

Model                                            APE↓      PCK↑
Speaker only (Baseline)                          9.552     0.219
Speaker and Listener                             9.346⋅    0.220⋅
Speaker and Listener with Conditioning vector    9.279*    0.223*
Speaker and Conditioning vector                  9.615     0.218⋅

Using listener behavior and the conditioning vector together with the speaker behavior resulted in improved performance compared to the baseline speaker behavior-based prediction. As shown in Table 2, APE decreased by 0.273 points while PCK increased by 0.004; these gains were statistically significant. When listener behavior was added to the speaker behavior, marginally significant improvements were observed: APE reduced by 0.206 points while PCK increased by 0.001 points. These results reiterate our hypothesis that both speaker and listener contribute to BC behaviors. When speaker behavior was augmented with the conditioning vector, only nominal differences were observed against the baseline: APE increased by 0.063 points, and PCK decreased by 0.001.

To understand how the performance varies with different smiles, we predicted APE (and PCK) as a linear combination of duration, intensity, and the model configuration using a regression model. Results from Figure 5 show that duration significantly affects the PCK. Interestingly, the positive slope suggests that longer smiles are generated better than shorter smiles. Only a marginally significant effect of duration can be observed for APE. With the increase in the intensity of the smile, the generation performance decreases. This is significant for D-level and E-level smiles. Using listener features and the conditioning vector along with the speaker features improves the performance (negative and positive slopes for APE and PCK, respectively) compared to the baseline speaker-based generation. However, this effect is not statistically significant.

Figure 5: Effect of duration and intensity of smile along with ablation of inputs on generative model performance measured using APE (top) and PCK (bottom). S & C: speaker and conditioning vector, S & L: speaker and listener, and S, L & C: speaker, listener and conditioning vector as inputs to the model. '⋅', '*' and '***' indicate significance with p<0.1, p<0.05 and p<0.001 respectively.

Qualitative evaluation of ground truth landmarks from Figure 6 suggests the deficiencies of the existing facial landmark prediction approaches [17] in accurately tracking lip corners, both in the presence and absence of non-frontal head pose. While a visually noticeable difference can be observed as the smile evolves, the ground truth landmarks fail to capture the subtle lip corner motion. This limitation in the ground truth has resulted in nominal motion in the predicted landmarks. We also found that BC smiles that co-occur with vocal activity are challenging to predict. Figure 7 shows one example where the vertical distance between the upper and lower lips increases and decreases because of the simultaneous "yeah" utterance. However, the model fails to capture this vertical motion.

Metrics like APE and PCK provide an objective measure of the prediction. However, evaluating concepts such as realism and contextual relevance of the BC prediction requires subjective ratings from human evaluation. A convention in evaluating landmark or keypoint-based generative approaches is the human comparison of predicted keypoints against the ground truth [14, 27]. While this might work for problems such as gesture generation that involve a strong motion component, evaluating subtle behaviors like facial expressions using a similar strategy could be challenging. To address this concern, we leverage the emulated version of an embodied agent: Furhat [28].
Figure 6: Two sample smiles from the dataset showing their onsets (left-most frame to widest smile frame) and offsets (widest
smile frame to right-most frame). Note that while the evolution of smile is noticeable in ground truth landmarks (second
row) of the top smile, subtle changes between successive frames of the bottom smile are not captured by its ground truth
landmarks. This is also observed in the generated landmarks (third row). Zoom-in recommended. The faces used are from the
RealTalk dataset.
5. Smiles on an Embodied Agent

So far, we have shown how we model smiles by generating facial landmarks. However, users in real-world scenarios do not expect to see such abstract representations of faces. Aligning these facial landmarks with embodied agents is key for an interactable conversational agent. To achieve this, we describe the procedure to transfer generated landmarks to an embodied robotic simulation system called Furhat. We then conduct a user study for subjective perceived differences in Furhat's behavior due to the BC smile.

Figure 7: Limitation of the current approach in generating a bimodal backchannel smile. The frames highlighted in the red box correspond to the co-occurring verbal "yeah". Notice that ground truth landmarks (second row) fail to capture the vertical mouth movement. This is also observed in the generated landmarks (third row). Zoom-in recommended. The faces used are from the RealTalk dataset.

5.1. Emulation Setup

Furhat allows users to control facial expressions using a set of facial parameters called BasicParams³ (e.g., MOUTH_SMILE_LEFT and MOUTH_SMILE_RIGHT to control the left and the right lip corners; BROW_UP_LEFT and BROW_UP_RIGHT to control the left and right eyebrows, etc.). Our setup uses these parameters to enable the embodied agent's smile and express associated eyebrow actions. The landmarks from a generated smile expression were used to calculate the displacement between successive frames, normalized to the [0, 1] range. For eyebrows, only vertical displacement was used. Our inputs to the Furhat API consisted of the lip corner and eyebrow displacements corresponding to the frame with the widest smile (maximum horizontal displacement between the lip corners). The duration of the Furhat smile was set to the duration of the generated smile. Figure 8 shows an example of the resultant expression. The user study was conducted using the Furhat Desktop SDK; however, we do not foresee difficulties transferring the emulation setup to a physically embodied Furhat.

³ https://docs.furhat.io/remote-api/#python-remote-api
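A hedged sketch of driving the emulated Furhat through the Python Remote API referenced in footnote 3 is shown below. The custom-gesture payload follows the format shown in the Furhat documentation, but the exact field names, parameter scaling, and the simple two-frame timing here are assumptions that should be checked against the current API; the displacement values passed in are illustrative.

```python
from furhat_remote_api import FurhatRemoteAPI

furhat = FurhatRemoteAPI("localhost")   # Desktop SDK (virtual Furhat) running locally

def backchannel_smile(lip_left, lip_right, brow_left, brow_right, duration):
    """Play a smile whose peak uses the displacements of the widest-smile frame."""
    furhat.gesture(body={
        "class": "furhatos.gestures.Gesture",
        "frames": [
            {   # ramp up to the peak of the generated smile
                "time": [duration / 2.0],
                "params": {
                    "MOUTH_SMILE_LEFT": lip_left,
                    "MOUTH_SMILE_RIGHT": lip_right,
                    "BROW_UP_LEFT": brow_left,
                    "BROW_UP_RIGHT": brow_right,
                },
            },
            {"time": [duration], "params": {"reset": True}},   # relax back to neutral
        ],
    })

backchannel_smile(0.6, 0.6, 0.3, 0.3, duration=3.2)
```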
5.2. User Study Procedure

We conducted a small-scale user study in which participants watched two pre-recorded videos of the Furhat interacting with an individual. The videos differ only in terms of Furhat expressing a BC smile. In both interactions, Furhat starts with a brief introduction of itself, followed by a short question: "How have you been feeling over the last two weeks?". As the user responds, a smile is generated at the appropriate location (see Figure 8). We refer to this scenario as the backchannel setting. Another video of the same individual interacting with Furhat with no BC (non-backchannel) serves as our baseline. Seven graduate students then rated each video recording separately. Note that raters were not primed on the study's outcome, and no explicit instructions about smiles were given.

Figure 8: Four frames of an example Furhat robot emulation with different levels of smiles used as backchannels during the conversation in our user study.
To quantify the user's perception of Furhat interacting with an individual, the influence of the BC smile, the effect of its intensity and duration, and the users' willingness to interact with the agent were quantified through the following questions on a 5-point Likert scale (1: strongly agree, 5: strongly disagree).

1. The Furhat's smiles looked human-like.
2. The Furhat's smiles looked natural and friendly.
3. I would talk to this agent frequently.
4. I felt the brightness of Furhat's smiles was appropriate.
5. The Furhat was smiling for a longer or shorter duration than was expected.
6. I would feel comfortable talking to this agent about non-personal topics.
7. I would feel comfortable talking to this agent about personal topics.

In addition, open-ended feedback was also a part of the questionnaire. We believe these questions help identify some user-facing challenges in generating BC behaviors and how they influence users' attitudes to embodied agent-based dialogue systems for conversations related to mental health.

5.3. Results

Table 3
Number of responses that expressed moderate or strong agreement along various factors related to the BC smiles when interacting with Furhat with and without backchannel behaviors.

Question                       Backchannel   Non-backchannel
Human-like                     5             4
Natural                        6             6
Willing to interact            1             0
Appropriate brightness         3             5
Longer or shorter smiles       2             0
Personal conversations         1             1
Non-personal conversations     3             2

Table 3 shows that more users (5/7) expressed moderate or higher agreement that the Furhat agent with the BC smile was human-like than its counterpart without the BC smile (4/7). One user expressed interest in frequently interacting with the agent in the backchannel setting, while the lack of backchannels resulted in increased hesitancy among users in frequently using it. Three (out of 7) users found that the brightness of the BC smile was appropriate, while two found that the duration of the BC smile was longer or shorter than expected. While no difference was observed in terms of users' preference for Furhat for personal conversations based on the presence of the BC smile, more users (3/7) responded that they would use Furhat with BC smiles for non-personal conversations over Furhat without BC smiles (2/7).

6. Discussion

Our quantitative results suggest that both speaker and listener behavior are important in generating BC behavior. Using listener behavior together with the conditioning vector offered statistically significant improvements in performance when compared to the baseline speaker-only model. This effect was observed both in terms of APE and PCK. We also found that our attention-based generative model can predict low-intensity smiles better than high-intensity smiles. Our user study shows that more people find our agent human-like when it was able to express BC smiles. Participants prefer to interact with it over the agent with no BC smile capabilities for non-personal conversations. However, for intimate personal conversations, the presence of a BC smile did not sway their decision.
Some limitations of this work include the following. We employed an affordable measure of reliability for BC smile annotations by using a prediction model instead of a second human rater. A more robust approach would involve at least one more human annotator performing reliability annotations on a portion of the dataset. The statistical analysis also assumes that the smiles were independent of the individuals and dyads. However, a given individual typically produces multiple smiles. Grouping of smiles by factors such as individuals and dyads can be better modelled using a mixed-effects model (see the sketch after this paragraph). Our user study was designed to demonstrate the feasibility of transferring generated facial landmarks to an embodied agent, together with understanding perceived differences between interactions with and without BC smiles. An appropriate evaluation framework would include the user interacting with the agent, followed by a comparison of qualitative subjective ratings of user experience and quantified parameters (such as difference in turn duration, language usage, etc.) of the interaction with and without BC smiles. We believe such approaches provide a holistic evaluation to identify critical instances in the interaction. Lastly, we focused on BC smiles, leaving out other conventional signals such as vocal and headpose-based BCs, and how they are affected by the cues from the speaker and listener.
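For instance, a random-intercept model grouped by dyad could be fit with statsmodels as sketched below; the file and column names are hypothetical, and nesting by individual could additionally be modelled with variance components.

```python
import pandas as pd
import statsmodels.formula.api as smf

cues = pd.read_csv("context_cues.csv")        # hypothetical file: one row per BC smile

# Fixed effects for a few context cues, random intercept per dyad.
me = smf.mixedlm("intensity ~ sex_speaker + negations_speaker + word_count_listener",
                 data=cues, groups=cues["dyad_id"]).fit()
print(me.summary())
```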
7. Conclusion

To enable BCs in embodied agents for mental health applications, we proposed an annotated dataset of face-to-face conversations including topics related to mental health. Our statistical analysis showed that speaker gender, together with prosodic and linguistic cues from both speaker and listener turns, is a significant predictor of BC smile intensity. Using the significant predictors together with the speaker and listener behaviors to generate BC smiles offers significant improvements in terms of empirical metrics over the baseline speaker-centric generation.

We bridge the gap between conventional non-verbal behavior generation approaches, such as landmarks and poses, and their realization by showing that generated landmarks can be transferred to an embodied agent, thus creating the opportunity for evaluation with a human-like manifestation rather than a traditional evaluation that compares generated landmark (or keypoint) outputs. Our small-scale user study suggests that our Furhat agent that backchannels is more human-like and is more likely to attract users for non-personal interactions. In addition to these contributions, we also discussed some limitations in existing technology for generating accurate ground truth landmarks, through examples such as the failure to capture mouth movement in bimodal BCs, and how these limitations affect the generated outputs. We believe these limitations also serve as directions for future research. Our work serves as a baseline for computer scientists interested in behavior generation, and an attractive source of BC smiles for behavioral scientists to study the effect of context cues on BC smiles in intimate conversations.

8. Ethical Statement

We proposed a generative approach for backchannel smile production to enable naturalistic interactions with embodied AI agents for mental health dialogue. While our dataset offers diverse smiles from people in different interpersonal relationships, like many existing generative approaches, the choice of pretrained embeddings, the imbalance between males and females, the lack of male-male romantic relationships, and the lack of age and ethnicity information in the dataset might have resulted in biased generations. We also acknowledge that the use of embodied agents in such sensitive applications should undergo rigorous evaluations by technical and domain experts and regulatory bodies. In our work, we do not interpret embodied agents as a substitute for professionals in mental health or allied areas of healthcare, but as tools for them to better serve the community's demands. We believe that the advantages and limitations of embodied agents in mental health should be presented to the users and the healthcare experts to provide maximum benefits. The information used in this work is identified from a publicly available dataset. Also, special attention has been paid to privacy and copyright requirements for relevant images showing individual faces. The user study raters were voluntary participants, and the University of Pittsburgh IRB approved the data collection.

9. Acknowledgments

Bilalpur and Cohn were supported by the U.S. National Institutes of Health through award MH R01-096951. Zeinali was supported through the Khoury Distinguished Fellowship at Northeastern University.

References

[1] H. Modi, K. Orgera, A. Grover, Exploring barriers to mental health care in the U.S. (2022). doi:10.15766/rai_a3ewcf9p.
[2] S. Song, S. Jaiswal, L. Shen, M. Valstar, Spectral representation of behaviour primitives for depression analysis, IEEE Transactions on Affective Computing 13 (2020) 829–844.
[3] F. Ceccarelli, M. Mahmoud, Multimodal temporal machine learning for bipolar disorder and depression recognition, Pattern Analysis and Applications 25 (2022) 493–504.
[4] Y. Yang, C. Fairbairn, J. F. Cohn, Detecting depression severity from vocal prosody, IEEE Transactions on Affective Computing 4 (2012) 142–150.
[5] Z. Ambadar, J. F. Cohn, L. I. Reed, All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous, Journal of Nonverbal Behavior 33 (2009) 17–34.
[6] D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, et al., SimSensei Kiosk: A virtual human interviewer for healthcare decision support, in: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, 2014, pp. 1061–1068.
[7] D. Utami, T. Bickmore, Collaborative user responses in multiparty interaction with a couples counselor robot, in: 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE, 2019, pp. 294–303.
[8] N. Ward, W. Tsukahara, Prosodic features which cue back-channel responses in English and Japanese, Journal of Pragmatics 32 (2000) 1177–1207.
[9] S. Benus, A. Gravano, J. B. Hirschberg, The prosody of backchannels in American English (2007).
[10] R. Bertrand, G. Ferré, P. Blache, R. Espesser, S. Rauzy, Backchannels revisited from a multimodal perspective, in: Auditory-Visual Speech Processing, 2007, pp. 1–5.
[11] K. P. Truong, R. Poppe, I. de Kok, D. Heylen, A multimodal analysis of vocal and visual backchannels in spontaneous dialogs, in: INTERSPEECH, 2011, pp. 2973–2976.
[12] A. Gravano, J. Hirschberg, Backchannel-inviting cues in task-oriented dialogue, in: Tenth Annual Conference of the International Speech Communication Association, 2009.
[13] W. Wang, X. Alameda-Pineda, D. Xu, P. Fua, E. Ricci, N. Sebe, Every smile is unique: Landmark-guided diverse smile generation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7083–7092.
[14] W. Feng, A. Kannan, G. Gkioxari, C. L. Zitnick, Learn2Smile: Learning non-verbal interaction through observation, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 4131–4138.
[15] E. Ng, H. Joo, L. Hu, H. Li, T. Darrell, A. Kanazawa, S. Ginosar, Learning to listen: Modeling non-deterministic dyadic facial motion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20395–20405.
[16] S. Geng, R. Teotia, P. Tendulkar, S. Menon, C. Vondrick, Affective faces for goal-driven dyadic communication, arXiv preprint arXiv:2301.10939 (2023).
[17] I. O. Ertugrul, L. A. Jeni, W. Ding, J. F. Cohn, AFAR: A deep learning based tool for automated facial affect recognition, in: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–1.
[18] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, arXiv preprint arXiv:1904.05862 (2019).
[19] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, M. Sonderegger, Montreal Forced Aligner: Trainable text-speech alignment using Kaldi, in: Proc. Interspeech 2017, 2017, pp. 498–502. doi:10.21437/Interspeech.2017-1386.
[20] S. A. Memon, Acoustic correlates of the voice qualifiers: A survey, arXiv preprint arXiv:2010.15869 (2020).
[21] F. Eyben, M. Wöllmer, B. Schuller, openSMILE: the Munich versatile and fast open-source audio feature extractor, in: Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
[22] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Blackburn, The development and psychometric properties of LIWC2015, Technical Report, 2015.
[23] E. Ekstedt, G. Skantze, Voice activity projection: Self-supervised learning of turn-taking events, arXiv preprint arXiv:2205.09812 (2022).
[24] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al., CNN architectures for large-scale audio classification, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–135.
[25] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[26] S. Stoll, N. C. Camgöz, S. Hadfield, R. Bowden, Sign language production using neural machine translation and generative adversarial networks, in: Proceedings of the 29th British Machine Vision Conference (BMVC 2018), British Machine Vision Association, 2018.
[27] C. Ahuja, D. W. Lee, R. Ishii, L.-P. Morency, No gestures left behind: Learning relationships between spoken language and freeform gestures, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1884–1895.
[28] S. Al Moubayed, J. Beskow, G. Skantze, B. Granström, Furhat: a back-projected human-like robot head for multiparty human-machine interaction, in: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers, Springer, 2012, pp. 114–130.
10. Appendix
10.1. Distribution of Intensity and Duration of Smiles
Figure 9: Distribution of intensity and duration of BC smiles
in the annotated dataset. The spread of the histograms shows
the diversity of the annotated smiles.
Figure 9 shows the distribution of annotated Backchannel (BC) smiles in terms of their intensity and duration. The predicted intensity using the automated approach showed that over 50% of smiles were of B-level intensity, and fewer instances of high-intensity smiles (D and E-levels) were also present. The mean duration was 3.18 ± 1.71 seconds.
10.2. Effect of Sex and Relationship on Smile Intensity

Table 4
ANOVA of listener sex, speaker sex, and relationship on intensity of smile. '⋅' indicates significance at p<0.1.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)
sex_listener                   1     0.53      0.53      0.60   0.4417
sex_speaker                    1     2.93      2.93      3.31   0.0710 ⋅
relationship                   3     3.23      1.08      1.22   0.3055
sex_listener * relationship    3     2.00      0.67      0.75   0.5225
sex_listener * sex_speaker     1     0.10      0.10      0.11   0.7424
sex_speaker * relationship     3     3.15      1.05      1.19   0.3176
Residuals                    144   127.49      0.89
Note that the intensity of the smile differs marginally
by the speaker sex. It is not affected by other factors such
as relationship, listener sex and their interaction.