Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues

Maneesh Bilalpur1,∗, Mert Inan2, Dorsa Zeinali2, Jeffrey F. Cohn1 and Malihe Alikhani2
1 University of Pittsburgh, Pittsburgh, Pennsylvania, USA
2 Northeastern University, Boston, Massachusetts, USA

Abstract
Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations over topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Using cues from speech prosody and language along with the demographics of the speaker and listener, we found them to contain significant predictors of the intensity of backchannel smiles. Based on our findings, we introduce backchannel smile production in embodied agents as a generation problem. Our attention-based generative model suggests that listener information offers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. A user study in which generated smiles were transferred to an embodied agent suggests that an agent with backchannel smiles is perceived as more human-like and is an attractive alternative to an agent without backchannel smiles for non-personal conversations.

1. Introduction

Fewer than a third of the US population has sufficient access to mental health professionals [1]. This highlights the need for additional resources to help mental health professionals meet the community's demands. Problems like symptom detection and evaluating treatment efficacy have made great strides with AI [2, 3, 4], and the mental health community can greatly benefit from such AI interventions.

Figure 1: Overview of steps for backchannel smile generation in an embodied agent in a human-agent interaction: Speaker and listener (agent) turns are used to generate the listener's response facial expression as landmarks. The landmarks are then integrated with the embodied agent and added to the conversation flow (represented as a dotted arrow).

Embodied agent-based systems, owing to their multimodal behavioral capabilities, are a promising solution to support such mental health needs. However, the development of such systems presents numerous challenges. These include the scarcity of mental health-related datasets, limited access to domain experts for designing reliable and robust systems, and the ethical considerations crucial to their design and adaptation. Among such challenges, one aspect that stands out is the agent's ability to establish a common ground with users. Addressing this is particularly crucial when the
agent functions as a listener. Effective grounding in such Machine Learning for Cognitive and Mental Health Workshop scenarios relies heavily on multimodal non-verbal be- (ML4CMH), AAAI 2024, Vancouver, BC, Canada. haviors like backchannels. These subtle yet impactful ∗ Corresponding author. cues are pivotal in building rapport and understanding Envelope-Open mab623@pitt.edu (M. Bilalpur); inan.m@northeastern.edu (M. Inan); zeinali.d@northeastern.edu (D. Zeinali); between the user and the agent. Hence, understanding jeffcohn@pitt.edu (J. F. Cohn); m.alikhani@northeastern.edu and incorporating these behaviors into embodied agents (M. Alikhani) is not only challenging but also essential for creating a © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings supportive and empathetic environment for individuals and their physical realization by emulating the seeking mental health support. Addressing these chal- generated behavior with an embodied agent. lenges can pave the way for more effective, accessible, 5. Show that our BC smile generation yields appro- and empathetic digital mental health interventions. priate and natural-looking smiles through a user In dyadic conversations, at any given time one person study involving the embodied agent. may have the floor (i.e., is speaking) while the other is listening. Backchannels (BC) refer to behaviors of the Results suggest speaker sex, their use of negations, listener that do not interrupt the speaker. BCs signal loudness, word count in the listener’s turn, their usage of attention, agreement, and emotional response to what is comparisons, and mean pitch are significant predictors said. Inappropriate BC smiles such as ones that appear of BC smile intensity. Our generative approach shows too short or too long or for which the timing appears that taking listeners’ behavior into account improves “off” can disrupt the conversational rapport and result in performance, and adding the conditioning vector offers unsuccessful or disrupted conversations. Our objective significant improvements in terms of empirical metrics is to understand appropriate BC smiles from dyadic con- such as Average Pose Error (APE) and Probability of versations and how an embodied agent can employ them Correct Keypoints (PCK). when interacting with a human. Conversational agents typically realize BC smiles us- 2. Related Work ing rule-based systems, discriminative approaches, or sometimes simply mimicking the smiles of the speaker. Existing works have validated the efficacy of an agent- Mimicking, however, fails to generalize to situations that driven conversation in mental health dialogue and coun- require a contextually relevant smile. And rule-based seling situations. DeVault et al. [6], through their agent- and discriminative approaches offer limited coverage due based interviews for distress and trauma symptoms, to the diversity of smiles [5]. found that participants were comfortable interacting with We present a generative approach for BC smiles in the agent as well as sharing intimate information. Utami listeners to address these limitations and enable contextu- and Bickmore [7] used embodied agents for couples coun- ally relevant BC smiles in embodied agents. An overview seling. Participants reported significantly improved af- of the approach is presented in Figure 1. 
Unlike existing fect and intimacy with their partner and generally en- works that solely depend on speaker behavior for BC pro- joyed the agent-driven counseling session. Our work duction (see related work section), we use both speaker builds on this line of research to improve the BC capabil- and listener behaviors to study how they affect the in- ities of agents. tensity and duration of the BC smile. We use cues from Backchannel behaviors were traditionally produced prosody, language, and the demographics of dyads to using a set of predefined rules based on prosodic or lin- identify statistically significant predictors (referred to as guistic cues of the speaker. Both Ward and Tsukahara a conditioning vector) of smiles. In addition to the audio [8], Benus et al. [9] have found prosodic cues (particu- features from both interaction participants, we leverage larly pitch and its changes) to be reliable predictors for the conditioning vector in generating the BC smiles. In vocal BC occurrence. In contrast, we use prosody and this paper, we: linguistic cues from both speaker and listener to identify significant predictors of BC smiles. 1. Annotate backchannel smiles in a face-to-face In the multimodal context, Bertrand et al. [10] stud- interaction dataset1 of dyads that differ in their ied prosodic, morphological, and discourse markers for composition of biological sex and type of relation- their effect on vocal and gestural backchannels (hand ges- ship. tures, smiles, eyebrows), and Truong et al. [11] explored 2. Present our statistical analysis to identify vari- visual BCs by often limiting them to head nods and, at ous speaker and listener-specific cues that sig- times, grouping different BCs into the same category [12] nificantly predict the duration and intensity of without accounting for their intrinsic differences. They backchannel smiles. depended on the speaker’s behavior to identify the occur- 3. Generate backchannel smiles using an attention- rence and ignored the listener. In addition to leveraging based generative model that uses the listener and the listener behavior, we specifically study smiles because speaker turn features with the identified signifi- of their diversity and include both unimodal (visual) and cant predictors. bimodal (visual together with vocal activity) BC smiles. 4. Bridge the gap between the model-based genera- Wang et al. [13] introduced diversity in generated tion of non-verbal behaviors (as facial landmarks) smiles by conditioning on a specific class and sampling using a variational autoencoder. Learn2Smile [14] used 1 Data and code: https://github.com/bmaneesh/Generating-Context- the facial landmarks of the speaker to generate com- Sensitive-Backchannel-Smiles/ plete listener behavior by separately predicting the low- frequency (nods) and high-frequency (blinks) compo- nents of facial motion. Ng et al. [15] leverage the speaker and listener’s motion and speech features to predict the listener’s future motion information. Unlike earlier works that have been limited to facial expression genera- tion using landmarks, their usage of 3D Morphable Mod- els to define facial expressions offers a flexible solution to generate realistic facial expressions in the presence of diverse head orientations. These solutions focus on Figure 2: Distribution of speaker and listener sex across differ- the entire listener’s behavior and offer no insights about ent interpersonal relationships in annotated RealTalk dataset. 
Relationships are color-coded: siblings (pink), friends (orange), specific BC behaviors. Their integrations are also limited paternal (green), and romantic couple (grey). to 3D Morphable Models. The BC smiles produced in this work not only leverage the speaker and listener activity but also condition the generation on salient factors that were found to be signif- the 191 annotated smiles had an A-level or higher in- icant predictors of smile attributes – duration (the time tensity. One outlier smile was dropped because of the elapsed between the onset of a smile and its offset) and extremely long duration. The resultant 157 smiles, along intensity (maximum amplitude of a smile). Using an em- with their predicted intensity, were used in this work. bodied agent, we also bridge the gap between generated In addition to the video recordings at 25 fps and 720p landmarks and their physical realization. resolution, the dataset also contains speaker-identified turn-level text obtained through automatic transcription [18]. The individuals in the dyadic interaction occupied 3. Dataset fixed positions (left and right) in the videos. In this work, the biological sex of the participants was inferred from One of the primary challenges in studying non-verbal the videos. Videos where sex could not be established behavior in mental health interactions is access to an with confidence were discarded. appropriate dataset. Patient-therapist interactions or in- teractions with mental health professionals are access- restricted to protect the identifiable information of the 3.2. Effect of Sex and Relationship on individuals. As a result, we use a YouTube-based large- Smile Attributes scale dataset of face-to-face dyadic interactions–RealTalk Given various interpersonal relationships in the dataset [16]. The RealTalk dataset consists of individuals taking of individuals of both sexes, we compared the mean du- turns asking predefined, intimate questions about family, ration of backchannel smiles across the factors using dreams, relationships, illness, and mental health2 . We ANOVA (Table 1) with type-III sum of squares to account believe intimate conversations are among the closest ac- for imbalance between males and females. Two-way in- cessible alternatives to studying BC behaviors for mental teractions between sex, and sex and relationship were health applications. In this section, we elaborate on our also included. The ANOVA analysis suggests that the contributions in terms of the annotations for BC smiles duration of backchannel smiles differs significantly by and discuss how they differ by the demographics of the listener sex and the interaction effect of the listener sex dyads and features from the speaker and listener turn and relationship. A post hoc Tukey revealed that male preceding it. listeners, when interacting with their siblings (regardless of speaker sex), express longer BC smiles (p<0.05). 3.1. Annotating Backchannel Smiles Similarly, the intensity of smiles marginally differed We manually annotated 191 BC smiles from 48 (out of 692) by the speaker’s sex. The post hoc Tukey revealed that dyadic interactions in the RealTalk dataset. The dyads the smiles as a response to a male speaker are less in- comprised male and female participants from different tense than a female speaker (p<0.1). ANOVA analysis is ethnicities, and social relationships such as siblings, pa- presented in the appendix as Table 4. ternal, romantic, and fraternal. 
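The type-III ANOVA and Tukey post hoc analysis described in Section 3.2 can be sketched along the following lines with statsmodels. This is a minimal illustration only: the data frame, file name, and column names are hypothetical stand-ins for the annotation table, which is not reproduced in the paper.

```python
# Sketch: type-III ANOVA of smile duration on listener sex, speaker sex, and
# relationship (with two-way interactions), followed by a Tukey post hoc test.
# The data frame `smiles` and its column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

smiles = pd.read_csv("bc_smile_annotations.csv")  # hypothetical file

model = smf.ols(
    "duration ~ C(sex_listener) + C(sex_speaker) + C(relationship)"
    " + C(sex_listener):C(relationship)"
    " + C(sex_listener):C(sex_speaker)"
    " + C(sex_speaker):C(relationship)",
    data=smiles,
).fit()

# Type-III sums of squares account for the male/female imbalance (cf. Table 1).
# (For strict type-III tests, sum-to-zero contrasts should be set on the factors.)
print(anova_lm(model, typ=3))

# Post hoc Tukey HSD over listener sex x relationship groups.
groups = smiles["sex_listener"] + "/" + smiles["relationship"]
print(pairwise_tukeyhsd(smiles["duration"], groups, alpha=0.05))
```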
The smiles were nearly balanced across the different interpersonal relationships (see Figure 2). An automated facial expression prediction framework [17] was used to evaluate the reliability of the manual annotations. About 83% (i.e., 158 smiles) of

2 The original videos can be accessed from https://www.youtube.com/c/TheSkinDeep

Table 1
ANOVA of listener sex, speaker sex, and relationship on duration of smile. '*' indicates p<0.05 and '**' indicates p<0.01.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)
sex_listener                   1    12.36     12.36      4.59   0.0339 *
sex_speaker                    1     1.29      1.29      0.48   0.4907
relationship                   3     4.18      1.39      0.52   0.6709
sex_listener * relationship    3    42.80     14.27      5.29   0.0017 **
sex_listener * sex_speaker     1     0.90      0.90      0.33   0.5652
sex_speaker * relationship     3     9.70      3.23      1.20   0.3123
Residuals                    144   388.03      2.69

3.3. Effect of Context Cues

Our contextual cues were extracted from prosody and speech features independently derived from the turns of both the speaker and the listener just before the smile onset. Since the speaker's turn continues while the listener backchannels, speaker activity till the onset of the smiles was considered in this study. The audio was trimmed to the onset to obtain corresponding contextual cues, and the Montreal Forced Aligner (MFA) [19] was used to extract corresponding transcription information.

Prosody cues: Our prosodic features consisted of some of the fundamental characteristics of speech, such as mean pitch during the turn, range of the pitch, and Root Mean Square (RMS) energy of the audio signal. These features were chosen because of their relevance (see related work) in BC behavior and also due to the ease of interpretation as well as their ability to convey various behavioral traits. For example, RMS energy conveys traits such as confidence, doubtfulness, and enthusiasm [20]. Lastly, using the OpenSMILE [21] software, prosodic features were obtained.
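A turn-level prosody extraction of this kind can be approximated with the open-source OpenSMILE Python wrapper. The sketch below is a minimal example, assuming the eGeMAPS functional set as a stand-in for the exact (unreported) OpenSMILE configuration, and computing mean pitch, pitch range, and RMS energy for a turn trimmed to the smile onset.

```python
# Sketch: turn-level prosody cues (mean pitch, pitch range, RMS energy).
# Assumes the `opensmile` Python wrapper and librosa; the paper does not specify
# the exact OpenSMILE configuration or feature names that were used.
import librosa
import numpy as np
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # assumed feature set
    feature_level=opensmile.FeatureLevel.Functionals,   # one row per segment
)

def prosody_cues(wav_path, turn_start, smile_onset, sr=16000):
    """Extract prosodic cues from one turn, trimmed to the smile onset."""
    y, sr = librosa.load(wav_path, sr=sr, offset=turn_start,
                         duration=max(smile_onset - turn_start, 0.0))
    feats = smile.process_signal(y, sr).iloc[0]
    # Pitch statistics from the eGeMAPS F0 functionals (semitone scale).
    f0_cols = [c for c in feats.index if c.startswith("F0semitone")]
    mean_pitch = feats[[c for c in f0_cols if c.endswith("_amean")][0]]
    pitch_range = feats[[c for c in f0_cols if "pctlrange" in c][0]]
    # RMS energy computed directly from the waveform.
    rms_energy = float(np.sqrt(np.mean(y ** 2)))
    return {"mean_pitch": float(mean_pitch),
            "pitch_range": float(pitch_range),
            "rms_energy": rms_energy}
```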
Speech cues: The spoken content of speaker and listener turns was also accounted for through variables from the Linguistic Inquiry and Word Count (LIWC) [22] framework. These variables were word count, usage of negations (no, not, never), comparisons (greater, best, after), interrogative words (how, when, what), valence of the turns (positive or negative emotion), and focus on events in the past, present, and future.

A generalized linear model predicted the smile intensity from context cues and dyad demographics. Results using an inverse link function (model explained variance R^2 = 0.243) with the prosody and speech cues from the audio signal are presented as Figure 3. Note that the speakers' and listeners' context cues were Z-score normalized. Speaker characteristics such as sex and negations were found to be significant predictors of intensity. Female speakers elicited significantly narrower smiles from their listeners, but the speaker's usage of negations resulted in wider smiles. The speaker's loudness (RMS energy) had a marginally significant negative correlation with the smile intensity. Listener behavior also significantly impacted their BC smiles. Using comparative words by the listener and their mean pitch in their preceding turn resulted in significantly narrower smiles. In contrast, their word count had a marginally significant positive correlation with intensity. A similar analysis for duration did not reveal any significant correlations.

Figure 3: Regression slopes showing the effect of context cues on the intensity of BC smiles. A positive slope indicates the smile intensity increases with a given feature (vice-versa for a negative slope). '*' indicates the slope is significant at p<0.05 and '⋅' indicates marginal significance at p<0.1.

4. Modeling Smiles

To automatically generate BC smile and non-smile activity in listeners, we use the audio from the speaker's current turn and the listener's last turn as input. 15 smiles were dropped due to difficulties in the preprocessing steps with MFA. The remaining 142 annotated smile instances were augmented with an equal number of non-smile instances. The non-smile instances were identified so that they were at least two seconds away from the onset of the closest smile instance, a strategy adopted from [23] for turn-taking prediction. The mean duration of smiling and non-smiling instances was ensured to be the same.

Attention-based generative model: The generative model (Figure 4) for facial landmark prediction primarily consisted of an encoder and a decoder with a one-layer GRU each. Inputs to the model were embeddings from speaker and listener turns extracted using the pretrained vggish model [24]. We limited the input context length to use turn durations of 60 seconds. The output context was limited to predicting one second of facial activity. The speaker vggish embeddings were used as input to the encoder. The hidden state of the GRU was initialized as the mean of the listener's turn embeddings. The final hidden state of the encoder was concatenated with the conditioning vector, and a linear layer with ReLU activation was used to match the dimensionality of the decoder's hidden state. At each decoding step, attention [25] was applied between the encoder output and the decoder's last hidden state (Equation 1) to use as the input to the next step.

a(s_{t-1}, h_i) = v^{T} \tanh(W_a h_i + W_b s_{t-1})    (1)

where a(s_{t-1}, h_i) is the attention between the decoder's last hidden state (s_{t-1}) and the encoder output (h_i); W_a, W_b, and v are linear layers.

Figure 4: Generative model architecture. Encoder input contains speech embeddings of listener and speaker from the pretrained vggish model. The encoder's final hidden state is concatenated with the conditioning vector and then used to initialize the decoder's hidden state. Decoder output landmarks are sequentially fed (dotted curves) to generate the next landmarks in the output sequence.
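As a concrete illustration of this architecture and of Equation 1, the PyTorch sketch below wires together a one-layer GRU encoder and decoder, the conditioning-vector bridge, and the additive attention. All sizes (embedding, hidden, conditioning, and landmark dimensions) are illustrative assumptions; the authors' released code (linked in the data/code footnote) remains the reference implementation.

```python
# Minimal sketch (PyTorch) of the GRU encoder-decoder with additive attention (Eq. 1).
# Sizes (128-d turn embeddings, 6-d conditioning vector, 98-d landmark displacements)
# are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class BCSmileGenerator(nn.Module):
    def __init__(self, emb_dim=128, cond_dim=6, lmk_dim=98):
        super().__init__()
        hid = emb_dim  # hidden size matches the embedding size so the mean of the
                       # listener embeddings can initialize the encoder hidden state
        self.encoder = nn.GRU(emb_dim, hid, num_layers=1, batch_first=True)
        self.decoder = nn.GRU(lmk_dim + hid, hid, num_layers=1, batch_first=True)
        self.bridge = nn.Sequential(nn.Linear(hid + cond_dim, hid), nn.ReLU())
        # Additive attention: a(s_{t-1}, h_i) = v^T tanh(W_a h_i + W_b s_{t-1})
        self.W_a = nn.Linear(hid, hid, bias=False)
        self.W_b = nn.Linear(hid, hid, bias=False)
        self.v = nn.Linear(hid, 1, bias=False)
        self.out = nn.Linear(hid, lmk_dim)

    def attend(self, enc_out, s_prev):
        # enc_out: (B, T, H); s_prev: (B, H) -> context vector (B, H)
        scores = self.v(torch.tanh(self.W_a(enc_out) + self.W_b(s_prev).unsqueeze(1)))
        return (torch.softmax(scores, dim=1) * enc_out).sum(dim=1)

    def forward(self, speaker_emb, listener_emb, cond, n_steps):
        # speaker_emb: (B, T_s, E) speaker-turn embeddings (encoder input)
        # listener_emb: (B, T_l, E) listener-turn embeddings; cond: (B, cond_dim)
        h0 = listener_emb.mean(dim=1).unsqueeze(0)              # init encoder hidden
        enc_out, h_enc = self.encoder(speaker_emb, h0)
        s = self.bridge(torch.cat([h_enc[-1], cond], dim=-1))   # init decoder hidden
        y = speaker_emb.new_zeros(speaker_emb.size(0), self.out.out_features)
        outputs = []
        for _ in range(n_steps):                                # one step per frame
            ctx = self.attend(enc_out, s)
            _, s_new = self.decoder(torch.cat([y, ctx], dim=-1).unsqueeze(1),
                                    s.unsqueeze(0))
            s = s_new[-1]
            y = self.out(s)                                     # landmark displacements
            outputs.append(y)
        return torch.stack(outputs, dim=1)                      # (B, n_steps, lmk_dim)
```

During training, teacher forcing with simulated annealing (Section 4.1) would replace y with the ground-truth displacement at a probability that decays over epochs.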
4.1. Implementation details

The videos were split into two vertical halves, one corresponding to each individual in the dyadic interaction. These were used for facial landmark extraction using the AFAR toolbox [17]. To account for various facial shapes, we normalized landmarks to the mean face of the dataset using the approach described in [26]. Because of the high degree of correlation between successive frames, frames were downsampled by a factor of three, to use every third frame. Displacement was then calculated as the difference between the landmarks from successive frames. These were further subjected to a min-max normalization to allow for individual differences in smiling dynamics. The normalized displacements were predicted using the attention-based generative model. The predicted frame-level displacements were incorporated into the last known listener facial expression to generate the sequence of facial landmarks recursively.

We enforced teacher-forcing with simulated annealing during training and linearly decreased the likelihood of using ground truth every 20 epochs. Stochastic Gradient Descent with a learning rate initialized at 1e-4, weight decay, and 0.99 momentum was used to minimize the Mean Squared Error (MSE) between predictions and the ground truth. The learning rate was halved when validation loss plateaued for 20 consecutive epochs. Data was partitioned into a 75 (train), 15 (validation), and 15 (test) split in terms of the number of dyads. Models were trained for 250 epochs, and validation loss was used to determine the best model for testing. This was repeated 10 times to evaluate the statistical significance of differences against the baseline speaker-based BC generation setting.

Metrics: Objective measures of performance from gesture generation approaches, including Average Pose Error (APE) and Probability of Correct Keypoints (PCK), were adopted to quantify the generated landmarks against the ground truth from the AFAR toolbox. APE (Equation 2) is equivalent to the mean squared error between the predicted facial expression and the ground truth facial expression. PCK (Equation 3) is a proximity-based metric that considers a landmark to be correctly predicted if its difference with the ground truth falls below a margin. We report mean PCK for \sigma = 0.1 and 0.2.

APE = \frac{1}{k} \sum_{p=1}^{k} \lVert \hat{y}(p) - y(p) \rVert_2    (2)

where k is the number of landmarks, \hat{y}(p) is the prediction, and y(p) is the ground truth.

PCK_{\sigma} = \frac{1}{k} \sum_{p=1}^{k} \delta\big(\lVert \hat{y}(p) - y(p) \rVert_2 \le \sigma\big)    (3)

where \delta is an indicator function and \sigma is the margin.

4.2. Results

Using listener behavior and the conditioning vector together with the speaker behavior resulted in improved performance compared to the baseline speaker behavior-based prediction. As shown in Table 2, APE decreased by 0.273 points while PCK increased by 0.004; these gains were statistically significant. When listener behavior was added to the speaker behavior, marginally significant improvements were observed. APE reduced by 0.206 points while PCK increased by 0.001 points. These reiterate our hypothesis that both speaker and listener contribute to BC behaviors. When speaker behavior was augmented with the conditioning vector, only nominal differences were observed against the baseline. APE increased by 0.063 points, and PCK decreased by 0.001.

Table 2
Average Pose Error (APE) and Probability of Correct Keypoints (PCK) metrics for generated facial expressions under various experimental settings. A downward-facing arrow indicates a lower value implies better generation. '*' indicates significance with p<0.05 and '⋅' indicates marginal significance with p<0.1.

Model                                            APE↓      PCK↑
Speaker only (Baseline)                          9.552     0.219
Speaker and Listener                             9.346⋅    0.220⋅
Speaker and Listener with Conditioning vector    9.279*    0.223*
Speaker and Conditioning vector                  9.615     0.218⋅

Figure 5: Effect of duration and intensity of smile along with ablation of inputs on generative model performance measured using APE (top) and PCK (bottom). S & C-speaker and conditioning vector, S & L-speaker and listener, and S, L & C-speaker

To understand how the performance varies with different smiles, we predicted APE (and PCK) as a linear combination of duration, intensity, and the model configuration using a regression model. Results from Figure 5 show that duration significantly affects the PCK. Interestingly, the positive slope suggests that longer smiles are generated better than shorter smiles. Only a marginally significant effect of duration can be observed for APE. With the increase in the intensity of the smile, the generation performance decreases.
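For completeness, the two metrics in Equations 2 and 3 can be computed per frame as in the short NumPy sketch below; treating the norm as the Euclidean distance per landmark is an assumption consistent with the equations above.

```python
# Sketch: Average Pose Error (Eq. 2) and PCK (Eq. 3) for one frame of k landmarks.
# pred and gt are (k, 2) arrays of landmark coordinates; sigma is the PCK margin.
import numpy as np

def ape(pred, gt):
    """Mean L2 distance between predicted and ground-truth landmarks."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pck(pred, gt, sigma=0.1):
    """Fraction of landmarks whose L2 error falls within the margin sigma."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1) <= sigma))

# Example: averaging over the frames of a generated smile, as reported in Table 2.
# frames_pred, frames_gt: (n_frames, k, 2)
# mean_ape = np.mean([ape(p, g) for p, g in zip(frames_pred, frames_gt)])
```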
This is significant for and listener and conditioning vector as inputs to the model. D-level and E-level smiles. Using listener features and ‘⋅’, ‘*’ and ‘***’ indicate significance with p <0.1, p <0.05 and p the conditioning vector along with the speaker features <0.001 respectively. improves the performance (negative and positive slopes for APE and PCK, respectively) compared to the baseline speaker-based generation. However, this effect is not utterance. However, the model fails to capture this verti- statistically significant. cal motion. Qualitative evaluation of ground truth landmarks from Metrics like APE and PCK provide an objective mea- Figure 6 suggest the deficiencies of the existing facial sure of the prediction. However, evaluating concepts landmark prediction approaches [17] to accurately track such as realism and contextual relevance of the BC predic- lip corners both in the presence and absence of non- tion requires subjective ratings from human evaluation. frontal head pose. While a visually noticeable difference A convention in evaluating landmark or keypoint-based can be observed as the smile evolves, the ground truth generative approaches is the human comparison of pre- landmarks fail to capture the subtle lip corner motion. dicted keypoints against the ground truth [14, 27]. While This limitation in the ground truth has resulted in nom- this might work for problems such as gesture genera- inal motion in the predicted landmarks. We also found tion that involve a strong motion component, evaluating that BC smiles that co-occur with vocal activity are chal- subtle behaviors like facial expressions using a similar lenging to predict. Figure 7 shows one example where strategy could be challenging. To address this concern, the vertical distance between the upper and lower lips we leverage the emulated version of an embodied agent: increases and decreases because of the simultaneous yeah Furhat [28]. Figure 6: Two sample smiles from the dataset showing their onsets (left-most frame to widest smile frame) and offsets (widest smile frame to right-most frame). Note that while the evolution of smile is noticeable in ground truth landmarks (second row) of the top smile, subtle changes between successive frames of the bottom smile are not captured by its ground truth landmarks. This is also observed in the generated landmarks (third row). Zoom-in recommended. The faces used are from the RealTalk dataset. 5.1. Emulation Setup Furhat allows users to control facial expressions using a set of facial parameters called BasicParams3 (ex. MOUTH_SMILE_LEFT and MOUTH_SMILE_RIGHT to control the left and the right lip corners; BROW_UP_LEFT, BROW_UP_RIGHT to control the left and right eyebrows, etc.). Our setup uses these parameters to enable the embodied agent’s smile and Figure 7: Limitation of the current approach in generating express associated eyebrow actions. The landmarks a bimodal backchannel smile. The frames highlighted in red from a generated smile expression were used to calculate box correspond to the co-occurring verbal “yeah”. Notice that ground truth landmarks (second row) fail to capture the verti- the displacement between successive frames and nor- cal mouth movement. This is also observed in the generated malized to the [0, 1] range. For eyebrows, only vertical landmarks (third row). Zoom-in recommended. The faces displacement was used. Our inputs to the Furhat API used are from the RealTalk dataset. 
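As a rough illustration of this emulation step, the snippet below builds a custom Furhat gesture from generated lip-corner and brow displacements using the BasicParams named above. It assumes the furhat-remote-api Python client (see footnote 3) and its custom-gesture payload; the frame timing and parameter scaling are illustrative rather than the authors' exact mapping.

```python
# Sketch: sending a generated BC smile to Furhat as a custom gesture.
# Assumes the `furhat-remote-api` Python client; payload format and scaling here
# are illustrative, not the authors' exact setup.
from furhat_remote_api import FurhatRemoteAPI

furhat = FurhatRemoteAPI("localhost")  # Furhat Desktop SDK / robot address

def backchannel_smile(lip_left, lip_right, brow_left, brow_right, duration_s):
    """Raise lip corners and brows to the peak-frame displacements, then relax."""
    gesture = {
        "class": "furhatos.gestures.Gesture",
        "name": "BCSmile",
        "frames": [
            {   # peak of the smile at half of the generated duration
                "time": [duration_s / 2.0],
                "persist": False,
                "params": {
                    "MOUTH_SMILE_LEFT": lip_left,    # displacements normalized to [0, 1]
                    "MOUTH_SMILE_RIGHT": lip_right,
                    "BROW_UP_LEFT": brow_left,
                    "BROW_UP_RIGHT": brow_right,
                },
            },
            {   # return to neutral at the smile offset
                "time": [duration_s],
                "params": {"reset": True},
            },
        ],
    }
    furhat.gesture(body=gesture)

# Example: peak displacements taken from the widest-smile frame of a generated sequence.
backchannel_smile(0.7, 0.7, 0.3, 0.3, duration_s=2.4)
```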
consisted of the lip corner and eyebrow displacements corresponding to the frame with the widest smile (maximum horizontal displacement between the lip 5. Smiles on an Embodied Agent corners). The duration of the Furhat smile was set to the duration of the generated smile. Figure 8 shows an So far, we have shown modeling smiles by generating example of the resultant expression. The user study was facial landmarks. However, users in real-world scenarios conducted using the Furhat Desktop SDK. However, we do not expect to see such abstract representations of do not foresee difficulties transferring the emulation faces. Aligning these facial landmarks with embodied setup to a physically embodied Furhat. agents is key for an interactable conversational agent. To achieve this, we describe the procedure to transfer 5.2. User Study Procedure generated landmarks to an embodied robotic simulation system called Furhat. We then conduct a user study for We conducted a small-scale user study of participants subjective perceived differences in Furhat’s behavior due watching two pre-recorded videos of the Furhat interact- to BC smile. ing with an individual. They differ only in terms of Furhat expressing a BC smile. In both interactions, Furhat starts 3 https://docs.furhat.io/remote-api/#python-remote-api Table 3 Number of responses that expressed moderate or strong agree- ment along various factors related to the BC smiles when interacting with Furhat with and without backchannel behav- iors. Question Backchannel Non-backchannel Figure 8: Four frames of an example Furhat robot emulation Human-like 5 4 with different levels of smiles used as backchannels during Natural 6 6 the conversation in our user study. Willing to interact 1 0 Appropriate brightness 3 5 Longer or shorter smiles 2 0 Personal conversations 1 1 Non-personal conversations 3 2 with a brief introduction of itself, followed by a short question–“How have you been feeling over the last two weeks?”. As the user responds, a smile is generated at that the brightness of the BC smile was appropriate while the appropriate location (see Figure 8). We refer to this two found that the duration of BC smile was longer or scenario as the backchannel setting. Another video of shorter than expected. While no difference was observed the same individual interacting with Furhat with no BC in terms of users’ preference for Furhat for personal con- (non-backchannel) serves as our baseline. Seven gradu- versations based on the presence of the BC smile, more ate students then rated each video recording separately. users (3/7) responded that they would use Furhat with Note that raters were not primed on the study’s outcome, BC smiles for non-personal conversations over Furhat and no explicit instructions about smiles were given. without BC smiles (2/7). To quantify the user’s perception of Furhat interacting with an individual, the influence of BC smile in addition to the effect of its intensity and duration, and their will- 6. Discussion ingness to interact with one was quantified through the following questions on a 5-point Likert scale (1: strongly Our quantitative results suggest that both speaker and lis- agree, 5: strongly disagree). tener behavior are important in generating BC behavior. Using listener behavior together with the conditioning 1. The Furhat’s smiles looked human-like. vector offered statistically significant improvements in 2. The Furhat’s smiles looked natural and friendly. performance when compared to the baseline speaker- 3. 
I would talk to this agent frequently. only model. This effect was observed both in terms of 4. I felt the brightness of Furhat’s smiles was appro- APE and PCK. We also found that our attention-based priate. generative model can predict low-intensity smiles better 5. The Furhat was smiling for longer or shorter du- than high-intensity smiles. Our user study shows that ration than it was expected. more people find our agent human-like when it was able 6. I would feel comfortable talking to this agent to express BC smiles. Participants prefer to interact with about non-personal topics. it over the agent with no BC smile capabilities for non- 7. I would feel comfortable talking to this agent personal conversations. However, for intimate personal about personal topics. conversations, the presence of a BC smile did not sway their decision. In addition, open-ended feedback was also a part of the Some limitations of this work include the following. questionnaire. We believe these questions help identify We employed an affordable measure of reliability for BC some user-facing challenges in generating BC behav- smile annotations using a prediction model over a hu- iors and how they influence users’ attitudes to embodied man rater. A robust approach would involve at least one agent-based dialogue systems for conversations related more human annotator to perform reliability annotations to mental health. on a portion of the dataset. The statistical analysis also assumes that the smiles were independent of the individ- 5.3. Results uals and dyads. However, a given individual typically produces multiple smiles. Grouping of smiles by factors Table 3 shows that more users (5/7) expressed moderate such as individuals and dyads can be better modelled us- or higher agreement that the Furhat agent with BC smile ing a mixed-effects model. Our user study was designed was human-like than its counterpart without BC smile to demonstrate the feasibility of transferring generated (4/7). One user expressed interest in frequently interact- facial landmarks to an embodied agent together with un- ing with the agent in backchannel setting while the lack derstanding perceived differences between interactions of backchannels resulted in increased hesitancy among with and without BC smiles. An appropriate evaluation users in frequently using it. Three (out of 7) users found framework would include the user interacting with the romantic relationships, and lack of age and ethnicity in- agent. Followed by a comparison of qualitative subjec- formation in the dataset might have resulted in biased tive ratings of user experience and quantified parameters generations. We also acknowledge that using embodied (such as difference in turn duration, language usage, etc.)agents in such sensitive applications should undergo rig- of the interaction with and without BC smiles. We believe orous evaluations by technical and domain experts and such approaches provide a holistic evaluation to identify regulatory bodies. In our work, we do not interpret em- critical instances in the interaction. Lastly, we focused on bodied agents as a substitute for professionals in mental BC smiles leaving out other conventional signals such as health or allied areas of healthcare but to provide tools vocal and headpose-based BCs, and how they are affected for them to better serve the community’s demands. We by the cues from the speaker and listener. 
believe that the advantages and limitations of embod- ied agents in mental health should be presented to the users and the healthcare experts to provide maximum 7. Conclusion benefits. The information used in this work is identified from a publicly available dataset. Also, special attention To enable BCs in embodied agents for mental health has been paid to privacy and copyright requirements for applications, we proposed an annotated dataset of face- relevant images showing individual faces. The user study to-face conversations including topics related to mental raters were voluntary participants, and the University of health. Our statistical analysis showed that speaker gen- Pittsburgh IRB approved the data collection. der together with prosodic and linguistic cues from both speaker and listener turns are significant predictors of the BC smile intensity. Using the significant predictors 9. Acknowledgments together with the speaker and listener behaviors to gen- erate BC smiles offers significant improvements in terms Bilalpur and Cohn were supported by the U.S. National In- of empirical metrics over the baseline speaker-centric stitutes of Health through award MH R01-096951. Zeinali generation. was supported through the Khoury Distinguished Fel- We bridge the gap between conventional non-verbal lowship at Northeastern University. behavior generation approaches such as landmarks and poses and their realization by showing that generated landmarks can be transferred to an embodied agent. Thus References creating the opportunity for evaluation with a human- [1] H. Modi, K. Orgera, A. Grover, Exploring barriers to like manifestation over a traditional evaluation by com- mental health care in the u.s. (2022). doi:10.15766/ paring generated landmark (or keypoint) outputs. Our rai_a3ewcf9p . small-scale user study suggests our Furhat agent that [2] S. Song, S. Jaiswal, L. Shen, M. Valstar, Spectral rep- backchannels is more human-like and are more likely to resentation of behaviour primitives for depression attract users for non-personal interactions. In addition analysis, IEEE Transactions on Affective Comput- to these contributions, we also discussed some limita- ing 13 (2020) 829–844. tions in existing technology towards generating accurate [3] F. Ceccarelli, M. Mahmoud, Multimodal temporal ground truth landmarks through examples such as failure machine learning for bipolar disorder and depres- to capture mouth movement in bimodal BCs and how sion recognition, Pattern Analysis and Applications they affect the generated outputs. We believe these limi- 25 (2022) 493–504. tations also serve as directions for future research. Our [4] Y. Yang, C. Fairbairn, J. F. Cohn, Detecting depres- work serves as a baseline for computer scientists inter- sion severity from vocal prosody, IEEE transactions ested in behavior generation, and an attractive source of on affective computing 4 (2012) 142–150. BC smiles for behavioral scientists to study the effect of [5] Z. Ambadar, J. F. Cohn, L. I. Reed, All smiles are not context cues on BC smiles in intimate conversations. created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/n- 8. Ethical Statement ervous, Journal of nonverbal behavior 33 (2009) 17–34. We proposed a generative approach for backchannel [6] D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, smile production to enable naturalistic interactions with A. Gainer, K. Georgila, J. Gratch, A. Hartholt, embodied AI agents for mental health dialogue. While M. 
Lhommet, et al., Simsensei kiosk: A virtual hu- our dataset offers diverse smiles from people in different man interviewer for healthcare decision support, in: interpersonal relationships, like many existing genera- Proceedings of the 2014 international conference tive approaches, the choice of pretrained embeddings, on Autonomous agents and multi-agent systems, imbalance between males and females, lack of male-male 2014, pp. 1061–1068. [7] D. Utami, T. Bickmore, Collaborative user responses (2020). in multiparty interaction with a couples counselor [21] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the robot, in: 2019 14th ACM/IEEE International Con- munich versatile and fast open-source audio fea- ference on Human-Robot Interaction (HRI), IEEE, ture extractor, in: Proceedings of the 18th ACM 2019, pp. 294–303. international conference on Multimedia, 2010, pp. [8] N. Ward, W. Tsukahara, Prosodic features which 1459–1462. cue back-channel responses in english and japanese, [22] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Black- Journal of pragmatics 32 (2000) 1177–1207. burn, The development and psychometric proper- [9] S. Benus, A. Gravano, J. B. Hirschberg, The prosody ties of LIWC2015, Technical Report, 2015. of backchannels in american english (2007). [23] E. Ekstedt, G. Skantze, Voice activity projec- [10] R. Bertrand, G. Ferré, P. Blache, R. Espesser, tion: Self-supervised learning of turn-taking events, S. Rauzy, Backchannels revisited from a multimodal arXiv preprint arXiv:2205.09812 (2022). perspective, in: Auditory-visual Speech Processing, [24] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, 2007, pp. 1–5. A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. [11] K. P. Truong, R. Poppe, I. de Kok, D. Heylen, A mul- Saurous, B. Seybold, et al., Cnn architectures for timodal analysis of vocal and visual backchannels large-scale audio classification, in: 2017 ieee inter- in spontaneous dialogs., in: INTERSPEECH, 2011, national conference on acoustics, speech and signal pp. 2973–2976. processing (icassp), IEEE, 2017, pp. 131–135. [12] A. Gravano, J. Hirschberg, Backchannel-inviting [25] D. Bahdanau, K. Cho, Y. Bengio, Neural machine cues in task-oriented dialogue, in: Tenth Annual translation by jointly learning to align and translate, Conference of the International Speech Communi- arXiv preprint arXiv:1409.0473 (2014). cation Association, 2009. [26] S. Stoll, N. C. Camgöz, S. Hadfield, R. Bowden, Sign [13] W. Wang, X. Alameda-Pineda, D. Xu, P. Fua, E. Ricci, language production using neural machine transla- N. Sebe, Every smile is unique: Landmark-guided tion and generative adversarial networks, in: Pro- diverse smile generation, in: Proceedings of the ceedings of the 29th British Machine Vision Con- IEEE Conference on Computer Vision and Pattern ference (BMVC 2018), British Machine Vision Asso- Recognition, 2018, pp. 7083–7092. ciation, 2018. [14] W. Feng, A. Kannan, G. Gkioxari, C. L. Zit- [27] C. Ahuja, D. W. Lee, R. Ishii, L.-P. Morency, No ges- nick, Learn2smile: Learning non-verbal interac- tures left behind: Learning relationships between tion through observation, in: 2017 IEEE/RSJ In- spoken language and freeform gestures, in: Find- ternational Conference on Intelligent Robots and ings of the Association for Computational Linguis- Systems (IROS), IEEE, 2017, pp. 4131–4138. tics: EMNLP 2020, 2020, pp. 1884–1895. [15] E. Ng, H. Joo, L. Hu, H. Li, T. Darrell, A. Kanazawa, [28] S. Al Moubayed, J. Beskow, G. Skantze, S. Ginosar, Learning to listen: Modeling non- B. 
Granström, Furhat: a back-projected human-like robot head for multiparty human-machine interaction, in: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers, Springer, 2012, pp. 114–130.
deterministic dyadic facial motion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20395–20405.
[16] S. Geng, R. Teotia, P. Tendulkar, S. Menon, C. Vondrick, Affective faces for goal-driven dyadic communication, arXiv preprint arXiv:2301.10939 (2023).
[17] I. O. Ertugrul, L. A. Jeni, W. Ding, J. F. Cohn, AFAR: A deep learning based tool for automated facial affect recognition, in: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–1.
[18] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, arXiv preprint arXiv:1904.05862 (2019).
[19] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, M. Sonderegger, Montreal Forced Aligner: Trainable text-speech alignment using Kaldi, in: Proc. Interspeech 2017, 2017, pp. 498–502. doi:10.21437/Interspeech.2017-1386.
[20] S. A. Memon, Acoustic correlates of the voice qualifiers: A survey, arXiv preprint arXiv:2010.15869

10. Appendix

10.1. Distribution of Intensity and Duration of Smiles

Figure 9: Distribution of intensity and duration of BC smiles in the annotated dataset. The spread of the histograms shows the diversity of the annotated smiles.

Figure 9 shows the distribution of annotated Backchannel (BC) smiles in terms of their intensity and duration. The predicted intensity using the automated approach showed that over 50% of smiles were of B-level intensity, and fewer instances of high-intensity smiles (D- and E-levels) were also present. The mean duration was 3.18 ± 1.71 seconds.

10.2. Effect of Sex and Relationship on Smile Intensity

Table 4
ANOVA of listener sex, speaker sex, and relationship on intensity of smile. '⋅' indicates significance at p<0.1.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)
sex_listener                   1     0.53      0.53      0.60   0.4417
sex_speaker                    1     2.93      2.93      3.31   0.0710 ⋅
relationship                   3     3.23      1.08      1.22   0.3055
sex_listener * relationship    3     2.00      0.67      0.75   0.5225
sex_listener * sex_speaker     1     0.10      0.10      0.11   0.7424
sex_speaker * relationship     3     3.15      1.05      1.19   0.3176
Residuals                    144   127.49      0.89

Note that the intensity of the smile differs marginally by the speaker sex. It is not affected by other factors such as relationship, listener sex, and their interactions.