Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues

Maneesh Bilalpur1,∗, Mert Inan2, Dorsa Zeinali2, Jeffrey F. Cohn1 and Malihe Alikhani2
1 University of Pittsburgh, Pittsburgh, Pennsylvania, USA
2 Northeastern University, Boston, Massachusetts, USA

Abstract
Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations over topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Using cues from speech prosody and language along with the demographics of the speaker and listener, we found them to contain significant predictors of the intensity of backchannel smiles. Based on our findings, we introduce backchannel smile production in embodied agents as a generation problem. Our attention-based generative model suggests that listener information offers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. A user study in which generated smiles were transferred to an embodied agent suggests that an agent with backchannel smiles is perceived as more human-like and is an attractive alternative to an agent without backchannel smiles for non-personal conversations.

1. Introduction

Fewer than a third of the US population has sufficient access to mental health professionals [1]. This highlights the need for additional resources to help mental health professionals meet the community's demands. Problems like symptom detection and evaluating treatment efficacy have made great strides with AI [2, 3, 4], and the mental health community can greatly benefit from such AI interventions.

Figure 1: Overview of steps for backchannel smile generation in an embodied agent in a human-agent interaction: Speaker and listener (agent) turns are used to generate the listener's response facial expression as landmarks. The landmarks are then integrated with the embodied agent and added to the conversation flow (represented as a dotted arrow).

Embodied agent-based systems, owing to their multimodal behavioral capabilities, are a promising solution to support such mental health needs. However, the development of such systems presents numerous challenges. These include the scarcity of mental health-related datasets, limited access to domain experts for designing reliable and robust systems, and the ethical considerations crucial to their design and adaptation. Among such challenges, one aspect that stands out is the agent's ability to establish a common ground with users. Addressing this is particularly crucial when the
agent functions as a listener. Effective grounding in such Machine Learning for Cognitive and Mental Health Workshop scenarios relies heavily on multimodal non-verbal be- (ML4CMH), AAAI 2024, Vancouver, BC, Canada. haviors like backchannels. These subtle yet impactful ∗ Corresponding author. cues are pivotal in building rapport and understanding Envelope-Open mab623@pitt.edu (M. Bilalpur); inan.m@northeastern.edu (M. Inan); zeinali.d@northeastern.edu (D. Zeinali); between the user and the agent. Hence, understanding jeffcohn@pitt.edu (J. F. Cohn); m.alikhani@northeastern.edu and incorporating these behaviors into embodied agents (M. Alikhani) is not only challenging but also essential for creating a © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings supportive and empathetic environment for individuals and their physical realization by emulating the seeking mental health support. Addressing these chal- generated behavior with an embodied agent. lenges can pave the way for more effective, accessible, 5. Show that our BC smile generation yields appro- and empathetic digital mental health interventions. priate and natural-looking smiles through a user In dyadic conversations, at any given time one person study involving the embodied agent. may have the floor (i.e., is speaking) while the other is listening. Backchannels (BC) refer to behaviors of the Results suggest speaker sex, their use of negations, listener that do not interrupt the speaker. BCs signal loudness, word count in the listener’s turn, their usage of attention, agreement, and emotional response to what is comparisons, and mean pitch are significant predictors said. Inappropriate BC smiles such as ones that appear of BC smile intensity. Our generative approach shows too short or too long or for which the timing appears that taking listeners’ behavior into account improves “off” can disrupt the conversational rapport and result in performance, and adding the conditioning vector offers unsuccessful or disrupted conversations. Our objective significant improvements in terms of empirical metrics is to understand appropriate BC smiles from dyadic con- such as Average Pose Error (APE) and Probability of versations and how an embodied agent can employ them Correct Keypoints (PCK). when interacting with a human. Conversational agents typically realize BC smiles us- 2. Related Work ing rule-based systems, discriminative approaches, or sometimes simply mimicking the smiles of the speaker. Existing works have validated the efficacy of an agent- Mimicking, however, fails to generalize to situations that driven conversation in mental health dialogue and coun- require a contextually relevant smile. And rule-based seling situations. DeVault et al. [6], through their agent- and discriminative approaches offer limited coverage due based interviews for distress and trauma symptoms, to the diversity of smiles [5]. found that participants were comfortable interacting with We present a generative approach for BC smiles in the agent as well as sharing intimate information. Utami listeners to address these limitations and enable contextu- and Bickmore [7] used embodied agents for couples coun- ally relevant BC smiles in embodied agents. An overview seling. Participants reported significantly improved af- of the approach is presented in Figure 1. 
Unlike existing fect and intimacy with their partner and generally en- works that solely depend on speaker behavior for BC pro- joyed the agent-driven counseling session. Our work duction (see related work section), we use both speaker builds on this line of research to improve the BC capabil- and listener behaviors to study how they affect the in- ities of agents. tensity and duration of the BC smile. We use cues from Backchannel behaviors were traditionally produced prosody, language, and the demographics of dyads to using a set of predefined rules based on prosodic or lin- identify statistically significant predictors (referred to as guistic cues of the speaker. Both Ward and Tsukahara a conditioning vector) of smiles. In addition to the audio [8], Benus et al. [9] have found prosodic cues (particu- features from both interaction participants, we leverage larly pitch and its changes) to be reliable predictors for the conditioning vector in generating the BC smiles. In vocal BC occurrence. In contrast, we use prosody and this paper, we: linguistic cues from both speaker and listener to identify significant predictors of BC smiles. 1. Annotate backchannel smiles in a face-to-face In the multimodal context, Bertrand et al. [10] stud- interaction dataset1 of dyads that differ in their ied prosodic, morphological, and discourse markers for composition of biological sex and type of relation- their effect on vocal and gestural backchannels (hand ges- ship. tures, smiles, eyebrows), and Truong et al. [11] explored 2. Present our statistical analysis to identify vari- visual BCs by often limiting them to head nods and, at ous speaker and listener-specific cues that sig- times, grouping different BCs into the same category [12] nificantly predict the duration and intensity of without accounting for their intrinsic differences. They backchannel smiles. depended on the speaker’s behavior to identify the occur- 3. Generate backchannel smiles using an attention- rence and ignored the listener. In addition to leveraging based generative model that uses the listener and the listener behavior, we specifically study smiles because speaker turn features with the identified signifi- of their diversity and include both unimodal (visual) and cant predictors. bimodal (visual together with vocal activity) BC smiles. 4. Bridge the gap between the model-based genera- Wang et al. [13] introduced diversity in generated tion of non-verbal behaviors (as facial landmarks) smiles by conditioning on a specific class and sampling using a variational autoencoder. Learn2Smile [14] used 1 Data and code: https://github.com/bmaneesh/Generating-Context- the facial landmarks of the speaker to generate com- Sensitive-Backchannel-Smiles/ plete listener behavior by separately predicting the low- frequency (nods) and high-frequency (blinks) compo- nents of facial motion. Ng et al. [15] leverage the speaker and listener’s motion and speech features to predict the listener’s future motion information. Unlike earlier works that have been limited to facial expression genera- tion using landmarks, their usage of 3D Morphable Mod- els to define facial expressions offers a flexible solution to generate realistic facial expressions in the presence of diverse head orientations. These solutions focus on Figure 2: Distribution of speaker and listener sex across differ- the entire listener’s behavior and offer no insights about ent interpersonal relationships in annotated RealTalk dataset. 
Relationships are color-coded: siblings (pink), friends (orange), specific BC behaviors. Their integrations are also limited paternal (green), and romantic couple (grey). to 3D Morphable Models. The BC smiles produced in this work not only leverage the speaker and listener activity but also condition the generation on salient factors that were found to be signif- the 191 annotated smiles had an A-level or higher in- icant predictors of smile attributes – duration (the time tensity. One outlier smile was dropped because of the elapsed between the onset of a smile and its offset) and extremely long duration. The resultant 157 smiles, along intensity (maximum amplitude of a smile). Using an em- with their predicted intensity, were used in this work. bodied agent, we also bridge the gap between generated In addition to the video recordings at 25 fps and 720p landmarks and their physical realization. resolution, the dataset also contains speaker-identified turn-level text obtained through automatic transcription [18]. The individuals in the dyadic interaction occupied 3. Dataset fixed positions (left and right) in the videos. In this work, the biological sex of the participants was inferred from One of the primary challenges in studying non-verbal the videos. Videos where sex could not be established behavior in mental health interactions is access to an with confidence were discarded. appropriate dataset. Patient-therapist interactions or in- teractions with mental health professionals are access- restricted to protect the identifiable information of the 3.2. Effect of Sex and Relationship on individuals. As a result, we use a YouTube-based large- Smile Attributes scale dataset of face-to-face dyadic interactions–RealTalk Given various interpersonal relationships in the dataset [16]. The RealTalk dataset consists of individuals taking of individuals of both sexes, we compared the mean du- turns asking predefined, intimate questions about family, ration of backchannel smiles across the factors using dreams, relationships, illness, and mental health2 . We ANOVA (Table 1) with type-III sum of squares to account believe intimate conversations are among the closest ac- for imbalance between males and females. Two-way in- cessible alternatives to studying BC behaviors for mental teractions between sex, and sex and relationship were health applications. In this section, we elaborate on our also included. The ANOVA analysis suggests that the contributions in terms of the annotations for BC smiles duration of backchannel smiles differs significantly by and discuss how they differ by the demographics of the listener sex and the interaction effect of the listener sex dyads and features from the speaker and listener turn and relationship. A post hoc Tukey revealed that male preceding it. listeners, when interacting with their siblings (regardless of speaker sex), express longer BC smiles (p<0.05). 3.1. Annotating Backchannel Smiles Similarly, the intensity of smiles marginally differed We manually annotated 191 BC smiles from 48 (out of 692) by the speaker’s sex. The post hoc Tukey revealed that dyadic interactions in the RealTalk dataset. The dyads the smiles as a response to a male speaker are less in- comprised male and female participants from different tense than a female speaker (p<0.1). ANOVA analysis is ethnicities, and social relationships such as siblings, pa- presented in the appendix as Table 4. ternal, romantic, and fraternal. 
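The type-III ANOVA and Tukey post hoc analysis described in Section 3.2 can be sketched along the following lines with statsmodels. This is a minimal illustration only: the data frame, file name, and column names are hypothetical stand-ins for the annotation table, which is not reproduced in the paper.

```python
# Sketch: type-III ANOVA of smile duration on listener sex, speaker sex, and
# relationship (with two-way interactions), followed by a Tukey post hoc test.
# The data frame `smiles` and its column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

smiles = pd.read_csv("bc_smile_annotations.csv")  # hypothetical file

model = smf.ols(
    "duration ~ C(sex_listener) + C(sex_speaker) + C(relationship)"
    " + C(sex_listener):C(relationship)"
    " + C(sex_listener):C(sex_speaker)"
    " + C(sex_speaker):C(relationship)",
    data=smiles,
).fit()

# Type-III sums of squares account for the male/female imbalance (cf. Table 1).
# (For strict type-III tests, sum-to-zero contrasts should be set on the factors.)
print(anova_lm(model, typ=3))

# Post hoc Tukey HSD over listener sex x relationship groups.
groups = smiles["sex_listener"] + "/" + smiles["relationship"]
print(pairwise_tukeyhsd(smiles["duration"], groups, alpha=0.05))
```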
The smiles were nearly balanced across the different interpersonal relationships (see Figure 2). An automated facial expression prediction framework [17] was used to evaluate the reliability of the manual annotations. About 83% (i.e., 158 smiles) of

2 The original videos can be accessed from https://www.youtube.com/c/TheSkinDeep

Table 1
ANOVA of listener sex, speaker sex, and relationship on duration of smile. '*' indicates p<0.05 and '**' indicates p<0.01.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)
sex_listener                   1    12.36     12.36      4.59   0.0339 *
sex_speaker                    1     1.29      1.29      0.48   0.4907
relationship                   3     4.18      1.39      0.52   0.6709
sex_listener * relationship    3    42.80     14.27      5.29   0.0017 **
sex_listener * sex_speaker     1     0.90      0.90      0.33   0.5652
sex_speaker * relationship     3     9.70      3.23      1.20   0.3123
Residuals                    144   388.03      2.69

3.3. Effect of Context Cues

Our contextual cues were extracted from prosody and speech features independently derived from the turns of both the speaker and the listener just before the smile onset. Since the speaker's turn continues while the listener backchannels, speaker activity till the onset of the smiles was considered in this study. The audio was trimmed to the onset to obtain corresponding contextual cues, and the Montreal Forced Aligner (MFA) [19] was used to extract corresponding transcription information.

Prosody cues: Our prosodic features consisted of some of the fundamental characteristics of speech, such as mean pitch during the turn, range of the pitch, and Root Mean Square (RMS) energy of the audio signal. These features were chosen because of their relevance (see related work) in BC behavior and also due to the ease of interpretation as well as their ability to convey various behavioral traits. For example, RMS energy conveys traits such as confidence, doubtfulness, and enthusiasm [20]. Lastly, using the OpenSMILE [21] software, prosodic features were obtained.
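A turn-level prosody extraction of this kind can be approximated with the open-source OpenSMILE Python wrapper. The sketch below is a minimal example, assuming the eGeMAPS functional set as a stand-in for the exact (unreported) OpenSMILE configuration, and computing mean pitch, pitch range, and RMS energy for a turn trimmed to the smile onset.

```python
# Sketch: turn-level prosody cues (mean pitch, pitch range, RMS energy).
# Assumes the `opensmile` Python wrapper and librosa; the paper does not specify
# the exact OpenSMILE configuration or feature names that were used.
import librosa
import numpy as np
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # assumed feature set
    feature_level=opensmile.FeatureLevel.Functionals,   # one row per segment
)

def prosody_cues(wav_path, turn_start, smile_onset, sr=16000):
    """Extract prosodic cues from one turn, trimmed to the smile onset."""
    y, sr = librosa.load(wav_path, sr=sr, offset=turn_start,
                         duration=max(smile_onset - turn_start, 0.0))
    feats = smile.process_signal(y, sr).iloc[0]
    # Pitch statistics from the eGeMAPS F0 functionals (semitone scale).
    f0_cols = [c for c in feats.index if c.startswith("F0semitone")]
    mean_pitch = feats[[c for c in f0_cols if c.endswith("_amean")][0]]
    pitch_range = feats[[c for c in f0_cols if "pctlrange" in c][0]]
    # RMS energy computed directly from the waveform.
    rms_energy = float(np.sqrt(np.mean(y ** 2)))
    return {"mean_pitch": float(mean_pitch),
            "pitch_range": float(pitch_range),
            "rms_energy": rms_energy}
```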
Speech cues: The spoken content of speaker and listener turns was also accounted for through variables from the Linguistic Inquiry and Word Count (LIWC) [22] framework. These variables were word count, usage of negations (no, not, never), comparisons (greater, best, after), interrogative words (how, when, what), valence of the turns (positive or negative emotion), and focus on events in the past, present, and future.

A generalized linear model predicted the smile intensity from context cues and dyad demographics. Results using an inverse link function (model explained variance R^2 = 0.243) with the prosody and speech cues from the audio signal are presented as Figure 3. Note that the speakers' and listeners' context cues were Z-score normalized. Speaker characteristics such as sex and negations were found to be significant predictors of intensity. Female speakers elicited significantly narrower smiles from their listeners, but the speaker's usage of negations resulted in wider smiles. The speaker's loudness (RMS energy) had a marginally significant negative correlation with the smile intensity. Listener behavior also significantly impacted their BC smiles. Using comparative words by the listener and their mean pitch in their preceding turn resulted in significantly narrower smiles. In contrast, their word count had a marginally significant positive correlation with intensity. A similar analysis for duration did not reveal any significant correlations.

Figure 3: Regression slopes showing the effect of context cues on the intensity of BC smiles. A positive slope indicates the smile intensity increases with a given feature (vice-versa for a negative slope). '*' indicates the slope is significant at p<0.05 and '⋅' indicates marginal significance at p<0.1.

4. Modeling Smiles

To automatically generate BC smile and non-smile activity in listeners, we use the audio from the speaker's current turn and the listener's last turn as input. 15 smiles were dropped due to difficulties in the preprocessing steps with MFA. The remaining 142 annotated smile instances were augmented with an equal number of non-smile instances. The non-smile instances were identified so that they were at least two seconds away from the onset of the closest smile instance, a strategy adopted from [23] for turn-taking prediction. The mean duration of smiling and non-smiling instances was ensured to be the same.

Attention-based generative model: The generative model (Figure 4) for facial landmark prediction primarily consisted of an encoder and a decoder with a one-layer GRU each. Inputs to the model were embeddings from speaker and listener turns extracted using the pretrained vggish model [24]. We limited the input context length to use turn durations of 60 seconds. The output context was limited to predicting one second of facial activity. The speaker vggish embeddings were used as input to the encoder. The hidden state of the GRU was initialized as the mean of the listener's turn embeddings. The final hidden state of the encoder was concatenated with the conditioning vector, and a linear layer with ReLU activation was used to match the dimensionality of the decoder's hidden state. At each decoding step, attention [25] was applied between the encoder output and the decoder's last hidden state (Equation 1) to use as the input to the next step.

a(s_{t-1}, h_i) = v^{T} \tanh(W_a h_i + W_b s_{t-1})    (1)

where a(s_{t-1}, h_i) is the attention between the decoder's last hidden state (s_{t-1}) and the encoder output (h_i); W_a, W_b, and v are linear layers.

Figure 4: Generative model architecture. Encoder input contains speech embeddings of listener and speaker from the pretrained vggish model. The encoder's final hidden state is concatenated with the conditioning vector and then used to initialize the decoder's hidden state. Decoder output landmarks are sequentially fed (dotted curves) to generate the next landmarks in the output sequence.
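As a concrete illustration of this architecture and of Equation 1, the PyTorch sketch below wires together a one-layer GRU encoder and decoder, the conditioning-vector bridge, and the additive attention. All sizes (embedding, hidden, conditioning, and landmark dimensions) are illustrative assumptions; the authors' released code (linked in the data/code footnote) remains the reference implementation.

```python
# Minimal sketch (PyTorch) of the GRU encoder-decoder with additive attention (Eq. 1).
# Sizes (128-d turn embeddings, 6-d conditioning vector, 98-d landmark displacements)
# are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class BCSmileGenerator(nn.Module):
    def __init__(self, emb_dim=128, cond_dim=6, lmk_dim=98):
        super().__init__()
        hid = emb_dim  # hidden size matches the embedding size so the mean of the
                       # listener embeddings can initialize the encoder hidden state
        self.encoder = nn.GRU(emb_dim, hid, num_layers=1, batch_first=True)
        self.decoder = nn.GRU(lmk_dim + hid, hid, num_layers=1, batch_first=True)
        self.bridge = nn.Sequential(nn.Linear(hid + cond_dim, hid), nn.ReLU())
        # Additive attention: a(s_{t-1}, h_i) = v^T tanh(W_a h_i + W_b s_{t-1})
        self.W_a = nn.Linear(hid, hid, bias=False)
        self.W_b = nn.Linear(hid, hid, bias=False)
        self.v = nn.Linear(hid, 1, bias=False)
        self.out = nn.Linear(hid, lmk_dim)

    def attend(self, enc_out, s_prev):
        # enc_out: (B, T, H); s_prev: (B, H) -> context vector (B, H)
        scores = self.v(torch.tanh(self.W_a(enc_out) + self.W_b(s_prev).unsqueeze(1)))
        return (torch.softmax(scores, dim=1) * enc_out).sum(dim=1)

    def forward(self, speaker_emb, listener_emb, cond, n_steps):
        # speaker_emb: (B, T_s, E) speaker-turn embeddings (encoder input)
        # listener_emb: (B, T_l, E) listener-turn embeddings; cond: (B, cond_dim)
        h0 = listener_emb.mean(dim=1).unsqueeze(0)              # init encoder hidden
        enc_out, h_enc = self.encoder(speaker_emb, h0)
        s = self.bridge(torch.cat([h_enc[-1], cond], dim=-1))   # init decoder hidden
        y = speaker_emb.new_zeros(speaker_emb.size(0), self.out.out_features)
        outputs = []
        for _ in range(n_steps):                                # one step per frame
            ctx = self.attend(enc_out, s)
            _, s_new = self.decoder(torch.cat([y, ctx], dim=-1).unsqueeze(1),
                                    s.unsqueeze(0))
            s = s_new[-1]
            y = self.out(s)                                     # landmark displacements
            outputs.append(y)
        return torch.stack(outputs, dim=1)                      # (B, n_steps, lmk_dim)
```

During training, teacher forcing with simulated annealing (Section 4.1) would replace y with the ground-truth displacement at a probability that decays over epochs.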
4.1. Implementation details

The videos were split into two vertical halves, one corresponding to each individual in the dyadic interaction. These were used for facial landmark extraction using the AFAR toolbox [17]. To account for various facial shapes, we normalized landmarks to the mean face of the dataset using the approach described in [26]. Because of the high degree of correlation between successive frames, frames were downsampled by a factor of three, to use every third frame. Displacement was then calculated as the difference between the landmarks from successive frames. These were further subjected to a min-max normalization to allow for individual differences in smiling dynamics. The normalized displacements were predicted using the attention-based generative model. The predicted frame-level displacements were incorporated into the last known listener facial expression to generate the sequence of facial landmarks recursively.

We enforced teacher-forcing with simulated annealing during training and linearly decreased the likelihood of using ground truth every 20 epochs. Stochastic Gradient Descent with a learning rate initialized at 1e-4, weight decay, and 0.99 momentum was used to minimize the Mean Squared Error (MSE) between predictions and the ground truth. The learning rate was halved when validation loss plateaued for 20 consecutive epochs. Data was partitioned into a 75 (train), 15 (validation), and 15 (test) split in terms of the number of dyads. Models were trained for 250 epochs, and validation loss was used to determine the best model for testing. This was repeated 10 times to evaluate the statistical significance of differences against the baseline speaker-based BC generation setting.

Metrics: Objective measures of performance from gesture generation approaches, including Average Pose Error (APE) and Probability of Correct Keypoints (PCK), were adopted to quantify the generated landmarks against the ground truth from the AFAR toolbox. APE (Equation 2) is equivalent to the mean squared error between the predicted facial expression and the ground truth facial expression. PCK (Equation 3) is a proximity-based metric that considers a landmark to be correctly predicted if its difference with the ground truth falls below a margin. We report mean PCK for \sigma = 0.1 and 0.2.

APE = \frac{1}{k} \sum_{p=1}^{k} \lVert \hat{y}(p) - y(p) \rVert_2    (2)

where k is the number of landmarks, \hat{y}(p) is the prediction, and y(p) is the ground truth.

PCK_{\sigma} = \frac{1}{k} \sum_{p=1}^{k} \delta\big(\lVert \hat{y}(p) - y(p) \rVert_2 \le \sigma\big)    (3)

where \delta is an indicator function and \sigma is the margin.

4.2. Results

Using listener behavior and the conditioning vector together with the speaker behavior resulted in improved performance compared to the baseline speaker behavior-based prediction. As shown in Table 2, APE decreased by 0.273 points while PCK increased by 0.004; these gains were statistically significant. When listener behavior was added to the speaker behavior, marginally significant improvements were observed. APE reduced by 0.206 points while PCK increased by 0.001 points. These reiterate our hypothesis that both speaker and listener contribute to BC behaviors. When speaker behavior was augmented with the conditioning vector, only nominal differences were observed against the baseline. APE increased by 0.063 points, and PCK decreased by 0.001.

Table 2
Average Pose Error (APE) and Probability of Correct Keypoints (PCK) metrics for generated facial expressions under various experimental settings. A downward-facing arrow indicates a lower value implies better generation. '*' indicates significance with p<0.05 and '⋅' indicates marginal significance with p<0.1.

Model                                            APE↓      PCK↑
Speaker only (Baseline)                          9.552     0.219
Speaker and Listener                             9.346⋅    0.220⋅
Speaker and Listener with Conditioning vector    9.279*    0.223*
Speaker and Conditioning vector                  9.615     0.218⋅

Figure 5: Effect of duration and intensity of smile along with ablation of inputs on generative model performance measured using APE (top) and PCK (bottom). S & C-speaker and conditioning vector, S & L-speaker and listener, and S, L & C-speaker

To understand how the performance varies with different smiles, we predicted APE (and PCK) as a linear combination of duration, intensity, and the model configuration using a regression model. Results from Figure 5 show that duration significantly affects the PCK. Interestingly, the positive slope suggests that longer smiles are generated better than shorter smiles. Only a marginally significant effect of duration can be observed for APE. With the increase in the intensity of the smile, the generation performance decreases.
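For completeness, the two metrics in Equations 2 and 3 can be computed per frame as in the short NumPy sketch below; treating the norm as the Euclidean distance per landmark is an assumption consistent with the equations above.

```python
# Sketch: Average Pose Error (Eq. 2) and PCK (Eq. 3) for one frame of k landmarks.
# pred and gt are (k, 2) arrays of landmark coordinates; sigma is the PCK margin.
import numpy as np

def ape(pred, gt):
    """Mean L2 distance between predicted and ground-truth landmarks."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pck(pred, gt, sigma=0.1):
    """Fraction of landmarks whose L2 error falls within the margin sigma."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1) <= sigma))

# Example: averaging over the frames of a generated smile, as reported in Table 2.
# frames_pred, frames_gt: (n_frames, k, 2)
# mean_ape = np.mean([ape(p, g) for p, g in zip(frames_pred, frames_gt)])
```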
This is significant for and listener and conditioning vector as inputs to the model. D-level and E-level smiles. Using listener features and ‘⋅’, ‘*’ and ‘***’ indicate significance with p <0.1, p <0.05 and p the conditioning vector along with the speaker features <0.001 respectively. improves the performance (negative and positive slopes for APE and PCK, respectively) compared to the baseline speaker-based generation. However, this effect is not utterance. However, the model fails to capture this verti- statistically significant. cal motion. Qualitative evaluation of ground truth landmarks from Metrics like APE and PCK provide an objective mea- Figure 6 suggest the deficiencies of the existing facial sure of the prediction. However, evaluating concepts landmark prediction approaches [17] to accurately track such as realism and contextual relevance of the BC predic- lip corners both in the presence and absence of non- tion requires subjective ratings from human evaluation. frontal head pose. While a visually noticeable difference A convention in evaluating landmark or keypoint-based can be observed as the smile evolves, the ground truth generative approaches is the human comparison of pre- landmarks fail to capture the subtle lip corner motion. dicted keypoints against the ground truth [14, 27]. While This limitation in the ground truth has resulted in nom- this might work for problems such as gesture genera- inal motion in the predicted landmarks. We also found tion that involve a strong motion component, evaluating that BC smiles that co-occur with vocal activity are chal- subtle behaviors like facial expressions using a similar lenging to predict. Figure 7 shows one example where strategy could be challenging. To address this concern, the vertical distance between the upper and lower lips we leverage the emulated version of an embodied agent: increases and decreases because of the simultaneous yeah Furhat [28]. Figure 6: Two sample smiles from the dataset showing their onsets (left-most frame to widest smile frame) and offsets (widest smile frame to right-most frame). Note that while the evolution of smile is noticeable in ground truth landmarks (second row) of the top smile, subtle changes between successive frames of the bottom smile are not captured by its ground truth landmarks. This is also observed in the generated landmarks (third row). Zoom-in recommended. The faces used are from the RealTalk dataset. 5.1. Emulation Setup Furhat allows users to control facial expressions using a set of facial parameters called BasicParams3 (ex. MOUTH_SMILE_LEFT and MOUTH_SMILE_RIGHT to control the left and the right lip corners; BROW_UP_LEFT, BROW_UP_RIGHT to control the left and right eyebrows, etc.). Our setup uses these parameters to enable the embodied agent’s smile and Figure 7: Limitation of the current approach in generating express associated eyebrow actions. The landmarks a bimodal backchannel smile. The frames highlighted in red from a generated smile expression were used to calculate box correspond to the co-occurring verbal “yeah”. Notice that ground truth landmarks (second row) fail to capture the verti- the displacement between successive frames and nor- cal mouth movement. This is also observed in the generated malized to the [0, 1] range. For eyebrows, only vertical landmarks (third row). Zoom-in recommended. The faces displacement was used. Our inputs to the Furhat API used are from the RealTalk dataset. 
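As a rough illustration of this emulation step, the snippet below builds a custom Furhat gesture from generated lip-corner and brow displacements using the BasicParams named above. It assumes the furhat-remote-api Python client (see footnote 3) and its custom-gesture payload; the frame timing and parameter scaling are illustrative rather than the authors' exact mapping.

```python
# Sketch: sending a generated BC smile to Furhat as a custom gesture.
# Assumes the `furhat-remote-api` Python client; payload format and scaling here
# are illustrative, not the authors' exact setup.
from furhat_remote_api import FurhatRemoteAPI

furhat = FurhatRemoteAPI("localhost")  # Furhat Desktop SDK / robot address

def backchannel_smile(lip_left, lip_right, brow_left, brow_right, duration_s):
    """Raise lip corners and brows to the peak-frame displacements, then relax."""
    gesture = {
        "class": "furhatos.gestures.Gesture",
        "name": "BCSmile",
        "frames": [
            {   # peak of the smile at half of the generated duration
                "time": [duration_s / 2.0],
                "persist": False,
                "params": {
                    "MOUTH_SMILE_LEFT": lip_left,    # displacements normalized to [0, 1]
                    "MOUTH_SMILE_RIGHT": lip_right,
                    "BROW_UP_LEFT": brow_left,
                    "BROW_UP_RIGHT": brow_right,
                },
            },
            {   # return to neutral at the smile offset
                "time": [duration_s],
                "params": {"reset": True},
            },
        ],
    }
    furhat.gesture(body=gesture)

# Example: peak displacements taken from the widest-smile frame of a generated sequence.
backchannel_smile(0.7, 0.7, 0.3, 0.3, duration_s=2.4)
```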
consisted of the lip corner and eyebrow displacements corresponding to the frame with the widest smile (maximum horizontal displacement between the lip 5. Smiles on an Embodied Agent corners). The duration of the Furhat smile was set to the duration of the generated smile. Figure 8 shows an So far, we have shown modeling smiles by generating example of the resultant expression. The user study was facial landmarks. However, users in real-world scenarios conducted using the Furhat Desktop SDK. However, we do not expect to see such abstract representations of do not foresee difficulties transferring the emulation faces. Aligning these facial landmarks with embodied setup to a physically embodied Furhat. agents is key for an interactable conversational agent. To achieve this, we describe the procedure to transfer 5.2. User Study Procedure generated landmarks to an embodied robotic simulation system called Furhat. We then conduct a user study for We conducted a small-scale user study of participants subjective perceived differences in Furhat’s behavior due watching two pre-recorded videos of the Furhat interact- to BC smile. ing with an individual. They differ only in terms of Furhat expressing a BC smile. In both interactions, Furhat starts 3 https://docs.furhat.io/remote-api/#python-remote-api Table 3 Number of responses that expressed moderate or strong agree- ment along various factors related to the BC smiles when interacting with Furhat with and without backchannel behav- iors. Question Backchannel Non-backchannel Figure 8: Four frames of an example Furhat robot emulation Human-like 5 4 with different levels of smiles used as backchannels during Natural 6 6 the conversation in our user study. Willing to interact 1 0 Appropriate brightness 3 5 Longer or shorter smiles 2 0 Personal conversations 1 1 Non-personal conversations 3 2 with a brief introduction of itself, followed by a short question–“How have you been feeling over the last two weeks?”. As the user responds, a smile is generated at that the brightness of the BC smile was appropriate while the appropriate location (see Figure 8). We refer to this two found that the duration of BC smile was longer or scenario as the backchannel setting. Another video of shorter than expected. While no difference was observed the same individual interacting with Furhat with no BC in terms of users’ preference for Furhat for personal con- (non-backchannel) serves as our baseline. Seven gradu- versations based on the presence of the BC smile, more ate students then rated each video recording separately. users (3/7) responded that they would use Furhat with Note that raters were not primed on the study’s outcome, BC smiles for non-personal conversations over Furhat and no explicit instructions about smiles were given. without BC smiles (2/7). To quantify the user’s perception of Furhat interacting with an individual, the influence of BC smile in addition to the effect of its intensity and duration, and their will- 6. Discussion ingness to interact with one was quantified through the following questions on a 5-point Likert scale (1: strongly Our quantitative results suggest that both speaker and lis- agree, 5: strongly disagree). tener behavior are important in generating BC behavior. Using listener behavior together with the conditioning 1. The Furhat’s smiles looked human-like. vector offered statistically significant improvements in 2. The Furhat’s smiles looked natural and friendly. performance when compared to the baseline speaker- 3. 
I would talk to this agent frequently. only model. This effect was observed both in terms of 4. I felt the brightness of Furhat’s smiles was appro- APE and PCK. We also found that our attention-based priate. generative model can predict low-intensity smiles better 5. The Furhat was smiling for longer or shorter du- than high-intensity smiles. Our user study shows that ration than it was expected. more people find our agent human-like when it was able 6. I would feel comfortable talking to this agent to express BC smiles. Participants prefer to interact with about non-personal topics. it over the agent with no BC smile capabilities for non- 7. I would feel comfortable talking to this agent personal conversations. However, for intimate personal about personal topics. conversations, the presence of a BC smile did not sway their decision. In addition, open-ended feedback was also a part of the Some limitations of this work include the following. questionnaire. We believe these questions help identify We employed an affordable measure of reliability for BC some user-facing challenges in generating BC behav- smile annotations using a prediction model over a hu- iors and how they influence users’ attitudes to embodied man rater. A robust approach would involve at least one agent-based dialogue systems for conversations related more human annotator to perform reliability annotations to mental health. on a portion of the dataset. The statistical analysis also assumes that the smiles were independent of the individ- 5.3. Results uals and dyads. However, a given individual typically produces multiple smiles. Grouping of smiles by factors Table 3 shows that more users (5/7) expressed moderate such as individuals and dyads can be better modelled us- or higher agreement that the Furhat agent with BC smile ing a mixed-effects model. Our user study was designed was human-like than its counterpart without BC smile to demonstrate the feasibility of transferring generated (4/7). One user expressed interest in frequently interact- facial landmarks to an embodied agent together with un- ing with the agent in backchannel setting while the lack derstanding perceived differences between interactions of backchannels resulted in increased hesitancy among with and without BC smiles. An appropriate evaluation users in frequently using it. Three (out of 7) users found framework would include the user interacting with the romantic relationships, and lack of age and ethnicity in- agent. Followed by a comparison of qualitative subjec- formation in the dataset might have resulted in biased tive ratings of user experience and quantified parameters generations. We also acknowledge that using embodied (such as difference in turn duration, language usage, etc.)agents in such sensitive applications should undergo rig- of the interaction with and without BC smiles. We believe orous evaluations by technical and domain experts and such approaches provide a holistic evaluation to identify regulatory bodies. In our work, we do not interpret em- critical instances in the interaction. Lastly, we focused on bodied agents as a substitute for professionals in mental BC smiles leaving out other conventional signals such as health or allied areas of healthcare but to provide tools vocal and headpose-based BCs, and how they are affected for them to better serve the community’s demands. We by the cues from the speaker and listener. 
believe that the advantages and limitations of embod- ied agents in mental health should be presented to the users and the healthcare experts to provide maximum 7. Conclusion benefits. The information used in this work is identified from a publicly available dataset. Also, special attention To enable BCs in embodied agents for mental health has been paid to privacy and copyright requirements for applications, we proposed an annotated dataset of face- relevant images showing individual faces. The user study to-face conversations including topics related to mental raters were voluntary participants, and the University of health. Our statistical analysis showed that speaker gen- Pittsburgh IRB approved the data collection. der together with prosodic and linguistic cues from both speaker and listener turns are significant predictors of the BC smile intensity. Using the significant predictors 9. Acknowledgments together with the speaker and listener behaviors to gen- erate BC smiles offers significant improvements in terms Bilalpur and Cohn were supported by the U.S. National In- of empirical metrics over the baseline speaker-centric stitutes of Health through award MH R01-096951. Zeinali generation. was supported through the Khoury Distinguished Fel- We bridge the gap between conventional non-verbal lowship at Northeastern University. behavior generation approaches such as landmarks and poses and their realization by showing that generated landmarks can be transferred to an embodied agent. Thus References creating the opportunity for evaluation with a human- [1] H. Modi, K. Orgera, A. Grover, Exploring barriers to like manifestation over a traditional evaluation by com- mental health care in the u.s. (2022). doi:10.15766/ paring generated landmark (or keypoint) outputs. Our rai_a3ewcf9p . small-scale user study suggests our Furhat agent that [2] S. Song, S. Jaiswal, L. Shen, M. Valstar, Spectral rep- backchannels is more human-like and are more likely to resentation of behaviour primitives for depression attract users for non-personal interactions. In addition analysis, IEEE Transactions on Affective Comput- to these contributions, we also discussed some limita- ing 13 (2020) 829–844. tions in existing technology towards generating accurate [3] F. Ceccarelli, M. Mahmoud, Multimodal temporal ground truth landmarks through examples such as failure machine learning for bipolar disorder and depres- to capture mouth movement in bimodal BCs and how sion recognition, Pattern Analysis and Applications they affect the generated outputs. We believe these limi- 25 (2022) 493–504. tations also serve as directions for future research. Our [4] Y. Yang, C. Fairbairn, J. F. Cohn, Detecting depres- work serves as a baseline for computer scientists inter- sion severity from vocal prosody, IEEE transactions ested in behavior generation, and an attractive source of on affective computing 4 (2012) 142–150. BC smiles for behavioral scientists to study the effect of [5] Z. Ambadar, J. F. Cohn, L. I. Reed, All smiles are not context cues on BC smiles in intimate conversations. created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/n- 8. Ethical Statement ervous, Journal of nonverbal behavior 33 (2009) 17–34. We proposed a generative approach for backchannel [6] D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, smile production to enable naturalistic interactions with A. Gainer, K. Georgila, J. Gratch, A. Hartholt, embodied AI agents for mental health dialogue. While M. 
Lhommet, et al., Simsensei kiosk: A virtual hu- our dataset offers diverse smiles from people in different man interviewer for healthcare decision support, in: interpersonal relationships, like many existing genera- Proceedings of the 2014 international conference tive approaches, the choice of pretrained embeddings, on Autonomous agents and multi-agent systems, imbalance between males and females, lack of male-male 2014, pp. 1061–1068. [7] D. Utami, T. Bickmore, Collaborative user responses (2020). in multiparty interaction with a couples counselor [21] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the robot, in: 2019 14th ACM/IEEE International Con- munich versatile and fast open-source audio fea- ference on Human-Robot Interaction (HRI), IEEE, ture extractor, in: Proceedings of the 18th ACM 2019, pp. 294–303. international conference on Multimedia, 2010, pp. [8] N. Ward, W. Tsukahara, Prosodic features which 1459–1462. cue back-channel responses in english and japanese, [22] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Black- Journal of pragmatics 32 (2000) 1177–1207. burn, The development and psychometric proper- [9] S. Benus, A. Gravano, J. B. Hirschberg, The prosody ties of LIWC2015, Technical Report, 2015. of backchannels in american english (2007). [23] E. Ekstedt, G. Skantze, Voice activity projec- [10] R. Bertrand, G. Ferré, P. Blache, R. Espesser, tion: Self-supervised learning of turn-taking events, S. Rauzy, Backchannels revisited from a multimodal arXiv preprint arXiv:2205.09812 (2022). perspective, in: Auditory-visual Speech Processing, [24] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, 2007, pp. 1–5. A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. [11] K. P. Truong, R. Poppe, I. de Kok, D. Heylen, A mul- Saurous, B. Seybold, et al., Cnn architectures for timodal analysis of vocal and visual backchannels large-scale audio classification, in: 2017 ieee inter- in spontaneous dialogs., in: INTERSPEECH, 2011, national conference on acoustics, speech and signal pp. 2973–2976. processing (icassp), IEEE, 2017, pp. 131–135. [12] A. Gravano, J. Hirschberg, Backchannel-inviting [25] D. Bahdanau, K. Cho, Y. Bengio, Neural machine cues in task-oriented dialogue, in: Tenth Annual translation by jointly learning to align and translate, Conference of the International Speech Communi- arXiv preprint arXiv:1409.0473 (2014). cation Association, 2009. [26] S. Stoll, N. C. Camgöz, S. Hadfield, R. Bowden, Sign [13] W. Wang, X. Alameda-Pineda, D. Xu, P. Fua, E. Ricci, language production using neural machine transla- N. Sebe, Every smile is unique: Landmark-guided tion and generative adversarial networks, in: Pro- diverse smile generation, in: Proceedings of the ceedings of the 29th British Machine Vision Con- IEEE Conference on Computer Vision and Pattern ference (BMVC 2018), British Machine Vision Asso- Recognition, 2018, pp. 7083–7092. ciation, 2018. [14] W. Feng, A. Kannan, G. Gkioxari, C. L. Zit- [27] C. Ahuja, D. W. Lee, R. Ishii, L.-P. Morency, No ges- nick, Learn2smile: Learning non-verbal interac- tures left behind: Learning relationships between tion through observation, in: 2017 IEEE/RSJ In- spoken language and freeform gestures, in: Find- ternational Conference on Intelligent Robots and ings of the Association for Computational Linguis- Systems (IROS), IEEE, 2017, pp. 4131–4138. tics: EMNLP 2020, 2020, pp. 1884–1895. [15] E. Ng, H. Joo, L. Hu, H. Li, T. Darrell, A. Kanazawa, [28] S. Al Moubayed, J. Beskow, G. Skantze, S. Ginosar, Learning to listen: Modeling non- B. 
Granström, Furhat: a back-projected human-like robot head for multiparty human-machine interaction, in: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers, Springer, 2012, pp. 114–130.
deterministic dyadic facial motion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20395–20405.
[16] S. Geng, R. Teotia, P. Tendulkar, S. Menon, C. Vondrick, Affective faces for goal-driven dyadic communication, arXiv preprint arXiv:2301.10939 (2023).
[17] I. O. Ertugrul, L. A. Jeni, W. Ding, J. F. Cohn, AFAR: A deep learning based tool for automated facial affect recognition, in: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–1.
[18] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, arXiv preprint arXiv:1904.05862 (2019).
[19] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, M. Sonderegger, Montreal Forced Aligner: Trainable text-speech alignment using Kaldi, in: Proc. Interspeech 2017, 2017, pp. 498–502. doi:10.21437/Interspeech.2017-1386.
[20] S. A. Memon, Acoustic correlates of the voice qualifiers: A survey, arXiv preprint arXiv:2010.15869

10. Appendix

10.1. Distribution of Intensity and Duration of Smiles

Figure 9: Distribution of intensity and duration of BC smiles in the annotated dataset. The spread of the histograms shows the diversity of the annotated smiles.

Figure 9 shows the distribution of annotated Backchannel (BC) smiles in terms of their intensity and duration. The predicted intensity using the automated approach showed that over 50% of smiles were of B-level intensity, and fewer instances of high-intensity smiles (D- and E-levels) were also present. The mean duration was 3.18 ± 1.71 seconds.

10.2. Effect of Sex and Relationship on Smile Intensity

Table 4
ANOVA of listener sex, speaker sex, and relationship on intensity of smile. '⋅' indicates significance at p<0.1.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)
sex_listener                   1     0.53      0.53      0.60   0.4417
sex_speaker                    1     2.93      2.93      3.31   0.0710 ⋅
relationship                   3     3.23      1.08      1.22   0.3055
sex_listener * relationship    3     2.00      0.67      0.75   0.5225
sex_listener * sex_speaker     1     0.10      0.10      0.11   0.7424
sex_speaker * relationship     3     3.15      1.05      1.19   0.3176
Residuals                    144   127.49      0.89

Note that the intensity of the smile differs marginally by the speaker sex. It is not affected by other factors such as relationship, listener sex, and their interactions.