<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Maneesh</forename><surname>Bilalpur</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Pittsburgh</orgName>
								<address>
									<settlement>Pittsburgh</settlement>
									<region>Pennsylvania</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dorsa</forename><surname>Zeinali</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Northeastern University</orgName>
								<address>
									<settlement>Boston</settlement>
									<region>Massachusetts</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jeffrey</forename><forename type="middle">F</forename><surname>Cohn</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Pittsburgh</orgName>
								<address>
									<settlement>Pittsburgh</settlement>
									<region>Pennsylvania</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Northeastern University</orgName>
								<address>
									<settlement>Boston</settlement>
									<region>Massachusetts</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Malihe</forename><surname>Alikhani</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Northeastern University</orgName>
								<address>
									<settlement>Boston</settlement>
									<region>Massachusetts</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8F6B6EB35BF90997AECDFD8B857F2757</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:45+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations over topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Using cues from speech prosody and language along with the demographics of the speaker and listener, we identified significant predictors of the intensity of backchannel smiles. Based on our findings, we frame backchannel smile production in embodied agents as a generation problem. Our attention-based generative model shows that listener information offers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. Our user study, in which generated smiles were transferred to an embodied agent, suggests that an agent with backchannel smiles is perceived as more human-like and is an attractive alternative to an agent without backchannel smiles for non-personal conversations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Fewer than a third of the US population has sufficient access to mental health professionals <ref type="bibr" target="#b0">[1]</ref>. This highlights the need for additional resources to help mental health professionals meet the community's demands. Problems like symptom detection and evaluating treatment efficacy have made great strides with AI <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref> and the mental health community can greatly benefit from this AI intervention. Embodied agent-based systems due to their multimodal behavioral capabilities are a promising solution to support such mental health needs. However, the development of such systems presents numerous challenges. These include the scarcity of mental health-related datasets, limited access to domain experts for designing reliable and robust systems, and the ethical considerations crucial to their design and adaptation. Among such challenges, one aspect that stands out is the agent's ability to establish a common ground with users. Addressing this is particularly crucial when the agent functions as a listener. Effective grounding in such Envelope mab623@pitt.edu (M. Bilalpur); inan.m@northeastern.edu (M. Inan); zeinali.d@northeastern.edu (D. Zeinali); jeffcohn@pitt.edu (J. F. Cohn); m.alikhani@northeastern.edu (M. Alikhani) Figure <ref type="figure">1</ref>: Overview of steps for backchannel smile generation in an embodied agent in a human-agent interaction: Speaker and listener (agent) turns are used to generate the listener's response facial expression as landmarks. The landmarks are then integrated with the embodied agent and added to the conversation flow represented as a dotted arrow. scenarios relies heavily on multimodal non-verbal behaviors like backchannels. These subtle yet impactful cues are pivotal in building rapport and understanding between the user and the agent. Hence, understanding and incorporating these behaviors into embodied agents is not only challenging but also essential for creating a supportive and empathetic environment for individuals seeking mental health support. Addressing these challenges can pave the way for more effective, accessible, and empathetic digital mental health interventions.</p><p>In dyadic conversations, at any given time one person may have the floor (i.e., is speaking) while the other is listening. Backchannels (BC) refer to behaviors of the listener that do not interrupt the speaker. BCs signal attention, agreement, and emotional response to what is said. Inappropriate BC smiles such as ones that appear too short or too long or for which the timing appears "off" can disrupt the conversational rapport and result in unsuccessful or disrupted conversations. Our objective is to understand appropriate BC smiles from dyadic conversations and how an embodied agent can employ them when interacting with a human.</p><p>Conversational agents typically realize BC smiles using rule-based systems, discriminative approaches, or sometimes simply mimicking the smiles of the speaker. Mimicking, however, fails to generalize to situations that require a contextually relevant smile. 
Rule-based and discriminative approaches, meanwhile, offer limited coverage due to the diversity of smiles <ref type="bibr" target="#b4">[5]</ref>.</p><p>We present a generative approach for BC smiles in listeners to address these limitations and enable contextually relevant BC smiles in embodied agents. An overview of the approach is presented in Figure <ref type="figure">1</ref>. Unlike existing works that solely depend on speaker behavior for BC production (see related work section), we use both speaker and listener behaviors to study how they affect the intensity and duration of the BC smile. We use cues from prosody, language, and the demographics of dyads to identify statistically significant predictors (referred to as a conditioning vector) of smiles. In addition to the audio features from both interaction participants, we leverage the conditioning vector in generating the BC smiles. In this paper, we:</p><p>1. Annotate backchannel smiles in a face-to-face interaction dataset<ref type="foot" target="#foot_0">1</ref> of dyads that differ in their composition of biological sex and type of relationship. 2. Present our statistical analysis to identify various speaker- and listener-specific cues that significantly predict the duration and intensity of backchannel smiles. 3. Generate backchannel smiles using an attention-based generative model that uses the listener and speaker turn features with the identified significant predictors. 4. Bridge the gap between the model-based generation of non-verbal behaviors (as facial landmarks) and their physical realization by emulating the generated behavior with an embodied agent. 5. Show that our BC smile generation yields appropriate and natural-looking smiles through a user study involving the embodied agent.</p><p>Results suggest that the speaker's sex, use of negations, and loudness, along with the word count, usage of comparisons, and mean pitch in the listener's turn, are significant predictors of BC smile intensity. Our generative approach shows that taking the listener's behavior into account improves performance, and adding the conditioning vector offers significant improvements in terms of empirical metrics such as Average Pose Error (APE) and Probability of Correct Keypoints (PCK).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Existing works have validated the efficacy of an agentdriven conversation in mental health dialogue and counseling situations. DeVault et al. <ref type="bibr" target="#b5">[6]</ref>, through their agentbased interviews for distress and trauma symptoms, found that participants were comfortable interacting with the agent as well as sharing intimate information. Utami and Bickmore <ref type="bibr" target="#b6">[7]</ref> used embodied agents for couples counseling. Participants reported significantly improved affect and intimacy with their partner and generally enjoyed the agent-driven counseling session. Our work builds on this line of research to improve the BC capabilities of agents.</p><p>Backchannel behaviors were traditionally produced using a set of predefined rules based on prosodic or linguistic cues of the speaker. Both Ward and Tsukahara <ref type="bibr" target="#b7">[8]</ref>, Benus et al. <ref type="bibr" target="#b8">[9]</ref> have found prosodic cues (particularly pitch and its changes) to be reliable predictors for vocal BC occurrence. In contrast, we use prosody and linguistic cues from both speaker and listener to identify significant predictors of BC smiles.</p><p>In the multimodal context, Bertrand et al. <ref type="bibr" target="#b9">[10]</ref> studied prosodic, morphological, and discourse markers for their effect on vocal and gestural backchannels (hand gestures, smiles, eyebrows), and Truong et al. <ref type="bibr" target="#b10">[11]</ref> explored visual BCs by often limiting them to head nods and, at times, grouping different BCs into the same category <ref type="bibr" target="#b11">[12]</ref> without accounting for their intrinsic differences. They depended on the speaker's behavior to identify the occurrence and ignored the listener. In addition to leveraging the listener behavior, we specifically study smiles because of their diversity and include both unimodal (visual) and bimodal (visual together with vocal activity) BC smiles.</p><p>Wang et al. <ref type="bibr" target="#b12">[13]</ref> introduced diversity in generated smiles by conditioning on a specific class and sampling using a variational autoencoder. Learn2Smile <ref type="bibr" target="#b13">[14]</ref> used the facial landmarks of the speaker to generate complete listener behavior by separately predicting the low-frequency (nods) and high-frequency (blinks) components of facial motion. Ng et al. <ref type="bibr" target="#b14">[15]</ref> leverage the speaker and listener's motion and speech features to predict the listener's future motion information. Unlike earlier works that have been limited to facial expression generation using landmarks, their usage of 3D Morphable Models to define facial expressions offers a flexible solution to generate realistic facial expressions in the presence of diverse head orientations. These solutions focus on the entire listener's behavior and offer no insights about specific BC behaviors. Their integrations are also limited to 3D Morphable Models.</p><p>The BC smiles produced in this work not only leverage the speaker and listener activity but also condition the generation on salient factors that were found to be significant predictors of smile attributes -duration (the time elapsed between the onset of a smile and its offset) and intensity (maximum amplitude of a smile). Using an embodied agent, we also bridge the gap between generated landmarks and their physical realization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>One of the primary challenges in studying non-verbal behavior in mental health interactions is access to an appropriate dataset. Patient-therapist interactions or interactions with mental health professionals are accessrestricted to protect the identifiable information of the individuals. As a result, we use a YouTube-based largescale dataset of face-to-face dyadic interactions-RealTalk <ref type="bibr" target="#b15">[16]</ref>. The RealTalk dataset consists of individuals taking turns asking predefined, intimate questions about family, dreams, relationships, illness, and mental health<ref type="foot" target="#foot_1">2</ref> . We believe intimate conversations are among the closest accessible alternatives to studying BC behaviors for mental health applications. In this section, we elaborate on our contributions in terms of the annotations for BC smiles and discuss how they differ by the demographics of the dyads and features from the speaker and listener turn preceding it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Annotating Backchannel Smiles</head><p>We manually annotated 191 BC smiles from 48 (out of 692) dyadic interactions in the RealTalk dataset. The dyads comprised male and female participants from different ethnicities, and social relationships such as siblings, paternal, romantic, and fraternal. The smiles were nearly balanced across the different interpersonal relationships (see Figure <ref type="figure" target="#fig_1">2</ref>). An automated facial expression prediction framework <ref type="bibr" target="#b16">[17]</ref> was used to evaluate the reliability of the manual annotations. About 83% (i.e., 158 smiles) of the 191 annotated smiles had an A-level or higher intensity. One outlier smile was dropped because of the extremely long duration. The resultant 157 smiles, along with their predicted intensity, were used in this work. In addition to the video recordings at 25 fps and 720p resolution, the dataset also contains speaker-identified turn-level text obtained through automatic transcription <ref type="bibr" target="#b17">[18]</ref>. The individuals in the dyadic interaction occupied fixed positions (left and right) in the videos. In this work, the biological sex of the participants was inferred from the videos. Videos where sex could not be established with confidence were discarded.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Effect of Sex and Relationship on Smile Attributes</head><p>Given various interpersonal relationships in the dataset of individuals of both sexes, we compared the mean duration of backchannel smiles across the factors using ANOVA (Table <ref type="table" target="#tab_0">1</ref>) with type-III sum of squares to account for imbalance between males and females. Two-way interactions between sex, and sex and relationship were also included. The ANOVA analysis suggests that the duration of backchannel smiles differs significantly by listener sex and the interaction effect of the listener sex and relationship. A post hoc Tukey revealed that male listeners, when interacting with their siblings (regardless of speaker sex), express longer BC smiles (p&lt;0.05).</p><p>Similarly, the intensity of smiles marginally differed by the speaker's sex. The post hoc Tukey revealed that the smiles as a response to a male speaker are less intense than a female speaker (p&lt;0.1). ANOVA analysis is presented in the appendix as Table <ref type="table" target="#tab_2">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Effect of Context Cues</head><p>Our contextual cues were extracted from prosody and speech features independently derived from the turns of both the speaker and the listener just before the smile onset. Since the speaker's turn continues while the listener backchannels, speaker activity till the onset of the smiles was considered in this study. The audio was trimmed to the onset to obtain corresponding contextual cues, and the Montreal Forced Aligner (MFA) <ref type="bibr" target="#b18">[19]</ref> was used to extract corresponding transcription information. Prosody cues: Our prosodic features consisted of some of the fundamental characteristics of speech, such as mean pitch during the turn, range of the pitch, and Root Mean Square (RMS) energy of the audio signal. These features were chosen because of their relevance (see related work) in BC behavior and also due to the ease of interpretation as well as their ability to convey various behavioral traits. For example, RMS energy conveys traits such as confidence, doubtfulness, and enthusiasm <ref type="bibr" target="#b19">[20]</ref>. Lastly, using the OpenSMILE <ref type="bibr" target="#b20">[21]</ref> software, prosodic features were obtained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Speech cues:</head><p>The spoken content of speaker and listener turns was also accounted for through variables from the Linguistic Inquiry and Word Count (LIWC) <ref type="bibr" target="#b21">[22]</ref> framework. These variables were word count, usage of negations (no, not, never), comparisons (greater, best, after), interrogative words (how, when, what), valence of the turns (positive or negative emotion), and focus on events in the past, present and future.</p><p>A generalized linear model predicted the smile intensity from context cues and dyad demographics. Results using an inverse link function (model explained variance 𝑅 2 = 0.243) with the prosody and speech cues from the audio signal are presented as Figure <ref type="figure" target="#fig_2">3</ref>. Note that the speakers' and listeners' context cues were Z-score normalized. Speaker characteristics such as sex and negations were found to be significant predictors of intensity. Female speakers elicited significantly narrower smiles from their listeners, but the speaker's usage of negations resulted in wider smiles. The speaker's loudness (RMS energy) had a marginally significant negative correlation with the smile intensity. Listener behavior also significantly impacted their BC smiles. Using comparative words by the listener and their mean pitch in their preceding turn resulted in significantly narrower smiles. In contrast, their word count had a marginally significant positive correlation with intensity. A similar analysis for duration did not reveal any significant correlations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Modeling Smiles</head><p>To automatically generate BC smile and non-smile activity in listeners, we use the audio from the speaker's current turn and the listener's last turn as input. 15 smiles were dropped due to difficulties in the preprocessing steps with MFA. The remaining 142 annotated smile instances were augmented with an equal number of nonsmile instances. The non-smile instances were identified so that they were at least two seconds away from the onset of the closest smile instance, a strategy adopted from <ref type="bibr" target="#b22">[23]</ref> for turn-taking prediction. The mean duration of smiling and non-smiling instances was ensured to be the same.</p><p>Attention-based generative model: The generative model (Figure <ref type="figure" target="#fig_3">4</ref>) for facial landmark prediction primarily consisted of an encoder and a decoder with a one-layer GRU each. Inputs to the model were embeddings from speaker and listener turns extracted using the pretrained vggish model <ref type="bibr" target="#b23">[24]</ref>. We limited the input context length to use turn durations of 60 seconds. The output context was limited to predicting one second of facial activity. The speaker vggish embeddings were used as input to the encoder. The hidden state of the GRU was initialized as the mean of the listener's turn embeddings. The fi- nal hidden state of the encoder was concatenated with the conditioning vector, and a linear layer with ReLU activation was used to match the dimensionality of the decoder's hidden state. At each decoding step, attention <ref type="bibr" target="#b24">[25]</ref> was applied between the encoder output and the decoder's last hidden state (Equation <ref type="formula" target="#formula_0">1</ref>) to use as the input to the next step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝑎(𝑠</head><formula xml:id="formula_0">𝑡−1 , ℎ 𝑖 ) = 𝑣 𝑇 𝑡𝑎𝑛ℎ(𝑊 𝑎 ℎ 𝑖 + 𝑊 𝑏 𝑠 𝑡−1 )<label>(1)</label></formula><p>where 𝑎(𝑠 𝑡−1 , ℎ 𝑖 ) is the attention between decoder last hidden state (𝑠 𝑡−1 ) and encoder output (ℎ 𝑖 ). 𝑊 𝑖 s and 𝑣 are linear layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Implementation details</head><p>The videos were split into two vertical halves, one corresponding to each individual in the dyadic interaction. These were used for facial landmark extraction using the AFARtoolbox <ref type="bibr" target="#b16">[17]</ref>. To account for various facial shapes, we normalized landmarks to the mean face of the dataset using the approach described in <ref type="bibr" target="#b25">[26]</ref>. Because of the high degree of correlation between successive frames, frames were downsampled by a factor of three, to use every third frame. Displacement was then calculated as the difference between the landmarks from successive frames. These were further subjected to a min-max normalization to allow for individual differences in smiling dynamics. The normalized displacements were predicted using the attention-based generative model. The predicted frame-level displacements were incorporated into the last known listener facial expression to generate the sequence of facial landmarks recursively.</p><p>We enforced teacher-forcing with simulated annealing during training and linearly decreased the likelihood of using ground truth at every 20 epochs. Stochastic Gradient Descent with a learning rate initialized at 1𝑒 − 4 weight decay and 0.99 momentum were used to minimize the Mean Squared Error (MSE) between predictions and the ground truth. The learning rate was halved when validation loss plateaued for 20 consecutive epochs. Data was partitioned into 75 (train), 15 (validation), and 15 (test) split in terms of the number of dyads. Models were trained for 250 epochs, and validation loss was used to determine the best model for testing. This was repeated 10 times to evaluate the statistical significance of differences against baseline speaker-based BC generation setting.</p><p>Metrics: Objective measures of performance from gesture generation approaches, including Average Pose Error (APE) and Probability of Correct Keypoints (PCK), were adopted to quantify the generated landmarks against the ground truth from the AFAR toolbox. APE (Equation <ref type="formula" target="#formula_1">2</ref>) is equivalent to the mean squared error between predicted facial expression and ground truth facial expression. PCK (Equation <ref type="formula" target="#formula_2">3</ref>) is a proximity-based metric that considers the landmark to be correctly predicted if the difference with ground truth falls below a margin. We report mean PCK for 𝜎 = 0.1 and 0.2.</p><formula xml:id="formula_1">𝐴𝑃𝐸 = 1 𝑘 𝑘 ∑ 𝑦=1 ‖( ŷ (𝑝) − 𝑦(𝑝))‖ 2<label>(2)</label></formula><p>where 𝑘 is the number of landmarks, ŷ (𝑝) is the prediction and 𝑦(𝑝) is the groundtruth.</p><formula xml:id="formula_2">𝑃𝐶𝐾 𝜎 = 1 𝑘 𝑘 ∑ 𝑦=1 𝛿(‖( ŷ (𝑝) − 𝑦(𝑝))‖ 2 ≤ 𝜎)<label>(3)</label></formula><p>where 𝛿 is an indicator function and 𝜎 is the margin.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Results</head><p>Using listener behavior and conditioning vector together with the speaker behavior resulted in improved performance compared to the baseline speaker behavior-based  <ref type="table" target="#tab_1">2</ref>, APE decreased by 0.273 points while PCK increased by 0.004; these gains were statistically significant. When listener behavior was added to the speaker behavior, marginally significant improvements were observed. APE reduced by 0.206 points while PCK increased by 0.001 points. These reiterate our hypothesis that both speaker and listener contribute to BC behaviors. When speaker behavior was augmented with the conditioning vector, only nominal differences were observed against the baseline. APE increased by 0.063 points, and PCK decreased by 0.001.</p><p>To understand how the performance varies with different smiles, we predicted APE (and PCK) as a linear combination of duration, intensity, and the model configuration using a regression model. Results from Figure <ref type="figure" target="#fig_4">5</ref> show that duration significantly affects the PCK. Interestingly, the positive slope suggests that longer smiles are generated better over shorter smiles. Only a marginally significant effect of duration can be observed for APE. With the increase in the intensity of the smile, the generation performance decreases. This is significant for D-level and E-level smiles. Using listener features and the conditioning vector along with the speaker features improves the performance (negative and positive slopes for APE and PCK, respectively) compared to the baseline speaker-based generation. However, this effect is not statistically significant.</p><p>Qualitative evaluation of ground truth landmarks from Figure <ref type="figure">6</ref> suggest the deficiencies of the existing facial landmark prediction approaches <ref type="bibr" target="#b16">[17]</ref> to accurately track lip corners both in the presence and absence of nonfrontal head pose. While a visually noticeable difference can be observed as the smile evolves, the ground truth landmarks fail to capture the subtle lip corner motion. This limitation in the ground truth has resulted in nominal motion in the predicted landmarks. We also found that BC smiles that co-occur with vocal activity are challenging to predict. Figure <ref type="figure" target="#fig_5">7</ref> shows one example where the vertical distance between the upper and lower lips increases and decreases because of the simultaneous yeah utterance. However, the model fails to capture this vertical motion.</p><p>Metrics like APE and PCK provide an objective measure of the prediction. However, evaluating concepts such as realism and contextual relevance of the BC prediction requires subjective ratings from human evaluation. A convention in evaluating landmark or keypoint-based generative approaches is the human comparison of predicted keypoints against the ground truth <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b26">27]</ref>. While this might work for problems such as gesture generation that involve a strong motion component, evaluating subtle behaviors like facial expressions using a similar strategy could be challenging. 
To address this concern, we leverage the emulated version of an embodied agent: Furhat <ref type="bibr" target="#b27">[28]</ref>.</p><p>Figure <ref type="figure">6</ref>: Two sample smiles from the dataset showing their onsets (left-most frame to widest smile frame) and offsets (widest smile frame to right-most frame). Note that while the evolution of smile is noticeable in ground truth landmarks (second row) of the top smile, subtle changes between successive frames of the bottom smile are not captured by its ground truth landmarks. This is also observed in the generated landmarks (third row). Zoom-in recommended. The faces used are from the RealTalk dataset. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Smiles on an Embodied Agent</head><p>So far, we have shown modeling smiles by generating facial landmarks. However, users in real-world scenarios do not expect to see such abstract representations of faces. Aligning these facial landmarks with embodied agents is key for an interactable conversational agent. To achieve this, we describe the procedure to transfer generated landmarks to an embodied robotic simulation system called Furhat. We then conduct a user study for subjective perceived differences in Furhat's behavior due to BC smile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Emulation Setup</head><p>Furhat allows users to control facial expressions using a set of facial parameters called BasicParams<ref type="foot" target="#foot_2">3</ref> (ex. MOUTH_SMILE_LEFT and MOUTH_SMILE_RIGHT to control the left and the right lip corners; BROW_UP_LEFT, BROW_UP_RIGHT to control the left and right eyebrows, etc.). Our setup uses these parameters to enable the embodied agent's smile and express associated eyebrow actions. The landmarks from a generated smile expression were used to calculate the displacement between successive frames and normalized to the [0, 1] range. For eyebrows, only vertical displacement was used. Our inputs to the Furhat API consisted of the lip corner and eyebrow displacements corresponding to the frame with the widest smile (maximum horizontal displacement between the lip corners). The duration of the Furhat smile was set to the duration of the generated smile. Figure <ref type="figure" target="#fig_6">8</ref> shows an example of the resultant expression. The user study was conducted using the Furhat Desktop SDK. However, we do not foresee difficulties transferring the emulation setup to a physically embodied Furhat.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">User Study Procedure</head><p>We conducted a small-scale user study of participants watching two pre-recorded videos of the Furhat interacting with an individual. They differ only in terms of Furhat expressing a BC smile. In both interactions, Furhat starts with a brief introduction of itself, followed by a short question-"How have you been feeling over the last two weeks?". As the user responds, a smile is generated at the appropriate location (see Figure <ref type="figure" target="#fig_6">8</ref>). We refer to this scenario as the backchannel setting. Another video of the same individual interacting with Furhat with no BC (non-backchannel) serves as our baseline. Seven graduate students then rated each video recording separately. Note that raters were not primed on the study's outcome, and no explicit instructions about smiles were given.</p><p>To quantify the user's perception of Furhat interacting with an individual, the influence of BC smile in addition to the effect of its intensity and duration, and their willingness to interact with one was quantified through the following questions on a 5-point Likert scale (1: strongly agree, 5: strongly disagree).</p><p>1. The Furhat's smiles looked human-like. 2. The Furhat's smiles looked natural and friendly. 3. I would talk to this agent frequently. 4. I felt the brightness of Furhat's smiles was appropriate. 5. The Furhat was smiling for longer or shorter duration than it was expected. 6. I would feel comfortable talking to this agent about non-personal topics. 7. I would feel comfortable talking to this agent about personal topics.</p><p>In addition, open-ended feedback was also a part of the questionnaire. We believe these questions help identify some user-facing challenges in generating BC behaviors and how they influence users' attitudes to embodied agent-based dialogue systems for conversations related to mental health.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Results</head><p>Table <ref type="table">3</ref> shows that more users (5/7) expressed moderate or higher agreement that the Furhat agent with BC smile was human-like than its counterpart without BC smile (4/7). One user expressed interest in frequently interacting with the agent in backchannel setting while the lack of backchannels resulted in increased hesitancy among users in frequently using it. Three (out of 7) users found</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Number of responses that expressed moderate or strong agreement along various factors related to the BC smiles when interacting with Furhat with and without backchannel behaviors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Question</head><p>Backchannel Non-backchannel Human-like that the brightness of the BC smile was appropriate while two found that the duration of BC smile was longer or shorter than expected. While no difference was observed in terms of users' preference for Furhat for personal conversations based on the presence of the BC smile, more users (3/7) responded that they would use Furhat with BC smiles for non-personal conversations over Furhat without BC smiles (2/7).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>Our quantitative results suggest that both speaker and listener behavior are important in generating BC behavior.</p><p>Using listener behavior together with the conditioning vector offered statistically significant improvements in performance when compared to the baseline speakeronly model. This effect was observed both in terms of APE and PCK. We also found that our attention-based generative model can predict low-intensity smiles better than high-intensity smiles. Our user study shows that more people find our agent human-like when it was able to express BC smiles. Participants prefer to interact with it over the agent with no BC smile capabilities for nonpersonal conversations. However, for intimate personal conversations, the presence of a BC smile did not sway their decision. Some limitations of this work include the following. We employed an affordable measure of reliability for BC smile annotations using a prediction model over a human rater. A robust approach would involve at least one more human annotator to perform reliability annotations on a portion of the dataset. The statistical analysis also assumes that the smiles were independent of the individuals and dyads. However, a given individual typically produces multiple smiles. Grouping of smiles by factors such as individuals and dyads can be better modelled using a mixed-effects model. Our user study was designed to demonstrate the feasibility of transferring generated facial landmarks to an embodied agent together with understanding perceived differences between interactions with and without BC smiles. An appropriate evaluation framework would include the user interacting with the agent. Followed by a comparison of qualitative subjective ratings of user experience and quantified parameters (such as difference in turn duration, language usage, etc.) of the interaction with and without BC smiles. We believe such approaches provide a holistic evaluation to identify critical instances in the interaction. Lastly, we focused on BC smiles leaving out other conventional signals such as vocal and headpose-based BCs, and how they are affected by the cues from the speaker and listener.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>To enable BCs in embodied agents for mental health applications, we proposed an annotated dataset of faceto-face conversations including topics related to mental health. Our statistical analysis showed that speaker gender together with prosodic and linguistic cues from both speaker and listener turns are significant predictors of the BC smile intensity. Using the significant predictors together with the speaker and listener behaviors to generate BC smiles offers significant improvements in terms of empirical metrics over the baseline speaker-centric generation.</p><p>We bridge the gap between conventional non-verbal behavior generation approaches such as landmarks and poses and their realization by showing that generated landmarks can be transferred to an embodied agent. Thus creating the opportunity for evaluation with a humanlike manifestation over a traditional evaluation by comparing generated landmark (or keypoint) outputs. Our small-scale user study suggests our Furhat agent that backchannels is more human-like and are more likely to attract users for non-personal interactions. In addition to these contributions, we also discussed some limitations in existing technology towards generating accurate ground truth landmarks through examples such as failure to capture mouth movement in bimodal BCs and how they affect the generated outputs. We believe these limitations also serve as directions for future research. Our work serves as a baseline for computer scientists interested in behavior generation, and an attractive source of BC smiles for behavioral scientists to study the effect of context cues on BC smiles in intimate conversations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Ethical Statement</head><p>We proposed a generative approach for backchannel smile production to enable naturalistic interactions with embodied AI agents for mental health dialogue. While our dataset offers diverse smiles from people in different interpersonal relationships, like many existing generative approaches, the choice of pretrained embeddings, imbalance between males and females, lack of male-male romantic relationships, and lack of age and ethnicity information in the dataset might have resulted in biased generations. We also acknowledge that using embodied agents in such sensitive applications should undergo rigorous evaluations by technical and domain experts and regulatory bodies. In our work, we do not interpret embodied agents as a substitute for professionals in mental health or allied areas of healthcare but to provide tools for them to better serve the community's demands. We believe that the advantages and limitations of embodied agents in mental health should be presented to the users and the healthcare experts to provide maximum benefits. The information used in this work is identified from a publicly available dataset. Also, special attention has been paid to privacy and copyright requirements for relevant images showing individual faces. The user study raters were voluntary participants, and the University of Pittsburgh IRB approved the data collection.  Figure <ref type="figure" target="#fig_9">9</ref> shows the distribution of annotated Backchannel (BC) smiles in terms of their intensity and duration. The predicted intensity using the automated approach showed that over 50% of smiles were of B-level intensity, and fewer instances of high-intensity smiles (D and Elevels) were also present. The mean duration was 3.18 ± 1.71 seconds. Note that the intensity of the smile differs marginally by the speaker sex. It is not affected by other factors such as relationship, listener sex and their interaction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.2.">Effect of Sex and Relationship on Smile Intensity</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Machine</head><label></label><figDesc>Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada. * Corresponding author.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of speaker and listener sex across different interpersonal relationships in annotated RealTalk dataset. Relationships are color-coded: siblings (pink), friends (orange), paternal (green), and romantic couple (grey).</figDesc><graphic coords="3,332.10,84.19,142.35,81.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Regression slopes showing the effect of context cues on the intensity of BC smiles. A positive slope indicates the smile intensity increases with a given feature (vice-versa for a negative slope). * indicates slope is significant at p&lt;0.05 and ⋅ indicates marginal significance at p&lt;0.1.</figDesc><graphic coords="4,89.29,299.30,203.37,156.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Generative model architecture. Encoder input contains speech embeddings of listener and speaker from the pretrained vggish model. The encoder's final hidden state is concatenated with the conditioning vector and then used to initialize the decoder's hidden state. Decoder output landmarks are sequentially fed (dotted curves) to generate the next landmarks in the output sequence.</figDesc><graphic coords="5,89.29,84.19,416.70,110.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Effect of duration and intensity of smile along with ablation of inputs on generative model performance measured using APE (top) and PCK (bottom). S &amp; C-speaker and conditioning vector, S &amp; L-speaker and listener, and S, L &amp; C-speaker and listener and conditioning vector as inputs to the model. '⋅', '*' and '***' indicate significance with p &lt;0.1, p &lt;0.05 and p &lt;0.001 respectively.</figDesc><graphic coords="6,302.62,248.60,203.37,158.37" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Limitation of the current approach in generating a bimodal backchannel smile. The frames highlighted in red box correspond to the co-occurring verbal "yeah". Notice that ground truth landmarks (second row) fail to capture the vertical mouth movement. This is also observed in the generated landmarks (third row). Zoom-in recommended. The faces used are from the RealTalk dataset.</figDesc><graphic coords="7,89.29,344.54,203.35,76.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Four frames of an example Furhat robot emulation with different levels of smiles used as backchannels during the conversation in our user study.</figDesc><graphic coords="8,118.77,84.19,142.35,52.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Distribution of intensity and duration of BC smiles in the annotated dataset. The spread of the histograms shows the diversity of the annotated smiles.</figDesc><graphic coords="11,89.29,146.45,203.35,67.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>ANOVA of listener sex, speaker sex, and relationship on duration of smile. '*' indicates p&lt;0.05 and '**' indicates p&lt;0.01).</figDesc><table><row><cell></cell><cell cols="4">Df Sum Sq Mean Sq F value</cell><cell>Pr(&gt;F)</cell></row><row><cell>𝑠𝑒𝑥 𝑙𝑖𝑠𝑡𝑒𝑛𝑒𝑟</cell><cell>1</cell><cell>12.36</cell><cell>12.36</cell><cell>4.59</cell><cell>0.0339 *</cell></row><row><cell>𝑠𝑒𝑥 𝑠𝑝𝑒𝑎𝑘𝑒𝑟</cell><cell>1</cell><cell>1.29</cell><cell>1.29</cell><cell>0.48</cell><cell>0.4907</cell></row><row><cell>𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝</cell><cell>3</cell><cell>4.18</cell><cell>1.39</cell><cell>0.52</cell><cell>0.6709</cell></row><row><cell>𝑠𝑒𝑥 𝑙𝑖𝑠𝑡𝑒𝑛𝑒𝑟  *  𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝</cell><cell>3</cell><cell>42.80</cell><cell>14.27</cell><cell cols="2">5.29 0.0017 **</cell></row><row><cell>𝑠𝑒𝑥 𝑙𝑖𝑠𝑡𝑒𝑛𝑒𝑟  *  𝑠𝑒𝑥 𝑠𝑝𝑒𝑎𝑘𝑒𝑟</cell><cell>1</cell><cell>0.90</cell><cell>0.90</cell><cell>0.33</cell><cell>0.5652</cell></row><row><cell>𝑠𝑒𝑥 𝑠𝑝𝑒𝑎𝑘𝑒𝑟  *  𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝</cell><cell>3</cell><cell>9.70</cell><cell>3.23</cell><cell>1.20</cell><cell>0.3123</cell></row><row><cell>Residuals</cell><cell>144</cell><cell>388.03</cell><cell>2.69</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Average Pose Error (APE) and Probability of Correct Keypoints (PCK) metrics for generated facial expressions under various experimental settings. A downward-facing arrow indicates lower value implies better generation. '*' indicates significance with p &lt;0.05 with '⋅' indicates marginal significance with p &lt;0.1.</figDesc><table><row><cell>Model</cell><cell>APE↓</cell><cell>PCK↑</cell></row><row><cell>Speaker only (Baseline) Speaker and Listener</cell><cell>9.552 9.346 ⋅</cell><cell>0.219 0.220 ⋅</cell></row><row><cell>Speaker and Listener with Conditioning vector Speaker and Conditioning vector</cell><cell cols="2">9.279* 0.223* 9.615 0.218 ⋅</cell></row></table><note>prediction. As shown in Table</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>ANOVA of listener sex, speaker sex, and relationship on intensity of smile. '⋅' indicates significant at p&lt;0.1.</figDesc><table><row><cell></cell><cell cols="4">Df Sum Sq Mean Sq F value</cell><cell>Pr(&gt;F)</cell></row><row><cell>𝑠𝑒𝑥 𝑙𝑖𝑠𝑡𝑒𝑛𝑒𝑟</cell><cell>1</cell><cell>0.53</cell><cell>0.53</cell><cell>0.60</cell><cell>0.4417</cell></row><row><cell>𝑠𝑒𝑥 𝑠𝑝𝑒𝑎𝑘𝑒𝑟</cell><cell>1</cell><cell>2.93</cell><cell>2.93</cell><cell cols="2">3.31 0.0710 ⋅</cell></row><row><cell>𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝</cell><cell>3</cell><cell>3.23</cell><cell>1.08</cell><cell>1.22</cell><cell>0.3055</cell></row><row><cell>𝑠𝑒𝑥 𝑙𝑖𝑠𝑡𝑒𝑛𝑒𝑟  *  𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝</cell><cell>3</cell><cell>2.00</cell><cell>0.67</cell><cell>0.75</cell><cell>0.5225</cell></row><row><cell>𝑠𝑒𝑥 𝑙𝑖𝑠𝑡𝑒𝑛𝑒𝑟  *  𝑠𝑒𝑥 𝑠𝑝𝑒𝑎𝑘𝑒𝑟</cell><cell>1</cell><cell>0.10</cell><cell>0.10</cell><cell>0.11</cell><cell>0.7424</cell></row><row><cell>𝑠𝑒𝑥 𝑠𝑝𝑒𝑎𝑘𝑒𝑟  *  𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝</cell><cell>3</cell><cell>3.15</cell><cell>1.05</cell><cell>1.19</cell><cell>0.3176</cell></row><row><cell>Residuals</cell><cell>144</cell><cell>127.49</cell><cell>0.89</cell><cell></cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Data and code: https://github.com/bmaneesh/Generating-Context-Sensitive-Backchannel-Smiles/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The original videos can be accessed from https://www.youtube. com/c/TheSkinDeep</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://docs.furhat.io/remote-api/#python-remote-api</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Acknowledgments</head><p>Bilalpur and Cohn were supported by the U.S. National Institutes of Health through award MH R01-096951. Zeinali was supported through the Khoury Distinguished Fellowship at Northeastern University.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Exploring barriers to mental health care in the u</title>
		<author>
			<persName><forename type="first">H</forename><surname>Modi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Orgera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Grover</surname></persName>
		</author>
		<idno type="DOI">10.15766/rai_a3ewcf9p</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Spectral representation of behaviour primitives for depression analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jaiswal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Valstar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="829" to="844" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Multimodal temporal machine learning for bipolar disorder and depression recognition</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ceccarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mahmoud</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Analysis and Applications</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="493" to="504" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Detecting depression severity from vocal prosody</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fairbairn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Cohn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE transactions on affective computing</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="142" to="150" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ambadar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Cohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">I</forename><surname>Reed</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of nonverbal behavior</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="17" to="34" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Simsensei kiosk: A virtual human interviewer for healthcare decision support</title>
		<author>
			<persName><forename type="first">D</forename><surname>Devault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Artstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Benn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gainer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Georgila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gratch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartholt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lhommet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems</title>
				<meeting>the 2014 international conference on Autonomous agents and multi-agent systems</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1061" to="1068" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Collaborative user responses in multiparty interaction with a couples counselor robot</title>
		<author>
			<persName><forename type="first">D</forename><surname>Utami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bickmore</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE</title>
				<imprint>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="294" to="303" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Prosodic features which cue back-channel responses in english and japanese</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tsukahara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of pragmatics</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="1177" to="1207" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">The prosody of backchannels in american english</title>
		<author>
			<persName><forename type="first">S</forename><surname>Benus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gravano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Hirschberg</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Backchannels revisited from a multimodal perspective</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bertrand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ferré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blache</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Espesser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rauzy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Auditory-visual Speech Processing</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A multimodal analysis of vocal and visual backchannels in spontaneous dialogs</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Truong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Poppe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Heylen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">INTERSPEECH</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="2973" to="2976" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Backchannel-inviting cues in task-oriented dialogue</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gravano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hirschberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Tenth Annual Conference of the International Speech Communication Association</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Every smile is unique: Landmark-guided diverse smile generation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Alameda-Pineda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ricci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sebe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="7083" to="7092" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Learn2smile: Learning non-verbal interaction through observation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kannan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gkioxari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="4131" to="4138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Learning to listen: Modeling nondeterministic dyadic facial motion</title>
		<author>
			<persName><forename type="first">E</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Joo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kanazawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ginosar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="20395" to="20405" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Teotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tendulkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Menon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vondrick</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.10939</idno>
		<title level="m">Affective faces for goal-driven dyadic communication</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Afar: A deep learning based tool for automated facial affect recognition</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">O</forename><surname>Ertugrul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Jeni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Cohn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">14th IEEE international conference on automatic face &amp; gesture recognition (FG 2019)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="1" to="1" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Collobert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.05862</idno>
		<title level="m">wav2vec: Unsupervised pre-training for speech recognition</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mcauliffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Socolof</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mihuc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wagner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sonderegger</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2017-1386</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech 2017</title>
				<meeting>Interspeech 2017</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="498" to="502" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Memon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.15869</idno>
		<title level="m">Acoustic correlates of the voice qualifiers: A survey</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Opensmile: the munich versatile and fast open-source audio feature extractor</title>
		<author>
			<persName><forename type="first">F</forename><surname>Eyben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wöllmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th ACM international conference on Multimedia</title>
				<meeting>the 18th ACM international conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1459" to="1462" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Boyd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Blackburn</surname></persName>
		</author>
		<title level="m">The development and psychometric properties of LIWC2015</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Ekstedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Skantze</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.09812</idno>
		<title level="m">Voice activity projection: Self-supervised learning of turn-taking events</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Cnn architectures for large-scale audio classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hershey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chaudhuri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Gemmeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Plakal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Platt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Saurous</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Seybold</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 ieee international conference on acoustics, speech and signal processing (icassp)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="131" to="135" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.0473</idno>
		<title level="m">Neural machine translation by jointly learning to align and translate</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Sign language production using neural machine translation and generative adversarial networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Stoll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">C</forename><surname>Camgöz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hadfield</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bowden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th British Machine Vision Conference (BMVC 2018)</title>
				<meeting>the 29th British Machine Vision Conference (BMVC 2018)</meeting>
		<imprint>
			<publisher>British Machine Vision Association</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">No gestures left behind: Learning relationships between spoken language and freeform gestures</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ahuja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Morency</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1884" to="1895" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Al Moubayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Beskow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Skantze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Granström</surname></persName>
		</author>
		<title level="m">Furhat: a back-projected humanlike robot head for multiparty human-machine interaction</title>
				<meeting><address><addrLine>Dresden, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">February 21-26, 2011. 2012</date>
			<biblScope unit="page" from="114" to="130" />
		</imprint>
	</monogr>
	<note>Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011. Revised Selected Papers</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
