<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maneesh Bilalpur</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mert Inan</string-name>
          <email>inan.m@northeastern.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dorsa Zeinali</string-name>
          <email>zeinali.d@northeastern.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jefrey F.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Northeastern University</institution>
          ,
          <addr-line>Boston, Massachusetts</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pittsburgh</institution>
          ,
          <addr-line>Pittsburgh, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Addressing the critical shortage of mental health resources for efective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility and eficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and cost-efective supplement to traditional caregiving methods. Crucial to these agents' efectiveness is their ability to simulate non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents we annotated backchannel smiles in videos of intimate face-to-face conversations over topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors afect the duration and intensity of backchannel smiles. Using cues from speech prosody and language along with the demographics of the speaker and listener, we found them to contain significant predictors of the intensity of backchannel smiles. Based on our findings, we introduce backchannel smile production in embodied agents as a generation problem. Our attention-based generative model suggests that listener information ofers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. Our user study by transferring generated smiles to an embodied agent suggests that agent with backchannel smiles is perceived to be more Workshop Proceedings</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>human-like and is an attractive alternative for non-personal conversations over agent without backchannel smiles.
CEUR</p>
      <p>ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Fewer than a third of the US population has suficient
access to mental health professionals [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This highlights
the need for additional resources to help mental health
professionals meet the community’s demands. Problems
like symptom detection and evaluating treatment
eficacy have made great strides with AI [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ] and the
mental health community can greatly benefit from this
AI intervention. Embodied agent-based systems due to
their multimodal behavioral capabilities are a
promising solution to support such mental health needs.
However, the development of such systems presents
numerous challenges. These include the scarcity of mental
health-related datasets, limited access to domain experts
for designing reliable and robust systems, and the
ethical considerations crucial to their design and adaptation.
      </p>
      <sec id="sec-2-1">
        <title>Among such challenges, one aspect that stands out is the agent’s ability to establish a common ground with users. Addressing this is particularly crucial when the agent functions as a listener. Efective grounding in such</title>
        <p>Machine Learning for Cognitive and Mental Health Workshop
(ML4CMH), AAAI 2024, Vancouver, BC, Canada.
∗Corresponding author.
nEvelop-O
in an embodied agent in a human-agent interaction: Speaker
and listener (agent) turns are used to generate the listener’s
response facial expression as landmarks. The landmarks are
then integrated with the embodied agent and added to the
conversation flow represented as a dotted arrow.
scenarios relies heavily on multimodal non-verbal
behaviors like backchannels. These subtle yet impactful
cues are pivotal in building rapport and understanding
between the user and the agent. Hence, understanding
and incorporating these behaviors into embodied agents
is not only challenging but also essential for creating a
supportive and empathetic environment for individuals and their physical realization by emulating the
seeking mental health support. Addressing these chal- generated behavior with an embodied agent.
lenges can pave the way for more efective, accessible, 5. Show that our BC smile generation yields
approand empathetic digital mental health interventions. priate and natural-looking smiles through a user</p>
        <p>In dyadic conversations, at any given time one person study involving the embodied agent.
may have the floor (i.e., is speaking) while the other is
listening. Backchannels (BC) refer to behaviors of the Results suggest speaker sex, their use of negations,
listener that do not interrupt the speaker. BCs signal loudness, word count in the listener’s turn, their usage of
attention, agreement, and emotional response to what is comparisons, and mean pitch are significant predictors
said. Inappropriate BC smiles such as ones that appear of BC smile intensity. Our generative approach shows
too short or too long or for which the timing appears that taking listeners’ behavior into account improves
“of” can disrupt the conversational rapport and result in performance, and adding the conditioning vector ofers
unsuccessful or disrupted conversations. Our objective significant improvements in terms of empirical metrics
is to understand appropriate BC smiles from dyadic con- such as Average Pose Error (APE) and Probability of
versations and how an embodied agent can employ them Correct Keypoints (PCK).
when interacting with a human.</p>
        <p>
          Conversational agents typically realize BC smiles us- 2. Related Work
ing rule-based systems, discriminative approaches, or
sometimes simply mimicking the smiles of the speaker. Existing works have validated the eficacy of an
agentMimicking, however, fails to generalize to situations that driven conversation in mental health dialogue and
counrequire a contextually relevant smile. And rule-based seling situations. DeVault et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], through their
agentand discriminative approaches ofer limited coverage due based interviews for distress and trauma symptoms,
to the diversity of smiles [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. found that participants were comfortable interacting with
        </p>
        <p>
          We present a generative approach for BC smiles in the agent as well as sharing intimate information. Utami
listeners to address these limitations and enable contextu- and Bickmore [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] used embodied agents for couples
counally relevant BC smiles in embodied agents. An overview seling. Participants reported significantly improved
afof the approach is presented in Figure 1. Unlike existing fect and intimacy with their partner and generally
enworks that solely depend on speaker behavior for BC pro- joyed the agent-driven counseling session. Our work
duction (see related work section), we use both speaker builds on this line of research to improve the BC
capabiland listener behaviors to study how they afect the in- ities of agents.
tensity and duration of the BC smile. We use cues from Backchannel behaviors were traditionally produced
prosody, language, and the demographics of dyads to using a set of predefined rules based on prosodic or
linidentify statistically significant predictors (referred to as guistic cues of the speaker. Both Ward and Tsukahara
a conditioning vector) of smiles. In addition to the audio [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Benus et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] have found prosodic cues
(particufeatures from both interaction participants, we leverage larly pitch and its changes) to be reliable predictors for
the conditioning vector in generating the BC smiles. In vocal BC occurrence. In contrast, we use prosody and
this paper, we: linguistic cues from both speaker and listener to identify
significant predictors of BC smiles.
1. Annotate backchannel smiles in a face-to-face In the multimodal context, Bertrand et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
studinteraction dataset1 of dyads that difer in their ied prosodic, morphological, and discourse markers for
composition of biological sex and type of relation- their efect on vocal and gestural backchannels (hand
gesship. tures, smiles, eyebrows), and Truong et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] explored
2. Present our statistical analysis to identify vari- visual BCs by often limiting them to head nods and, at
ous speaker and listener-specific cues that sig- times, grouping diferent BCs into the same category [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
nificantly predict the duration and intensity of without accounting for their intrinsic diferences. They
backchannel smiles. depended on the speaker’s behavior to identify the
occur3. Generate backchannel smiles using an attention- rence and ignored the listener. In addition to leveraging
based generative model that uses the listener and the listener behavior, we specifically study smiles because
speaker turn features with the identified signifi- of their diversity and include both unimodal (visual) and
cant predictors. bimodal (visual together with vocal activity) BC smiles.
4. Bridge the gap between the model-based genera- Wang et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] introduced diversity in generated
tion of non-verbal behaviors (as facial landmarks) smiles by conditioning on a specific class and sampling
using a variational autoencoder. Learn2Smile [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] used
1Data and code: https://github.com/bmaneesh/Generating-Context- the facial landmarks of the speaker to generate
comSensitive-Backchannel-Smiles/ plete listener behavior by separately predicting the
lowfrequency (nods) and high-frequency (blinks)
components of facial motion. Ng et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] leverage the speaker
and listener’s motion and speech features to predict
the listener’s future motion information. Unlike earlier
works that have been limited to facial expression
generation using landmarks, their usage of 3D Morphable
Models to define facial expressions ofers a flexible solution
to generate realistic facial expressions in the presence
of diverse head orientations. These solutions focus on Figure 2: Distribution of speaker and listener sex across
diferthe entire listener’s behavior and ofer no insights about ent interpersonal relationships in annotated RealTalk dataset.
specific BC behaviors. Their integrations are also limited Relationships are color-coded: siblings (pink), friends (orange),
to 3D Morphable Models. paternal (green), and romantic couple (grey).
        </p>
        <p>
          The BC smiles produced in this work not only leverage
the speaker and listener activity but also condition the
generation on salient factors that were found to be signif- the 191 annotated smiles had an A-level or higher
inicant predictors of smile attributes – duration (the time tensity. One outlier smile was dropped because of the
elapsed between the onset of a smile and its ofset) and extremely long duration. The resultant 157 smiles, along
intensity (maximum amplitude of a smile). Using an em- with their predicted intensity, were used in this work.
bodied agent, we also bridge the gap between generated In addition to the video recordings at 25 fps and 720p
landmarks and their physical realization. resolution, the dataset also contains speaker-identified
turn-level text obtained through automatic transcription
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The individuals in the dyadic interaction occupied
3. Dataset ifxed positions (left and right) in the videos. In this work,
the biological sex of the participants was inferred from
the videos. Videos where sex could not be established
with confidence were discarded.
        </p>
        <sec id="sec-2-1-1">
          <title>3.2. Efect of Sex and Relationship on</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>Smile Attributes</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Given various interpersonal relationships in the dataset</title>
        <p>of individuals of both sexes, we compared the mean
duration of backchannel smiles across the factors using
ANOVA (Table 1) with type-III sum of squares to account
for imbalance between males and females. Two-way
interactions between sex, and sex and relationship were
also included. The ANOVA analysis suggests that the
duration of backchannel smiles difers significantly by
listener sex and the interaction efect of the listener sex
and relationship. A post hoc Tukey revealed that male
listeners, when interacting with their siblings (regardless
of speaker sex), express longer BC smiles (p&lt;0.05).</p>
        <p>Similarly, the intensity of smiles marginally difered
by the speaker’s sex. The post hoc Tukey revealed that
the smiles as a response to a male speaker are less
intense than a female speaker (p&lt;0.1). ANOVA analysis is
presented in the appendix as Table 4.</p>
        <sec id="sec-2-2-1">
          <title>3.3. Efect of Context Cues</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Our contextual cues were extracted from prosody and</title>
        <p>
          speech features independently derived from the turns of
both the speaker and the listener just before the smile
onset. Since the speaker’s turn continues while the listener
backchannels, speaker activity till the onset of the smiles
One of the primary challenges in studying non-verbal
behavior in mental health interactions is access to an
appropriate dataset. Patient-therapist interactions or
interactions with mental health professionals are
accessrestricted to protect the identifiable information of the
individuals. As a result, we use a YouTube-based
largescale dataset of face-to-face dyadic interactions–RealTalk
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The RealTalk dataset consists of individuals taking
turns asking predefined, intimate questions about family,
dreams, relationships, illness, and mental health2. We
believe intimate conversations are among the closest
accessible alternatives to studying BC behaviors for mental
health applications. In this section, we elaborate on our
contributions in terms of the annotations for BC smiles
and discuss how they difer by the demographics of the
dyads and features from the speaker and listener turn
preceding it.
        </p>
        <sec id="sec-2-3-1">
          <title>3.1. Annotating Backchannel Smiles</title>
          <p>
            We manually annotated 191 BC smiles from 48 (out of 692)
dyadic interactions in the RealTalk dataset. The dyads
comprised male and female participants from diferent
ethnicities, and social relationships such as siblings,
paternal, romantic, and fraternal. The smiles were nearly
balanced across the diferent interpersonal relationships
(see Figure 2). An automated facial expression prediction
framework [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] was used to evaluate the reliability of
the manual annotations. About 83% (i.e., 158 smiles) of
          </p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2The original videos can be accessed from https://www.youtube.</title>
        <p>com/c/TheSkinDeep</p>
      </sec>
      <sec id="sec-2-5">
        <title>Prosody cues: Our prosodic features consisted of some</title>
        <p>
          of the fundamental characteristics of speech, such as
mean pitch during the turn, range of the pitch, and Root
Mean Square (RMS) energy of the audio signal. These
features were chosen because of their relevance (see related
work) in BC behavior and also due to the ease of
interpretation as well as their ability to convey various behavioral
traits. For example, RMS energy conveys traits such as
confidence, doubtfulness, and enthusiasm [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Lastly,
using the OpenSMILE [21] software, prosodic features
were obtained.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Modeling Smiles</title>
      <p>To automatically generate BC smile and non-smile
activity in listeners, we use the audio from the speaker’s
current turn and the listener’s last turn as input. 15
smiles were dropped due to dificulties in the
preprocessing steps with MFA. The remaining 142 annotated smile
instances were augmented with an equal number of
nonsmile instances. The non-smile instances were identified
so that they were at least two seconds away from the
onset of the closest smile instance, a strategy adopted
from [23] for turn-taking prediction. The mean duration
of smiling and non-smiling instances was ensured to be
the same.</p>
      <p>Attention-based generative model: The generative
model (Figure 4) for facial landmark prediction primarily
consisted of an encoder and a decoder with a one-layer
GRU each. Inputs to the model were embeddings from
speaker and listener turns extracted using the pretrained
vggish model [24]. We limited the input context length
to use turn durations of 60 seconds. The output context
was limited to predicting one second of facial activity.</p>
      <p>The speaker vggish embeddings were used as input to
the encoder. The hidden state of the GRU was initialized
as the mean of the listener’s turn embeddings. The
final hidden state of the encoder was concatenated with the Mean Squared Error (MSE) between predictions and
the conditioning vector, and a linear layer with ReLU the ground truth. The learning rate was halved when
activation was used to match the dimensionality of the validation loss plateaued for 20 consecutive epochs. Data
decoder’s hidden state. At each decoding step, attention was partitioned into 75 (train), 15 (validation), and 15
[25] was applied between the encoder output and the de- (test) split in terms of the number of dyads. Models were
coder’s last hidden state (Equation 1) to use as the input trained for 250 epochs, and validation loss was used to
deto the next step. termine the best model for testing. This was repeated 10
times to evaluate the statistical significance of diferences
(1) against baseline speaker-based BC generation setting.</p>
      <p>ℎ +    −1 )
where ( −1 , ℎ ) is the attention between decoder last
hidden state ( −1 ) and encoder output (ℎ ).   s and  are
linear layers.</p>
      <sec id="sec-3-1">
        <title>Metrics: Objective measures of performance from ges</title>
        <p>
          ture generation approaches, including Average Pose
Error (APE) and Probability of Correct Keypoints (PCK),
were adopted to quantify the generated landmarks
4.1. Implementation details against the ground truth from the AFAR toolbox. APE
The videos were split into two vertical halves, one cor- (Equation 2) is equivalent to the mean squared error
beresponding to each individual in the dyadic interaction. tween predicted facial expression and ground truth facial
These were used for facial landmark extraction using the expression. PCK (Equation 3) is a proximity-based metric
AFARtoolbox [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. To account for various facial shapes, that considers the landmark to be correctly predicted if
we normalized landmarks to the mean face of the dataset the diference with ground truth falls below a margin.
using the approach described in [26]. Because of the We report mean PCK for  = 0.1 and 0.2.
high degree of correlation between successive frames,
fervaemryesthwiredrefrdamowe.nDsaimspplalecdembeynat wfaacstotrheonf tcharlceuel,attoedusaes   = 1 ∑=1 ‖( (̂) −  ()) ‖2 (2)
the diference between the landmarks from successive
frames. These were further subjected to a min-max nor- where  is the number of landmarks,  (̂) is the
premalization to allow for individual diferences in smiling diction and  () is the groundtruth.
dynamics. The normalized displacements were predicted 1 
using the attention-based generative model. The pre-    = ∑  ( ‖( (̂) −  ()) ‖2 ≤  ) (3)
dicted frame-level displacements were incorporated into  =1
the last known listener facial expression to generate the where  is an indicator function and  is the margin.
sequence of facial landmarks recursively.
        </p>
        <p>We enforced teacher-forcing with simulated annealing
during training and linearly decreased the likelihood of 4.2. Results
using ground truth at every 20 epochs. Stochastic Gra- Using listener behavior and conditioning vector together
dient Descent with a learning rate initialized at 1 − 4 with the speaker behavior resulted in improved
perforweight decay and 0.99 momentum were used to minimize mance compared to the baseline speaker behavior-based
prediction. As shown in Table 2, APE decreased by 0.273
points while PCK increased by 0.004; these gains were
statistically significant. When listener behavior was added
to the speaker behavior, marginally significant
improvements were observed. APE reduced by 0.206 points while
PCK increased by 0.001 points. These reiterate our
hypothesis that both speaker and listener contribute to BC
behaviors. When speaker behavior was augmented with
the conditioning vector, only nominal diferences were
observed against the baseline. APE increased by 0.063
points, and PCK decreased by 0.001.</p>
        <p>To understand how the performance varies with
different smiles, we predicted APE (and PCK) as a linear
combination of duration, intensity, and the model
configuration using a regression model. Results from Figure 5
show that duration significantly afects the PCK.
Interestingly, the positive slope suggests that longer smiles are
generated better over shorter smiles. Only a marginally Figure 5: Efect of duration and intensity of smile along with
significant efect of duration can be observed for APE. ablation of inputs on generative model performance measured
With the increase in the intensity of the smile, the gen- using APE (top) and PCK (bottom). S &amp; C-speaker and
condieration performance decreases. This is significant for tainodnilnisgtveencetroar,nSd&amp;cLo-nsdpietaiokneirnagnvdelcisttoernaesr, iannpduSts,Lto&amp;tCh-espmeoadkeerl.
D-level and E-level smiles. Using listener features and ‘⋅’, ‘*’ and ‘***’ indicate significance with p &lt;0.1, p &lt;0.05 and p
the conditioning vector along with the speaker features &lt;0.001 respectively.
improves the performance (negative and positive slopes
for APE and PCK, respectively) compared to the baseline
speaker-based generation. However, this efect is not utterance. However, the model fails to capture this
vertistatistically significant. cal motion.</p>
        <p>
          Qualitative evaluation of ground truth landmarks from Metrics like APE and PCK provide an objective
meaFigure 6 suggest the deficiencies of the existing facial sure of the prediction. However, evaluating concepts
landmark prediction approaches [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] to accurately track such as realism and contextual relevance of the BC
prediclip corners both in the presence and absence of non- tion requires subjective ratings from human evaluation.
frontal head pose. While a visually noticeable diference A convention in evaluating landmark or keypoint-based
can be observed as the smile evolves, the ground truth generative approaches is the human comparison of
prelandmarks fail to capture the subtle lip corner motion. dicted keypoints against the ground truth [
          <xref ref-type="bibr" rid="ref14">14, 27</xref>
          ]. While
This limitation in the ground truth has resulted in nom- this might work for problems such as gesture
generainal motion in the predicted landmarks. We also found tion that involve a strong motion component, evaluating
that BC smiles that co-occur with vocal activity are chal- subtle behaviors like facial expressions using a similar
lenging to predict. Figure 7 shows one example where strategy could be challenging. To address this concern,
the vertical distance between the upper and lower lips we leverage the emulated version of an embodied agent:
increases and decreases because of the simultaneous yeah Furhat [28].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Smiles on an Embodied Agent</title>
      <sec id="sec-4-1">
        <title>So far, we have shown modeling smiles by generating</title>
        <p>facial landmarks. However, users in real-world scenarios
do not expect to see such abstract representations of
faces. Aligning these facial landmarks with embodied
agents is key for an interactable conversational agent.
To achieve this, we describe the procedure to transfer
generated landmarks to an embodied robotic simulation
system called Furhat. We then conduct a user study for
subjective perceived diferences in Furhat’s behavior due
to BC smile.</p>
        <sec id="sec-4-1-1">
          <title>5.1. Emulation Setup</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Furhat allows users to control facial expressions using</title>
        <p>
          a set of facial parameters called BasicParams3 (ex.
MOUTH_SMILE_LEFT and MOUTH_SMILE_RIGHT
to control the left and the right lip corners;
BROW_UP_LEFT, BROW_UP_RIGHT to control
the left and right eyebrows, etc.). Our setup uses these
parameters to enable the embodied agent’s smile and
express associated eyebrow actions. The landmarks
from a generated smile expression were used to calculate
the displacement between successive frames and
normalized to the [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] range. For eyebrows, only vertical
displacement was used. Our inputs to the Furhat API
consisted of the lip corner and eyebrow displacements
corresponding to the frame with the widest smile
(maximum horizontal displacement between the lip
corners). The duration of the Furhat smile was set to
the duration of the generated smile. Figure 8 shows an
example of the resultant expression. The user study was
conducted using the Furhat Desktop SDK. However, we
do not foresee dificulties transferring the emulation
setup to a physically embodied Furhat.
        </p>
        <sec id="sec-4-2-1">
          <title>5.2. User Study Procedure</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>We conducted a small-scale user study of participants</title>
        <p>watching two pre-recorded videos of the Furhat
interacting with an individual. They difer only in terms of Furhat
expressing a BC smile. In both interactions, Furhat starts
with a brief introduction of itself, followed by a short
question–“How have you been feeling over the last two
weeks?”. As the user responds, a smile is generated at
the appropriate location (see Figure 8). We refer to this that the brightness of the BC smile was appropriate while
scenario as the backchannel setting. Another video of two found that the duration of BC smile was longer or
the same individual interacting with Furhat with no BC shorter than expected. While no diference was observed
(non-backchannel) serves as our baseline. Seven gradu- in terms of users’ preference for Furhat for personal
conate students then rated each video recording separately. versations based on the presence of the BC smile, more
Note that raters were not primed on the study’s outcome, users (3/7) responded that they would use Furhat with
and no explicit instructions about smiles were given. BC smiles for non-personal conversations over Furhat</p>
        <p>To quantify the user’s perception of Furhat interacting without BC smiles (2/7).
with an individual, the influence of BC smile in addition
to the efect of its intensity and duration, and their will- 6. Discussion
ingness to interact with one was quantified through the
following questions on a 5-point Likert scale (1: strongly
agree, 5: strongly disagree).</p>
      </sec>
      <sec id="sec-4-4">
        <title>Our quantitative results suggest that both speaker and lis</title>
        <p>tener behavior are important in generating BC behavior.</p>
        <p>Using listener behavior together with the conditioning
1. The Furhat’s smiles looked human-like. vector ofered statistically significant improvements in
2. The Furhat’s smiles looked natural and friendly. performance when compared to the baseline
speaker3. I would talk to this agent frequently. only model. This efect was observed both in terms of
4. I felt the brightness of Furhat’s smiles was appro- APE and PCK. We also found that our attention-based
priate. generative model can predict low-intensity smiles better
5. The Furhat was smiling for longer or shorter du- than high-intensity smiles. Our user study shows that
ration than it was expected. more people find our agent human-like when it was able
6. I would feel comfortable talking to this agent to express BC smiles. Participants prefer to interact with
about non-personal topics. it over the agent with no BC smile capabilities for
non7. I would feel comfortable talking to this agent personal conversations. However, for intimate personal
about personal topics. conversations, the presence of a BC smile did not sway
their decision.</p>
        <p>In addition, open-ended feedback was also a part of the Some limitations of this work include the following.
questionnaire. We believe these questions help identify We employed an afordable measure of reliability for BC
some user-facing challenges in generating BC behav- smile annotations using a prediction model over a
huiors and how they influence users’ attitudes to embodied man rater. A robust approach would involve at least one
agent-based dialogue systems for conversations related more human annotator to perform reliability annotations
to mental health. on a portion of the dataset. The statistical analysis also
assumes that the smiles were independent of the
individ5.3. Results uals and dyads. However, a given individual typically
produces multiple smiles. Grouping of smiles by factors
such as individuals and dyads can be better modelled
using a mixed-efects model. Our user study was designed
to demonstrate the feasibility of transferring generated
facial landmarks to an embodied agent together with
understanding perceived diferences between interactions
with and without BC smiles. An appropriate evaluation
framework would include the user interacting with the romantic relationships, and lack of age and ethnicity
inagent. Followed by a comparison of qualitative subjec- formation in the dataset might have resulted in biased
tive ratings of user experience and quantified parameters generations. We also acknowledge that using embodied
(such as diference in turn duration, language usage, etc.) agents in such sensitive applications should undergo
rigof the interaction with and without BC smiles. We believe orous evaluations by technical and domain experts and
such approaches provide a holistic evaluation to identify regulatory bodies. In our work, we do not interpret
emcritical instances in the interaction. Lastly, we focused on bodied agents as a substitute for professionals in mental
BC smiles leaving out other conventional signals such as health or allied areas of healthcare but to provide tools
vocal and headpose-based BCs, and how they are afected for them to better serve the community’s demands. We
by the cues from the speaker and listener. believe that the advantages and limitations of
embodied agents in mental health should be presented to the
users and the healthcare experts to provide maximum
7. Conclusion benefits. The information used in this work is identified
from a publicly available dataset. Also, special attention
has been paid to privacy and copyright requirements for
relevant images showing individual faces. The user study
raters were voluntary participants, and the University of
Pittsburgh IRB approved the data collection.</p>
        <p>To enable BCs in embodied agents for mental health
applications, we proposed an annotated dataset of
faceto-face conversations including topics related to mental
health. Our statistical analysis showed that speaker
gender together with prosodic and linguistic cues from both
speaker and listener turns are significant predictors of
the BC smile intensity. Using the significant predictors 9. Acknowledgments
together with the speaker and listener behaviors to
generate BC smiles ofers significant improvements in terms Bilalpur and Cohn were supported by the U.S. National
Inof empirical metrics over the baseline speaker-centric stitutes of Health through award MH R01-096951. Zeinali
generation. was supported through the Khoury Distinguished
Fel</p>
        <p>We bridge the gap between conventional non-verbal lowship at Northeastern University.
behavior generation approaches such as landmarks and
poses and their realization by showing that generated
landmarks can be transferred to an embodied agent. Thus References
creating the opportunity for evaluation with a
humanlike manifestation over a traditional evaluation by
comparing generated landmark (or keypoint) outputs. Our
small-scale user study suggests our Furhat agent that
backchannels is more human-like and are more likely to
attract users for non-personal interactions. In addition
to these contributions, we also discussed some
limitations in existing technology towards generating accurate
ground truth landmarks through examples such as failure
to capture mouth movement in bimodal BCs and how
they afect the generated outputs. We believe these
limitations also serve as directions for future research. Our
work serves as a baseline for computer scientists
interested in behavior generation, and an attractive source of
BC smiles for behavioral scientists to study the efect of
context cues on BC smiles in intimate conversations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Ethical Statement</title>
      <p>We proposed a generative approach for backchannel
smile production to enable naturalistic interactions with
embodied AI agents for mental health dialogue. While
our dataset ofers diverse smiles from people in diferent
interpersonal relationships, like many existing
generative approaches, the choice of pretrained embeddings,
imbalance between males and females, lack of male-male</p>
      <sec id="sec-5-1">
        <title>Note that the intensity of the smile difers marginally by the speaker sex. It is not afected by other factors such as relationship, listener sex and their interaction.</title>
        <p>Mean Sq
F value
 
 
ℎ
 
ℎ
 
 
 
ℎ
∗
∗
∗
0.60
3.31
1.22
0.4417
0.0710 ⋅
0.3055</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Orgera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <article-title>Exploring barriers to mental health care in the u</article-title>
          .s. (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .15766/ rai_a3ewcf9p.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valstar</surname>
          </string-name>
          ,
          <article-title>Spectral representation of behaviour primitives for depression analysis</article-title>
          ,
          <source>IEEE Transactions on Afective Computing</source>
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>829</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ceccarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          ,
          <article-title>Multimodal temporal machine learning for bipolar disorder and depression recognition</article-title>
          ,
          <source>Pattern Analysis and Applications</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>493</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fairbairn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <article-title>Detecting depression severity from vocal prosody</article-title>
          ,
          <source>IEEE transactions on afective computing 4</source>
          (
          <year>2012</year>
          )
          <fpage>142</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ambadar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. I. Reed</surname>
          </string-name>
          ,
          <article-title>All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous</article-title>
          ,
          <source>Journal of nonverbal behavior 33</source>
          (
          <year>2009</year>
          )
          <fpage>17</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>D. DeVault</surname>
          </string-name>
          , R. Artstein, G. Benn,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gainer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Georgila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gratch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartholt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lhommet</surname>
          </string-name>
          , et al.,
          <article-title>Simsensei kiosk: A virtual human interviewer for healthcare decision support, in: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems</article-title>
          ,
          <year>2014</year>
          , pp.
          <fpage>1061</fpage>
          -
          <lpage>1068</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Utami</surname>
          </string-name>
          , T. Bickmore,
          <article-title>Collaborative user responses (2020). in multiparty interaction with a couples counselor</article-title>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wöllmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <article-title>Opensmile: the robot</article-title>
          ,
          <source>in: 2019</source>
          14th ACM/IEEE International Con
          <article-title>- munich versatile and fast open-source audio feaference on Human-Robot Interaction (HRI), IEEE, ture extractor</article-title>
          ,
          <source>in: Proceedings of the 18th ACM</source>
          <year>2019</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>303</lpage>
          . international conference on Multimedia,
          <year>2010</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W. Tsukahara, Prosodic features which 1459-
          <fpage>1462</fpage>
          .
          <article-title>cue back-channel responses in english and japanese</article-title>
          , [22]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Boyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <article-title>BlackJournal of pragmatics 32 (</article-title>
          <year>2000</year>
          )
          <fpage>1177</fpage>
          -
          <lpage>1207</lpage>
          . burn,
          <article-title>The development</article-title>
          and psychometric proper-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Benus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gravano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          ,
          <source>The prosody ties of LIWC2015</source>
          ,
          <source>Technical Report</source>
          ,
          <year>2015</year>
          .
          <article-title>of backchannels in american english (</article-title>
          <year>2007</year>
          ). [23]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ekstedt</surname>
          </string-name>
          , G. Skantze, Voice activity projec-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bertrand</surname>
          </string-name>
          , G. Ferré,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blache</surname>
          </string-name>
          , R. Espesser,
          <article-title>tion: Self-supervised learning of turn-taking events, S. Rauzy, Backchannels revisited from a multimodal arXiv preprint</article-title>
          arXiv:
          <volume>2205</volume>
          .09812 (
          <year>2022</year>
          ). perspective, in: Auditory-visual Speech Processing, [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Gemmeke</surname>
          </string-name>
          ,
          <year>2007</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . A.
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Plakal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Platt</surname>
          </string-name>
          , R. A.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Poppe</surname>
          </string-name>
          , I. de Kok,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heylen</surname>
          </string-name>
          ,
          <article-title>A mul-</article-title>
          <string-name>
            <surname>Saurous</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Seybold</surname>
          </string-name>
          , et al.,
          <article-title>Cnn architectures for timodal analysis of vocal and visual backchannels large-scale audio classification, in: 2017 ieee interin spontaneous dialogs</article-title>
          .,
          <source>in: INTERSPEECH</source>
          ,
          <year>2011</year>
          , national conference on acoustics, speech and signal pp.
          <fpage>2973</fpage>
          -
          <lpage>2976</lpage>
          . processing (icassp), IEEE,
          <year>2017</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gravano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          , Backchannel-inviting [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Bengio,</surname>
          </string-name>
          <article-title>Neural machine cues in task-oriented dialogue, in: Tenth Annual translation by jointly learning to align and translate</article-title>
          ,
          <source>Conference of the International Speech Communi- arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
          <source>cation Association</source>
          ,
          <year>2009</year>
          . [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Stoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Camgöz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hadfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bowden</surname>
          </string-name>
          , Sign
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Alameda-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fua</surname>
          </string-name>
          , E. Ricci,
          <article-title>language production using neural machine translaN</article-title>
          . Sebe,
          <article-title>Every smile is unique: Landmark-guided tion and generative adversarial networks, in: Prodiverse smile generation</article-title>
          ,
          <source>in: Proceedings of the ceedings of the 29th British Machine Vision ConIEEE Conference on Computer Vision and Pattern ference (BMVC</source>
          <year>2018</year>
          ),
          <source>British Machine Vision AssoRecognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7083</fpage>
          -
          <lpage>7092</lpage>
          . ciation,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kannan</surname>
          </string-name>
          , G. Gkioxari, C. L. Zit- [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <article-title>No gesnick, Learn2smile: Learning non-verbal interac- tures left behind: Learning relationships between tion through observation, in: 2017 IEEE/RSJ In- spoken language and freeform gestures, in: Findternational Conference on Intelligent Robots and ings of the Association for Computational LinguisSystems (IROS)</article-title>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>4131</fpage>
          -
          <lpage>4138</lpage>
          . tics:
          <source>EMNLP</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>1884</fpage>
          -
          <lpage>1895</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kanazawa</surname>
          </string-name>
          , [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Al Moubayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beskow</surname>
          </string-name>
          , G. Skantze,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ginosar</surname>
          </string-name>
          , Learning to listen: Modeling non- B. Granström,
          <article-title>Furhat: a back-projected humandeterministic dyadic facial motion, in: Proceedings like robot head for multiparty human-machine of the IEEE/CVF Conference on Computer Vision interaction</article-title>
          ,
          <source>in: Cognitive Behavioural Systems: and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>20395</fpage>
          -
          <lpage>20405</lpage>
          . COST 2102
          <string-name>
            <given-names>International</given-names>
            <surname>Training</surname>
          </string-name>
          <string-name>
            <surname>School</surname>
          </string-name>
          , Dresden,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Teotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tendulkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menon</surname>
          </string-name>
          , C. Von- Germany, February 21-
          <issue>26</issue>
          ,
          <year>2011</year>
          ,
          <string-name>
            <surname>Revised</surname>
          </string-name>
          <article-title>Selected drick, Afective faces for goal-driven dyadic com-</article-title>
          <source>Papers</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>130</lpage>
          . munication,
          <source>arXiv preprint arXiv:2301.10939</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Ertugrul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Jeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <article-title>Afar: A deep learning based tool for automated facial afect recognition</article-title>
          ,
          <source>in: 2019 14th IEEE international conference on automatic face &amp; gesture recognition (FG</source>
          <year>2019</year>
          ), IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          , wav2vec:
          <article-title>Unsupervised pre-training for speech recognition</article-title>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>05862</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>M. McAulife</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Socolof</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Mihuc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sonderegger</surname>
          </string-name>
          , Montreal Forced Aligner:
          <article-title>Trainable Text-Speech Alignment Using Kaldi</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>498</fpage>
          -
          <lpage>502</lpage>
          . doi:
          <volume>10</volume>
          .21437/ Interspeech.2017-
          <volume>1386</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Memon</surname>
          </string-name>
          ,
          <article-title>Acoustic correlates of the voice qualifiers: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2010.15869 10. Appendix 10.1. Distribution of Intensity and Duration of Smiles 10.2. Efect of Sex and Relationship on Smile Intensity 0.67 0.10 1.05 0.89 0.75 0.11 1.19 0.5225 0.7424 0</source>
          .
          <fpage>3176</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>