<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Support Conversation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mengzhao Jia</string-name>
          <email>jiamengzhao98@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qianglong Chen</string-name>
          <email>chenqianglong@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liqiang Jing</string-name>
          <email>jingliqiang6@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dawei Fu</string-name>
          <email>dawei.fdw@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renyu Li</string-name>
          <email>renyu.rl@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alibaba Group</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shandong University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Texas at Dallas</institution>
          ,
          <addr-line>TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Zhejiang University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The prevalence of mental disorders has become a significant issue, leading to increased focus on Emotional Support Conversation as an effective supplement for mental health support. Existing methods have achieved compelling results; however, they still face three challenges: 1) variability of emotions, 2) practicality of the response, and 3) intricate strategy modeling. To address these challenges, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation (MODERN). Specifically, we first devise a knowledge-enriched dialogue context encoding to perceive the dynamic emotion change across different periods of the conversation for coherent user state modeling, and select context-related concepts from ConceptNet for practical response generation. Thereafter, we implement a novel memory-enhanced strategy modeling module to model the semantic patterns behind the strategy categories. Extensive experiments on a widely used large-scale dataset verify the superiority of our model over cutting-edge baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Mental disorders</kwd>
        <kwd>Emotional Support Conversation</kwd>
        <kwd>Mental health support</kwd>
        <kwd>ConceptNet</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Mental disorders are known for their high burden:
more than 50% of adults experience a mental illness
or disorder at some point in their lives, yet despite this
high prevalence, only one in five patients receives
professional treatment. Recent studies have shown that an
effective mental health intervention method is the
provision of emotional support.</p>
      <sec id="sec-2-1">
        <p>
          Emotional Support Conversation (ESConv), as defined
          by Liu et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], has garnered substantial attention in
recent years. It has emerged as a promising
alternative strategy for mental health intervention, paving
the way for the development of neural dialogue systems
designed to provide support for those in need.
        </p>
        <p>
          The ESConv takes place between a help-seeker (user)
and a supporter (dialogue model) in a multi-turn
manner. It requires the dialogue model to employ a range
of supportive strategies effectively, easing the emotional
distress of the users and helping them overcome the
challenges they face. Prior research primarily concentrated
on two aspects. The first aimed to enhance the model's
comprehension of the contextual semantics in the
conversations, such as the user's situation, emotions, and
intentions. An example of these efforts is the work of Peng
et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], who designed a hierarchical graph network to
capture the overall emotional problem cause and specific
user intentions. The second aspect focused on predicting
the dialogue strategy accurately and responding based
on the predicted strategy category. For example, Cheng
et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] employed lookahead heuristics for dialogue
strategy planning.
        </p>
        <p>Machine Learning for Cognitive and Mental Health Workshop
(ML4CMH), AAAI 2024, Vancouver, BC, Canada.
∗Corresponding author: R. Li.</p>
        <p>Despite the success of existing studies, this task is
nontrivial due to the following three challenges.</p>
        <p>• Variability of emotions. As the conversation
progresses, the user's emotional state evolves
subtly and constantly. Accurately recognizing this
emotional change is indispensable to
understanding the user's real-time state and thus responding
empathically [7, 8, 9]. How to model the dynamic
emotional change during the dialogue process is
the first challenge.
• Practicality of the response. In the absence of
explicit cues, neural dialogue systems are inclined
to make generic responses [10, 11]. As shown in
Figure 1, generic responses fail to
provide personalized and suitable suggestions for
the user's specific concerns. To resolve this issue,
introducing context-related concepts (doctor and
recover) can promote generating more
meaningful and actionable suggestions for specific
situations. As a result, the integration of appropriate
concepts poses a non-trivial challenge in
generating practical responses.</p>
        <p>
          Our contributions can be summarized as follows: 1)
We first analyze the current challenges of the ESConv
task, and accordingly propose a novel
knowledge-enhanced Memory mODEl for emotional suppoRt
coNversation, named MODERN, which can model complex
supportive strategies as well as utilize emotional
knowledge and context-related concepts to perceive the
variability of emotions and provide practical support
advice. 2) We propose a memory-enhanced strategy
modeling module, where a unique memory bank is designed
to model intricate strategy patterns, and an auxiliary
strategy classification task is introduced to distill the
strategy pattern information. 3) We present a
thorough validation and evaluation of our model,
providing an in-depth analysis of the results and a
comparison with other models. The extensive experiments
on the ESConv dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] demonstrate that MODERN
achieves state-of-the-art performance under both
automatic and human evaluations. The code is available at
https://projs2release.wixsite.com/modern.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>2.1. Emotional and Empathetic Dialogue Systems</title>
        <p>[Figure 1: An example emotional support conversation about a family member sick with COVID. Generic responses (e.g., "Don't worry, she will be fine." and "I'm sorry to hear about it.") are contrasted with a response enriched by practical knowledge information ("From my knowledge, people usually recover from COVID after 2 to 3 weeks; you might consider having a doctor see your mother if she is still struggling."), drawing on context-related concepts such as hospital, doctor, worried, uneasy, disease, medicine, recover, and improve.]</p>
        <p>• Intricate strategy modeling. Dialogue strategy,
as a kind of linguistic pattern, has been reported
to be a highly complex concept encompassing
various intricate linguistic features [12, 13]. Previous
work resorted to a single vector (a category
indicator) for strategy representation, which is
insufficient to fully represent the complex strategy
pattern information. Therefore, how to model
strategy information sufficiently is the third
challenge.</p>
        <p>With the popularity and growing success of dialogue
systems, many researchers have recently
endeavored to empower such systems to reply with a specific and
proper emotion, therefore forming a more human-like
conversation. Particularly, two research directions have attracted</p>
        <p>
          researchers' interest, namely the emotional [15] and
empathetic [16] response generation. The former direction
expects the dialogue agent to respond with a given
emotion [17, 18, 19, 20], while the latter requires the dialogue
system to actively detect and understand the users'
emotions and then respond with an appropriate emotion [21].
For example, Lin et al. utilized multiple decoders as
different listeners to react to different emotions and then softly
combined the output states of the decoders
based on the recognized user's emotion. Nevertheless,
unlike the above directions, the ESConv task concentrates
on alleviating users' negative emotion intensity and
providing supportive instructions to help them overcome
struggling situations.
        </p>
        <p>
          To overcome these challenges, as shown in Figure 2, we
introduce a novel knowledge-enhanced Memory mODEl
for emotional suppoRt coNversation, dubbed MODERN.
In particular, MODERN adopts BART [14] as its
backbone and consists of the knowledge-enriched dialogue
context encoding module, the memory-enhanced
strategy modeling module, and the response decoding module.
To capture the emotional change as the conversation
progresses, the first module detects the emotions for all
utterances and explicitly injects them into the dialogue context
as a kind of emotional knowledge, thus understanding
the user's status coherently. In addition, this module also
introduces the concepts reasoning and selection
component to acquire valid context-related concepts from a
knowledge graph called ConceptNet [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and incorporates
them into the dialogue context to fulfill meaningful and
practical suggestion generation. Moreover, in contrast
to existing studies that depend on simplistic indicators
to represent strategy categories, the memory-enhanced
strategy modeling module learns strategy patterns via a
strategy-specific memory bank. In this way, it can detect
and mimic the intricate patterns in human emotional
support strategies. Finally, the third module aims to generate
the target response with the BART decoder.
        </p>
        <p>2.2. Emotional Support Conversation</p>
        <p>
          As an emerging research task, emotional support
conversation has gradually attracted intense attention in recent
years. Existing works mostly focus on two aspects. The
first is to understand the complicated user emotions and
intentions.
        </p>
        <p>[Figure 2: The architecture of MODERN. The memory-enhanced strategy modeling component stores pattern representations of training-set responses in a strategy-specific memory bank, and a Transformer decoder (self-attention, cross-attention, and feed-forward layers, each followed by Add &amp; Norm) generates the response from [BOS] to [EOS], e.g., "I would suggest talking to the professor…".]</p>
        <p>
          To this end, researchers explored context semantic relationships [
          <xref ref-type="bibr" rid="ref4">23, 4</xref>
          ], commonsense knowledge [
          <xref ref-type="bibr" rid="ref4">24, 4</xref>
          ], or emotion causes [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to better
capture and understand the emotions and intentions of
users. The other trend in addressing the task is to
predict the strategy category accurately so as to respond in
accordance with it [25, 23]. For instance, Tu et al. first
proposed to predict a strategy probability distribution
and generate the response guided by a mixture of
multiple strategies. Despite their remarkable achievements,
existing work still faces three challenges: intricate
strategy modeling, variability of emotions, and practicality of
the responses.
        </p>
        <p>3. Task Formulation</p>
        <p>The goal of the
ESConv task is to learn a model ℱ that can generate a
supportive response R̂ referring to the input context C
and situation S as follows,
R̂ = ℱ(C, S | Θ),
(1)
where Θ refers to the set of to-be-learned parameters of
the model ℱ. Notably, the ground-truth strategy can be utilized
in the model training stage but is not available and needs
to be predicted in the inference phase. For brevity, we
temporarily omit the superscript of the i-th sample in
the rest of this paper.</p>
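        <p>Viewed programmatically, Eqn. (1) is simply a mapping from a dialogue context and a situation description to a response. The following placeholder sketch (all names hypothetical; the body is a stub rather than a real generator) makes the interface concrete:</p>

```python
# Eqn. (1): R_hat = F(C, S | Theta). A stub standing in for the trained model;
# a real implementation would decode a response token by token.

def esconv_model(context, situation, theta=None):
    """Map a dialogue context C and seeker situation S to a response R_hat."""
    return "I am here for you."  # placeholder supportive response

r_hat = esconv_model(context=["I lost my job."], situation="job crisis")
```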
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>In this section, we detail the proposed model MODERN,
which comprises three main components: knowledge-enriched
dialogue context encoding, memory-enhanced
strategy modeling, and response decoding, as demonstrated
in Figure 2. For concise mathematical expression, we first declare
some notations in this paper. We use bold capital letters
(X) and bold lowercase letters (x) to represent matrices
and vectors, respectively. We adopt non-bold letters to
denote scalars, and Greek letters refer to hyperparameters.
All the vectors, if not clarified, are in column form.</p>
      <sec id="sec-4-1">
        <title>Problem Setting</title>
        <p>In the setting of emotional support conversation, the
dialogue is participated in by a help-seeker and a supporter,
and the latter tries to comfort and support the help-seeker
to lower his or her emotional intensity level.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Knowledge-enriched Dialogue Context Encoding</title>
          <p>Suppose we have a training
dataset D = {d_1, d_2, … , d_N} composed of N samples. Each
sample d = {S, C, g, R} includes a seeker's situation S,
a dialogue context C, a support strategy g, and a target
response R. Therein, C contains a sequence of
history utterances between the user and the supporter. The
target is to generate a response based on the dialogue
history. Besides, the supporter is required to select one of
the candidate strategies and respond accordingly.</p>
          <p>In this section, we first utilize an emotion detector to
recognize fine-grained emotions of utterances as
emotional knowledge for capturing emotional change. In
addition, we select related concepts from the ConceptNet
knowledge graph for meaningful and practical
suggestion generation.</p>
          <p>4.1.1. Change-aware Emotion Detection</p>
          <p>Psychiatric and mental health studies have proved that
empathy is essential to emotionally helping
relationships [26, 27, 28], and fine-grained emotional
information is one of the key factors in enhancing
empathetic ability [29]. Apart from static emotional
signals, dynamic emotional changes during the conversation
are also beneficial. Concretely, change-aware
emotion information enables the model to understand
the user's status coherently. Inspired by this, we devise
a component to identify the user's fine-grained emotions and perceive
the dynamic changes of emotions in the dialogue context.</p>
          <p>Specifically, we start by obtaining the fine-grained
emotion via an off-the-shelf pretrained emotion
detector2, which can recognize up to 28 different emotional
categories, and apply the detector to every utterance in
the dialogue context.</p>
          <p>We first identify all the concepts in
ConceptNet that are mentioned in the dialogue context
and remove the top-frequent concepts in the training
set, because these words are usually too general to provide
valid suggestions for a specific situation, such as help,
thing, and feeling. In this way, we derive the anchor
concepts, denoted as {c_1, ⋯ , c_N}.</p>
          <p>Thereafter, we leverage these concepts as
anchors to reason about related concepts. Specifically, for
each anchor concept c, we retrieve all its one-hop
neighboring concept-relation pairs from
ConceptNet.</p>
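          <p>The concept selection steps above can be sketched as follows; the mini-graph, the set of over-frequent words, and all names are hypothetical stand-ins for the real ConceptNet data:</p>

```python
# Toy sketch of concept reasoning and selection: find anchor concepts in the
# context, drop over-generic ones, then expand one-hop neighbors while
# filtering the excluded relations.

EXCLUDED = {"Antonym", "ExternalURL", "NotCapableOf"}
TOP_FREQUENT = {"thing", "help", "feeling"}  # too generic to be useful

# (relation, neighbor) pairs per concept -- a hypothetical mini-graph.
GRAPH = {
    "exam": [("RelatedTo", "stress"), ("Antonym", "holiday")],
    "doctor": [("CapableOf", "treat"), ("ExternalURL", "http://example")],
}

def select_concepts(context_tokens, graph=GRAPH):
    # 1) anchors: graph concepts mentioned in the context, minus generic ones
    anchors = [t for t in context_tokens if t in graph and t not in TOP_FREQUENT]
    # 2) one-hop expansion with excluded relations filtered out
    related = []
    for anchor in anchors:
        related += [h for (r, h) in graph[anchor] if r not in EXCLUDED]
    return anchors, related

anchors, related = select_concepts(["my", "exam", "doctor", "help"])
```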
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>Formally, the emotion recognition can be written as
e_i = Emo(u_i),
0 ≤ i ≤ N_u,
(2)
where e_i is the predicted emotional category word
representing the detected emotion in the utterance u_i. Thereafter, we
directly inject the natural language form of the emotion
category words into the dialogue context as additional
emotion knowledge. This practice aligns closely with
the input format of the pretrained BART model.
Moreover, it also avoids introducing unnecessary parameters
that would interfere with the generative model learning.
Empirically, we concatenate the emotional category words
into the sequence of dialogue context tokens, denoted as
C = [u_1; e_1; SEP; … u_i; e_i; SEP; … u_{N_u}; e_{N_u}]. In this way, the
dynamic emotional changes corresponding to the
dialogue progress can be coherently exploited by the emotional
support model.</p>
        <p>4.1.2. Context-related Concepts Reasoning and Selection</p>
        <p>Considering that ConceptNet involves abundant general
human knowledge, which plays an important role in
understanding human situations and associating them with
practical suggestions, we select potentially useful
context-related concepts to enrich the model to generate
responses with high informativeness. For instance, the
commonsense knowledge database can easily relate
failing an exam to academic stress. In terms of daily human
activities, this knowledge serves as a guide for dealing
with daily affairs and problems. Such knowledge is
useful for providing advice and guidance in the emotional
support system. Therefore, we mine and associate
commonsense knowledge of the data to provide potentially
useful information for generating instructive responses.</p>
        <p>Concretely, the ConceptNet knowledge graph involves
3.1 million concepts and 38 million relations, which can
be used to mine the underlying concept information in
the dialogue context.</p>
        <p>Mathematically, let 𝒩(c) be the set of
neighboring concept-relation pairs of the anchor concept
c. Then the context-related concept-relation pairs
can be represented as {𝒩(c_1), 𝒩(c_2), ⋯ , 𝒩(c_N)}, where
each 𝒩(c_i) is a set of retrieved
&lt;relation, concept&gt; pairs. Following Liu et al. [30], to
avoid introducing extraneous concepts, we filter out
concepts with the excluded relations Antonym, ExternalURL, and
NotCapableOf. Finally, we concatenate the concepts in
the filtered pairs and deem them as the context-related
concepts K.</p>
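        <p>The emotion-knowledge injection of Section 4.1.1 (concatenating each utterance with its detected emotion word and a separator) can be sketched as below. The lexicon-based detector is a stub; the paper uses a pretrained 28-category classifier:</p>

```python
# Sketch of the knowledge-enriched context construction
# C = [u_1; e_1; SEP; ...]; names and the tiny lexicon are illustrative.

SEP = "[SEP]"

def detect_emotion(utterance):
    """Stand-in for the off-the-shelf emotion detector Emo(u)."""
    lexicon = {"worried": "fear", "stressful": "nervousness", "better": "relief"}
    for cue, emotion in lexicon.items():
        if cue in utterance.lower():
            return emotion
    return "neutral"

def build_context(utterances):
    """Inject the predicted emotion word after each utterance so the encoder
    sees the emotional trajectory of the whole conversation."""
    pieces = []
    for u in utterances:
        pieces.extend([u, detect_emotion(u), SEP])
    return " ".join(pieces[:-1])  # drop the trailing separator

context = build_context(["I am worried about my mother.", "She is doing better now."])
```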
      </sec>
      <sec id="sec-4-3">
      <title>4.1.3. Knowledge-enriched Dialogue Context Embedding</title>
        <p>To acquire the representation of the dialogue context
and corresponding knowledge, we first encode the
dialogue context sequence (user situation S, emotion-aware
dialogue context C, and concepts K) with a Transformer-based
encoder as follows,
H = Enc([S; C; K]),
(3)
where H ∈ ℝ^{L×D} denotes the hidden contextual
representation, and L and D refer to the number of tokens in [S; C; K]
and the hidden dimension, respectively.</p>
        <p>2https://huggingface.co/arpanghoshal/EmoRoBERTa.</p>
        <sec id="sec-4-3-1">
          <title>4.2. Memory-enhanced Strategy Modeling</title>
          <p>During the conversation, the supporter adopts different
strategies for different purposes, ultimately achieving
the goal of reducing the intensity of the user's negative
emotions. For example, using the Question strategy helps
the supporter to explore the concrete situation faced by
the user, while the use of the Reflection of Feelings
strategy conveys the supporter's understanding of the user's
current emotions. Existing work constrains the model
to responding under a strategy category by simply
providing a single vector indicating the strategy's name or
description. However, the semantic patterns of strategies
are highly complex, and the name or description is not able to
capture the specific linguistic patterns (expression
manner, wording, and phrasing) of the strategy. Therefore,
inspired by [31], we propose to disentangle the strategy
patterns from same-strategy responses to provide
more specific guidance for the strategy-constrained
generation.</p>
          <p>4.2.1. Strategy Pattern Modeling</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <p>We first acquire the strategy pattern representation of each
response R in the training set via a strategy pattern
extractor Enc as follows,
r = MaxPooling(Enc(R)),
(4)
where r ∈ ℝ^D denotes the strategy pattern representation.
Meanwhile, in order to accurately capture the strategy
pattern information and avoid irrelevant disturbance, we
design an auxiliary strategy classification task to guide
the extractor to map more strategy-related information
into the pattern representation. The auxiliary objective
ℒ_aux is derived by the following loss function,
ℒ_aux = − log P(g | r),
(5)
where g denotes the ground-truth strategy category.</p>
        <p>4.2.2. Strategy-specific Memory Bank</p>
        <p>To utilize detailed and ample strategy pattern information,
instead of using a single representation vector, we
devise a memory bank mechanism to store multiple
pattern representations according to their strategy
categories. In particular, we denote the memory bank as
ℳ = {M_1, … , M_N}, in which M_k ∈ ℝ^{n_k×D} is a matrix of
pattern representations corresponding to the k-th
strategy category, and N is the total number of strategy
categories. n_k is 0 at the initial training step. As the training
progresses, n_k continues to increase until the maximum
threshold n_max is reached. In particular, we store the pattern
representation of the k-th strategy category into the
corresponding M_k by concatenation as follows,
M_k ⟵ [M_k; r],
(6)
where r denotes a representation belonging to the k-th
strategy category and [⋅; ⋅] refers to the concatenation
operation. As the representations are optimized along
with the classification training process, we dynamically
update M_k in a first-in-first-out manner as follows,
M_k = M_k[n_k − n_max ∶ n_k] if n_k &gt; n_max, and M_k otherwise,
(7)
where n_max and n_k denote the maximum storage limit
and the current storage volume of each memory matrix,
respectively. The algorithm of the memory storing and
updating operation is summarized in the appendix.</p>
        <p>In order to use the strategy pattern information in the
memory bank, the model requires selecting a proper
strategy category based on the dialogue context. To achieve
this, we leverage a strategy predictor, which aims to
capture information relevant to strategy decisions in the
context. The strategy predictor is composed of a Transformer-based
encoder and a classification module. The encoder
first encodes the dialogue context into a strategy prediction
representation. It is worth noting that we adopt
independent representations for strategy prediction and response
generation, considering the fact that jointly optimal
solutions may not always exist for different tasks.
Subsequently, the classification module maps the vector to
an N-dimensional vector, which is regarded as the
probability distribution over the N strategy types. Formally, the
strategy prediction can be written as follows,
s = MaxPooling(Enc(C)),
ĝ = argmax(σ(MLP(s))),
(8)
where Enc is a Transformer-based encoder. The strategy
prediction representation is denoted as s ∈ ℝ^D. MLP
and σ(⋅) are a multi-layer perceptron and the Sigmoid
function, respectively. The argmax operation is used to
obtain the predicted strategy category ĝ. We use the
following objective to optimize the strategy prediction
task,
ℒ_str = − log P(g | s).
(9)</p>
        <p>4.2.4. Memory-enhanced Encoding</p>
        <p>After predicting the strategy category ĝ, instead of
directly using ĝ as an indicator to constrain the response
generation, we integrate the corresponding
memory bank matrix M_ĝ and the context
representation, so as to fully exploit the abundant pattern
information of the particular strategy.
Empirically, we fuse the matrix and the context
representation with a multi-head cross-attention module [32]
as follows,
m = MaxPooling(CrossAtt(H, M_ĝ)),
(10)
where H and M_ĝ act as the query and the key-value pair
in the cross-attention, respectively, and m ∈ ℝ^D denotes the
memory-enhanced strategy modeling feature.</p>
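        <p>A minimal sketch of the memory bank updates (Eqns. 6–7) together with a single-head, unprojected version of the cross-attention pooling (Eqn. 10) is given below; Python lists replace tensors, and all sizes are toy values rather than the paper's settings:</p>

```python
# Sketch of memory-enhanced strategy modeling: a per-strategy FIFO memory
# bank (Eqns. 6-7) plus a single-head, unprojected cross-attention pooling
# (a simplification of the multi-head module in Eqn. 10). Names and sizes
# are illustrative; the model stores learned pattern representations r.
import math

class MemoryBank:
    def __init__(self, num_strategies, n_max):
        self.banks = {k: [] for k in range(num_strategies)}
        self.n_max = n_max

    def store(self, k, r):
        bank = self.banks[k]
        bank.append(r)                  # concatenate r into M_k (Eqn. 6)
        if len(bank) > self.n_max:      # FIFO: keep the newest n_max rows (Eqn. 7)
            del bank[: len(bank) - self.n_max]

def _softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def memory_feature(H, M):
    """m = MaxPooling(CrossAtt(H, M)): rows of H are queries, rows of M are
    key-value pairs; max-pool the attended vectors over query positions."""
    attended = []
    for q in H:
        w = _softmax([sum(a * b for a, b in zip(q, k)) for k in M])
        attended.append([sum(wi * ki[d] for wi, ki in zip(w, M))
                         for d in range(len(M[0]))])
    return [max(col) for col in zip(*attended)]

bank = MemoryBank(num_strategies=8, n_max=3)
for step in range(5):                   # overflowing the bank exercises FIFO
    bank.store(0, [float(step), 1.0])
m = memory_feature(H=[[1.0, 0.0]], M=bank.banks[0])
```

        <p>Storing five vectors with n_max = 3 exercises the first-in-first-out rule: only the three newest representations survive in the bank.</p>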
        <sec id="sec-4-4-1">
          <title>4.3. Response Decoding</title>
          <p>In order to generate the emotionally supportive response,
we input the encoded features, i.e., the memory-enhanced
strategy modeling feature m and the knowledge-enriched
dialogue context embedding H, into the Transformer
decoder as follows,
P(ŷ_t ∣ ŷ_&lt;t, E) = Dec(ŷ_&lt;t, E),
(11)
where E = [m; H], and ŷ_&lt;t refers to the previously generated
t − 1 tokens of the target response. Dec(⋅) denotes the
decoder module. Notably, to avoid accumulated
error, we replace ŷ_&lt;t by y_&lt;t in the training phase. For
optimization, we introduce the standard cross-entropy loss
function for response generation as follows,
ℒ_gen = −(1/T) ∑_{t=1}^{T} log P(y_t ∣ y_&lt;t, E),
(12)
where T denotes the length of the target response.</p>
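          <p>The auto-regressive decoding loop can be sketched as follows, with a deterministic stub standing in for the decoder Dec(⋅) conditioned on E; all names are illustrative:</p>

```python
# Greedy auto-regressive generation sketch. stub_decoder is hypothetical;
# the real decoder conditions on the encoded representation E = [m; H].

BOS, EOS = "[BOS]", "[EOS]"

def stub_decoder(prefix, encoded):
    """Stand-in for Dec(prefix, E): returns the next token."""
    script = ["I", "would", "suggest", EOS]
    return script[min(len(prefix) - 1, len(script) - 1)]

def generate(encoded, max_steps=64):
    tokens = [BOS]
    for _ in range(max_steps):
        nxt = stub_decoder(tokens, encoded)
        tokens.append(nxt)
        if nxt == EOS:
            break
    if tokens[-1] == EOS:
        tokens = tokens[:-1]
    return tokens[1:]  # strip the BOS marker

out = generate(encoded=None)
```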
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Training Procedure</title>
        <p>Input: training set D for optimizing the model ℱ;
hyperparameters {λ_1, λ_2}. Output: parameters Θ.
Initialize the parameters Θ and the memory bank ℳ.
Randomly sample a batch from D; for each sample
(S, C, g, R), add the strategy pattern representation into the
memory bank ℳ by Eqn. (6) and update the memory bank
ℳ by Eqn. (7). Update Θ by optimizing the loss function
in Eqn. (13). Repeat until ℱ converges.</p>
        <p>The generation process aims to predict the
conditional probability distribution P(ŷ_t | ŷ_&lt;t, m, H) in an
auto-regressive manner, which means the decoder
generates the t-th word conditioned on all previously generated
words as well as the encoded representation.</p>
        <p>
          The ESConv dataset contains 1,300 long conversations and overall
38,350 utterances, with an average of 29.5 utterances in
each dialogue. For the data split, we followed the same
setting of previous work [
          <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
          ].
        </p>
        <sec id="sec-4-5-1">
          <title>5.2. Implementation Details</title>
          <p>
            Following the setting of the previous study [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], we
utilized the encoder and decoder of the pretrained BART3
provided by HuggingFace [33] to initialize the
parameters of the context encoder, the strategy predictor, and
the decoder, respectively. The number of layers in the
encoder and decoder are both 6. The dimension of the
hidden feature equals 512. To form a mini-batch, the
input sequence length is unified to 512. The hidden
dimension is 768. The category number N equals 8.
          </p>
          <p>The maximum memory storage n_max is set as 64, and the
concept filtering threshold equals 20. λ_1 and λ_2 are 0.3 and 0.1,
respectively. The batch
size is 16. We use the PPL metric on the validation set
to monitor the training progress. Empirically, it takes
around 15 epochs to reach peak performance. During
the generation stage, we use a maximum of 64 steps to
decode the responses. We adopt the AdamW [34] optimizer
with β_1 = 0.9 and β_2 = 0.999. The initial learning rate is
2e-5, and a linear learning rate scheduler with 100
warm-up steps is used to reduce the learning rate progressively.
For the framework, we use PyTorch [35] version 1.8.1
to implement the experiment code. All experiments
were conducted on an NVIDIA Tesla V100 32GB.</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>5.3. Evaluation Metrics</title>
          <p>For the comprehensive evaluation, we conducted both
automatic and human evaluations.</p>
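          <p>To make the n-gram metrics concrete, here is a toy BLEU-1 (unigram precision with a brevity penalty); actual evaluation should rely on standard toolkit implementations rather than this illustrative version:</p>

```python
# Toy BLEU-1: clipped unigram precision scaled by a brevity penalty.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped matches
    precision = overlap / len(cand)
    # the brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("i am sorry to hear that", "i am so sorry to hear about that")
```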
        </sec>
        <sec id="sec-4-5-3">
          <title>4.4. Model Training</title>
          <p>Ultimately, we combine all the loss functions for
optimizing our model as follows,
ℒ = ℒ_gen + λ_1ℒ_str + λ_2ℒ_aux,
(13)
where ℒ is the final training objective and λ_1 and λ_2 are
the non-negative hyperparameters used for balancing
the effect of each loss function on the entire training
process. The overall algorithm of the optimization is
briefly summarized in the training procedure.</p>
          <p>Automatic Evaluation. For the automatic evaluation,
we adopt several mainstream metrics commonly used
in dialogue response tasks: PPL (Perplexity), BLEU-{1,2,3,4}
(B-{1,2,3,4}) [36], ROUGE-L (R-L) [37], METEOR
(MT) [38], and CIDEr [39].</p>
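          <p>Under our reading of Eqn. (13), with λ_1 and λ_2 weighting the strategy prediction and auxiliary classification losses (0.3 and 0.1 per the implementation details), the combination is a one-liner; the loss values below are made-up floats:</p>

```python
# Joint objective (Eqn. 13): generation loss plus weighted strategy losses.
# The mapping of lambda_1/lambda_2 to the two auxiliary terms is our reading.

def total_loss(l_gen, l_str, l_aux, lam1=0.3, lam2=0.1):
    return l_gen + lam1 * l_str + lam2 * l_aux

loss = total_loss(l_gen=2.0, l_str=1.0, l_aux=0.5)
```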
          <p>5. Experiments</p>
          <p>5.1. Dataset</p>
          <p>
            We conducted experiments on the ESConv dataset [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Each sample in the dataset is a dialogue between a
help-seeker and a supporter. In addition to the context, it also
contains rich information, e.g., the situation that the
help-seeker is facing. It also provides the annotation of the
strategy category used in every supporter's response.
          </p>
          <p>Human Evaluation. Apart from the automatic
evaluation, we also consider the human evaluation, as it has
been reported [40] that automatic evaluation is
sometimes unreliable in generation tasks. We first randomly
select 70 testing samples for evaluation. Then, we
employ 2 volunteers to manually choose which response
outperforms the other one. Every sample is annotated
twice. For each case, the volunteers need to compare
the generated texts of different models according to the
following four dimensions:</p>
          <p>3https://huggingface.co/facebook/bart-base.</p>
          <p>[Case study examples. (a) Ground Truth (Self-disclosure): "I've dealt with the same problem
with my partner more than once. I love him very much and I found
him to be a 9 to my 5 as well." w/o-Mem (Reflection of Feelings): "I am sorry to hear that. I am sorry
you are feeling that way. I know how you feel." MODERN (Self-disclosure): "I can understand how that would be
difficult. I had a similar situation with my ex-boyfriend. I know
how hard it can be to let go of someone." (b) Ground Truth (Self-disclosure): "I believe there are special lamps
that you could get to help. I personally have been trying to get up a
little earlier in the morning and enjoy my coffee with the sunrise..." w/o-KG (Providing Suggestions): "I've been in that situation myself,
and I've found that it's very easy to get depressed." MODERN (Providing Suggestions): "I've found that taking a walk or
sitting down to write out a list of things that you'd like to do helps to
clear your mind."]</p>
          <p>(a)
is more fluent, correct, and coherent in grammar and
syntax; 2) Relevance: which response talks more relevantly
regarding current dialogue context; 3) Empathy: which
response is better to react with appropriate emotion
according to the user’s emotional state; 4) Information:
which response provides more suggestive information to
help solve the problem. To further control the quality of
the evaluation, we also invite an inspector to randomly
sample and double-check 10% rating results.</p>
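<p>The pairwise protocol above can be tallied mechanically. A small, hypothetical sketch (the data and helper names are illustrative, not from the paper) counting "Win"/"Tie"/"Lose" verdicts per dimension:</p>

```python
# Illustrative tally for the pairwise human evaluation: volunteers
# compare each response pair along four dimensions, and we count
# Win/Tie/Lose outcomes per dimension.

from collections import Counter

DIMENSIONS = ("Fluency", "Relevance", "Empathy", "Information")

def tally(judgments):
    """judgments: list of (dimension, verdict) pairs, verdict in Win/Tie/Lose."""
    counts = {d: Counter() for d in DIMENSIONS}
    for dim, verdict in judgments:
        counts[dim][verdict] += 1
    return counts

sample = [("Fluency", "Win"), ("Fluency", "Tie"),
          ("Information", "Win"), ("Information", "Win")]
result = tally(sample)
print(result["Information"]["Win"])  # 2
```

<p>Aggregating per dimension, rather than per sample, is what allows the later comparison of "Win" counts across Fluency, Relevance, Empathy, and Information.</p>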
        </sec>
        <sec id="sec-4-5-4">
          <title>5.4. Model Comparison &amp; Analysis</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <p>To validate the effectiveness of our model, we compare it with several state-of-the-art baselines:
• MoEL [22]. This method adopts multiple decoders as different listeners for different emotions.</p>
        <p>
          The outputs of decoders are softly combined to
generate the response.
• MIME [21]. This model shares the same
architecture as MoEL and extends it to mimic the
speaker’s emotion.
• DialoGPT-Joint [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This model is built on a
pre-trained dialog agent DialoGPT [41]. It first
predicts a strategy and prepends a special token
before the response sentence to control the
generation under that strategy.
• BlenderBot-Joint [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This model adopts the
same strategy prediction and generation scheme
as DialoGPT. Different from the former one, it is
built on a pre-trained conversational response
generation model named BlenderBot [42].
• MISC [24]. This model also adopts BlenderBot as the backbone. It leverages common sense knowledge to enhance the understanding of the speaker's emotional state, and the response is generated conditioned on a mixture of strategy distribution.
        </p>
        <p>
          • GLHG [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This model has a global-to-local hierarchical graph structure. It models the global cause and the local intention of the speaker to provide more supportive responses.
• PoKE [23]. This work utilizes a Conditional Variational Autoencoder [43] to model the mixed strategy.
• FADO [25]. This work devises a dual-level feedback strategy selector to encourage or penalize strategies during the strategy selection process.
• MultiESC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This work proposes lookahead heuristics to estimate the future strategy and captures users' subtle emotional expressions with the NRC VAD lexicon [44] for user state modeling.
        </p>
        <p>
          5.5. Ablation Study We compare the original MODERN model with the following derivatives to demonstrate that all the designed modules are essential for the ESConv task. 1) w/o-ℒ_str. To show the benefit of the constraint for strategy prediction, we removed the corresponding loss function by setting λ1 = 0 in Eqn. (13). 2) w/o-ℒ_cls. To show the effect of the auxiliary strategy classification task, we removed the corresponding loss function by setting λ2 = 0 in Eqn. (13). 3) w/o-Mem. In this derivative, we disabled the memory module, which stores pattern representations of different strategies. 4) w/o-Emo. In this derivative, we removed the change-aware emotion detection module. And 5) w/o-KG. We discarded the context-related concept reasoning and selection component.
        </p>
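<p>The five derivatives differ from the full model by one switch each. A schematic, hypothetical configuration (the flag names are illustrative; the paper does not publish such a config) could look like:</p>

```python
# Schematic view of how the ablated derivatives might be configured:
# each variant zeroes one loss weight or disables one module.

BASE = {"lambda1": 1.0, "lambda2": 1.0,
        "use_memory": True, "use_emotion": True, "use_kg": True}

ABLATIONS = {
    "w/o-L_str": dict(BASE, lambda1=0.0),     # drop strategy-prediction loss
    "w/o-L_cls": dict(BASE, lambda2=0.0),     # drop strategy-classification loss
    "w/o-Mem":   dict(BASE, use_memory=False),  # disable memory module
    "w/o-Emo":   dict(BASE, use_emotion=False), # disable emotion detection
    "w/o-KG":    dict(BASE, use_kg=False),      # disable concept reasoning
}

print(ABLATIONS["w/o-Mem"]["use_memory"])  # False
```

<p>Expressing each derivative as a single-flag change from the base configuration keeps every ablation comparable to the full model.</p>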
        <p>Automatic Evaluation We compared our model with the above baselines using automatic metrics, and the results are reported in Table 1. As we can see, 1) MODERN outperforms the baselines in most metrics, which is a powerful proof of the effectiveness of the proposed method. 2) The models with the BART backbone (MultiESC and MODERN) surpass the baselines with the BlenderBot [42] backbone (BlenderBot-Joint, MISC, GLHG, FADO, and PoKE) across most of the metrics, despite the latter being pre-trained on empathy-related data. One possible explanation is that the ESConv task requires sophisticated linguistic knowledge (e.g., correct grammar and wording appropriate to the current situation) in addition to empathetic ability. 3) Our MODERN consistently exceeds MultiESC with the same BART backbone. This suggests that the BART model with large-scale pretraining still requires strategy pattern information and knowledge (emotion and concepts) to further facilitate supportive response generation.</p>
        <p>We provide the ablation study results on the ESConv dataset in Table 3 in terms of all metrics. From this table, we have the following observations: 1) Our MODERN consistently outperforms w/o-Mem, especially on the BLEU metrics (B-{1,2,3,4}), which suggests that the memory-enhanced strategy modeling module can provide sufficient linguistic pattern references and hence boost the performance of generating responses in accordance with specific strategy categories. 2) MODERN exceeds w/o-ℒ_str across all metrics. This verifies that it is indispensable to constrain the model to predict and respond under proper strategies. 3) w/o-ℒ_cls obtains a slightly worse result than MODERN, which demonstrates that the strategy classification auxiliary task indeed helps guide the pattern representation learning. And 4) w/o-Emo and w/o-KG both perform worse than MODERN, which demonstrates the importance of change-aware emotion and context-related concepts. Notably, w/o-KG surpasses w/o-Emo. One possible explanation is that being aware of the dynamic emotional changes during the conversation facilitates the model to provide empathy and emotional support accordingly.</p>
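<p>Since the ablation discussion leans on the BLEU metrics [36], the unigram core of BLEU-1 can be sketched as follows. This is a minimal illustration of clipped unigram precision only; the full metric additionally clips over multiple references, uses higher-order n-grams, and applies a brevity penalty:</p>

```python
# Minimal sketch of the clipped unigram precision at the core of BLEU-1.

from collections import Counter

def bleu1_precision(candidate, reference):
    """Clipped unigram precision of a candidate against one reference."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Each candidate word counts at most as often as it occurs in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

p = bleu1_precision("i am sorry to hear that", "i am so sorry to hear that")
print(round(p, 2))  # 1.0
```

<p>Clipping prevents a candidate from inflating its score by repeating a reference word more often than the reference contains it.</p>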
        <p>Human Evaluation For human evaluation, we report the comparison results between our model and the two best baselines (FADO and MultiESC) in Table 2. In particular, for each pair of model comparisons and each metric, we show the number of samples where our model achieves better (denoted as "Win"), equal (denoted as "Tie"), and worse (denoted as "Lose") performance compared with the baselines. As seen, MODERN outperforms all baselines across the different evaluation metrics, as the number of "Win" cases is always significantly larger than that of "Lose" cases in each pair of model comparisons, which is consistent with the results in Table 1. In addition, the number of "Win" cases is the largest for the Information metric compared with the other metrics, which demonstrates that integrating context-related concepts can supply meaningful information for emotional support.</p>
        <sec id="sec-4-6-1">
          <title>5.6. Case Study</title>
          <p>We illustrate several conversations from the test set in Figure 3 to give an intuitive understanding of our model. We show two samples and compare the responses generated by MODERN and two derivatives, w/o-Mem and w/o-KG. As can be seen in case (a), MODERN manages to respond with the strategy of Self-disclosure and generates a contextually appropriate response. Being equipped with a memory-enhanced strategy modeling module, MODERN shares a similar experience closely related to the seeker's problem, a relationship issue, whereas the w/o-Mem model generates a plain and monotonous response that is not very relevant to the user's current issue. The other case (b) demonstrates how MODERN benefits from external knowledge. Based on the situation that the seeker mentions feeling depressed, MODERN leverages the context-related concepts and effectively associates this emotional status with practical suggestions (taking a walk or sitting down to write...). Without the relevant knowledge, the response generated by the w/o-KG derivative is relatively vague and less specific, and thus insufficient to benefit the user's situation.</p>
          <p>Ethical Considerations The dataset used in our work is a publicly available dataset that has been widely used in the field of emotional support conversation. Sensitive and personally identifiable information was filtered during the construction of the dataset. Our model focuses on the informal emotional support provided in friends' daily chats and does not provide professional mental health diagnosis or counseling services. The use of this model should be avoided for patients with serious mental illnesses, such as in self-harm-related conversations, in order to prevent triggering serious consequences.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>In this paper, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation, dubbed MODERN, which can perceive fine-grained emotional changes in the conversation, utilize concepts from the knowledge graph to facilitate generating responses with practical suggestions, and model concrete strategy semantic patterns with a memory bank mechanism. Both automatic and human evaluation results show that our model surpasses the state-of-the-art methods in emotional support conversation. In addition, the ablation study demonstrates the effectiveness of each component of our model.</p>
    </sec>
    <sec id="sec-6">
      <title>Limitations</title>
      <p>The ESConv task requires the dialogue agent to reveal some information about itself. For example, one of the strategies, called Self-disclosure, expects the agent to cite its own experience. However, in our experiments, we observed that the current model often struggles to maintain a consistent personality. We speculate that this may be because the supporter role in the full training sample is played by multiple individuals, so there is no uniform character experience and story, which leads to inconsistent personal experiences during the conversation. We believe that how to make the dialogue agent show coherent and unified personal information and experiences in the ESConv task deserves the attention of future work.
</p>
      <p>References
[7] Y. Ding, J. Liu, X. Zhang, Z. Yang, Dynamic tracking of state anxiety via multi-modal data and machine learning, Frontiers in Psychiatry 13 (2022).
[8] J. Greene, B. Burleson, Handbook of Communication and Social Interaction Skills, American Psychological Association, 2003.
[9] A. C. High, K. R. Steuber, An examination of support (in)adequacy: Types, sources, and consequences of social support among infertile women, Communication Monographs 81 (2014).
[10] B. Wei, S. Lu, L. Mou, H. Zhou, P. Poupart, G. Li, Z. Jin, Why do neural dialog systems generate short and meaningless replies? A comparison between dialog and translation, in: International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 7290–7294.
[11] Y. Liu, W. Bi, J. Gao, X. Liu, J. Yao, S. Shi, Towards less generic responses in neural conversation models: A statistical re-weighting method, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 2769–2774. URL: https://aclanthology.org/D18-1297. doi:10.18653/v1/D18-1297.
[12] C. E. Hill, Helping skills: Facilitating exploration, insight, and action, American Psychological Association, 2009.
[13] C. Zheng, Y. Liu, W. Chen, Y. Leng, M. Huang, CoMAE: A multi-factor hierarchical framework for empathetic response generation, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, Association for Computational Linguistics, 2021, pp. 813–824.
[14] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2020, pp. 7871–7880.
[15] H. Zhou, M. Huang, T. Zhang, X. Zhu, B. Liu, Emotional chatting machine: Emotional conversation generation with internal and external memory, in: Proceedings of the AAAI Conference on Artificial Intelligence, the Innovative Applications of Artificial Intelligence, and the AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI Press, 2018, pp. 730–739.
[16] H. Rashkin, E. M. Smith, M. Li, Y. Boureau, Towards empathetic open-domain conversation models: A new benchmark and dataset, in: Proceedings of the Conference of the Association for Computational Linguistics, ACL, 2019, pp. 5370–5381.
[17] L. Shen, Y. Feng, CDL: Curriculum dual learning for emotion-controllable response generation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2020, pp. 556–566.
[18] M. Y. Chen, S. Li, Y. Yang, EmpHi: Generating empathetic responses with human-like intents, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2022, pp. 1063–1074.
[19] W. Kim, Y. Ahn, D. Kim, K. Lee, Emp-RFT: Empathetic response generation via recognizing feature transitions between utterances, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2022, pp. 4118–4128.
[20] Q. Li, H. Chen, Z. Ren, P. Ren, Z. Tu, Z. Chen, EmpDG: Multi-resolution interactive empathetic dialogue generation, in: Proceedings of the International Conference on Computational Linguistics, International Committee on Computational Linguistics, 2020, pp. 4454–4466.
[21] N. Majumder, P. Hong, S. Peng, J. Lu, D. Ghosal, A. F. Gelbukh, R. Mihalcea, S. Poria, MIME: Mimicking emotions for empathetic response generation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2020, pp. 8968–8979.
[22] Z. Lin, A. Madotto, J. Shin, P. Xu, P. Fung, MoEL: Mixture of empathetic listeners, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, ACL, 2019, pp. 121–132.
[23] X. Xu, X. Meng, Y. Wang, PoKE: Prior knowledge enhanced emotional support conversation with latent variable, CoRR abs/2210.12640 (2022).
[24] Q. Tu, Y. Li, J. Cui, B. Wang, J. Wen, R. Yan, MISC: A mixed strategy-aware model integrating COMET for emotional support conversation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2022, pp. 308–319.
[25] W. Peng, Z. Qin, Y. Hu, Y. Xie, Y. Li, FADO: Feedback-aware double controlling network for emotional support conversation, Knowledge-Based Systems 264 (2023) 110340.
[26] W. J. Reynolds, B. Scott, Empathy: A crucial component of the helping relationship, Journal of Psychiatric and Mental Health Nursing 6 (1999) 363–370.
[27] B. Liu, S. S. Sundar, Should machines express sympathy and empathy? Experiments with a health advice chatbot, Cyberpsychology, Behavior, and Social Networking 21 (2018) 625–636.
[28] T. Parkin, A. de Looy, P. Farrand, Greater professional empathy leads to higher agreement about decisions made in the consultation, Patient Education and Counseling 96 (2014) 144–150.
[29] Q. Li, P. Li, Z. Ren, P. Ren, Z. Chen, Knowledge bridging for empathetic dialogue generation, in: The Conference on Artificial Intelligence, Conference on Innovative Applications of Artificial Intelligence, The Symposium on Educational Advances in Artificial Intelligence, AAAI Press, 2022, pp. 10993–11001.
[30] Y. Liu, W. Maier, W. Minker, S. Ultes, Empathetic dialogue generation with pre-trained RoBERTa-GPT2 and external knowledge, in: Conversational AI for Natural Human-Centric Interaction: International Workshop on Spoken Dialogue System Technology, volume 943 of Lecture Notes in Electrical Engineering, Springer, 2021, pp. 67–81.
[31] L. Jing, X. Song, X. Lin, Z. Zhao, W. Zhou, L. Nie, Stylized data-to-text generation: A case study in the e-commerce domain, ACM Trans. Inf. Syst. (2023).
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, 2017, pp. 5998–6008.
[33] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019).
[34] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, CoRR abs/1711.05101 (2017).
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems, 2019, pp. 8024–8035.
[36] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2002, pp. 311–318.
[37] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[38] A. Lavie, A. Agarwal, METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 228–231.
[39] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 4566–4575.
[40] N. Schluter, The limits of automatic summarisation according to ROUGE, in: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, ACL, 2017, pp. 41–45.
[41] Y. Zhang, S. Sun, M. Galley, Y.-C. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, B. Dolan, DialoGPT: Large-scale generative pre-training for conversational response generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, 2020, pp. 270–278.
[42] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, E. M. Smith, Y.-L. Boureau, J. Weston, Recipes for building an open-domain chatbot, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, 2021, pp. 300–325.
[43] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, 2015, pp. 3483–3491.
[44] S. M. Mohammad, Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2018, pp. 174–184.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Faizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zadran</surname>
          </string-name>
          ,
          <article-title>Emotional support received moderates academic stress and mental well-being in a sample of afghan university students amid covid-</article-title>
          19,
          <source>International Journal of Social Psychiatry</source>
          <volume>68</volume>
          (
          <year>2022</year>
          )
          <fpage>1748</fpage>
          -
          <lpage>1755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , F.-p.
          <article-title>Chen, Relationships of family emotional support and negative family interactions with the quality of life among chinese people with mental illness and the mediating efect of internalized stigma</article-title>
          ,
          <source>Psychiatric Quarterly</source>
          <volume>92</volume>
          (
          <year>2021</year>
          )
          <fpage>375</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Demasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sabour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Towards emotional support dialog systems</article-title>
          ,
          <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing</source>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>3469</fpage>
          -
          <lpage>3483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation</article-title>
          ,
          <source>in: Proceedings of the International Joint Conference on Artificial Intelligence, ijcai.org</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4324</fpage>
          -
          <lpage>4330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Improving multi-turn emotional support dialogue generation with lookahead strategy planning</article-title>
          ,
          <source>in: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>3014</fpage>
          -
          <lpage>3026</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <article-title>Conceptnet 5.5: An open multilingual graph of general knowledge</article-title>
          ,
          <source>in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          , AAAI Press,
          <year>2017</year>
          , pp.
          <fpage>4444</fpage>
          -
          <lpage>4451</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>