<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Support Conversation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mengzhao Jia</string-name>
          <email>jiamengzhao98@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qianglong Chen</string-name>
          <email>chenqianglong@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liqiang Jing</string-name>
          <email>jingliqiang6@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dawei Fu</string-name>
          <email>dawei.fdw@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renyu Li</string-name>
          <email>renyu.rl@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alibaba Group</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shandong University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Texas at Dallas</institution>
          ,
          <addr-line>TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Zhejiang University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The prevalence of mental disorders has become a significant issue, leading to increased focus on Emotional Support Conversation as an effective supplement for mental health support. Existing methods have achieved compelling results; however, they still face three challenges: 1) variability of emotions, 2) practicality of the response, and 3) intricate strategy modeling. To address these challenges, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation (MODERN). Specifically, we first devise a knowledge-enriched dialogue context encoding to perceive the dynamic emotion change across different periods of the conversation for coherent user state modeling, and select context-related concepts from ConceptNet for practical response generation. Thereafter, we implement a novel memory-enhanced strategy modeling module to model the semantic patterns behind the strategy categories. Extensive experiments on a widely used large-scale dataset verify the superiority of our model over cutting-edge baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Mental disorders</kwd>
        <kwd>Emotional Support Conversation</kwd>
        <kwd>Mental health support</kwd>
        <kwd>ConceptNet</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Mental disorders are known for their high burden:
more than 50% of adults experience a mental illness
or disorder at some point in their lives, yet despite this
high prevalence, only one in five patients receives
professional treatment. Recent studies have shown that an
effective mental health intervention method is the
provision of emotional support.</p>
      <sec id="sec-2-1">
        <p>
          Emotional Support Conversation (ESConv), as defined
          by Liu et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], has garnered substantial attention in
recent years. It has emerged as a promising
alternative strategy for mental health intervention, paving
the way for the development of neural dialogue systems
designed to provide support for those in need.
        </p>
        <p>
          The ESConv takes place between a help-seeker (user)
and a supporter (dialogue model) in a multi-turn
manner. It requires the dialogue model to employ a range
of supportive strategies effectively, easing the emotional
distress of the users and helping them overcome the
challenges they face. Prior research primarily concentrated
on two aspects. The first aimed to enhance the model's
comprehension of the contextual semantics in the
conversations, such as the user's situation, emotions, and
intentions. An example of these efforts is the work of Peng
et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], who designed a hierarchical graph network to
capture the overall emotional problem cause and specific
user intentions. The second aspect focused on predicting
the dialogue strategy accurately and responding based
on the predicted strategy category. For example, Cheng
et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] employed lookahead heuristics for dialogue
strategy planning.
        </p>
        <p>Machine Learning for Cognitive and Mental Health Workshop
(ML4CMH), AAAI 2024, Vancouver, BC, Canada.
∗Corresponding author: R. Li.</p>
        <p>Despite the success of existing studies, this task is
nontrivial due to the following three challenges.</p>
        <p>• Variability of emotions. As the conversation
progresses, the user's emotional state evolves
subtly and constantly. Accurately recognizing this
emotional change is indispensable to
understanding the user's real-time state and thus responding
empathically [7, 8, 9]. How to model the dynamic
emotional change during the dialogue process is
the first challenge.
• Practicality of the response. In the absence of
explicit cues, neural dialogue systems are inclined
to make generic responses [10, 11]. As shown in
Figure 1, generic responses fail to
provide personalized and suitable suggestions for
the user's specific concerns. To resolve this issue,
introducing context-related concepts (doctor and
recover) can promote generating more
meaningful and actionable suggestions for specific
situations. As a result, the integration of appropriate
concepts poses a non-trivial challenge in
generating practical responses.</p>
        <p>
          Our contributions can be summarized as follows: 1)
We first analyze the current challenges of the ESConv
task, and accordingly propose a novel
knowledge-enhanced Memory mODEl for emotional suppoRt
coNversation, named MODERN, which can model complex
supportive strategies as well as utilize emotional
knowledge and context-related concepts to perceive the
variability of emotions and provide practical support
advice. 2) We propose a memory-enhanced strategy
modeling module, where a unique memory bank is designed
to model intricate strategy patterns, and an auxiliary
strategy classification task is introduced to distill the
strategy pattern information. 3) We present a
thorough validation and evaluation of our model,
providing an in-depth analysis of the results and a
comparison with other models. The extensive experiments
on the ESConv dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] demonstrate that MODERN
achieves state-of-the-art performance under both
automatic and human evaluations. The code is available at
https://projs2release.wixsite.com/modern.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>2.1. Emotional and Empathetic Dialogue Systems</title>
        <p>[Figure 1: An example emotional support conversation about a family member sick with COVID. Generic responses (e.g., "Don't worry, she will be fine." and "I'm sorry to hear about it.") are contrasted with a response enriched by practical knowledge information ("From my knowledge, people usually recover from COVID after 2 to 3 weeks; you might consider having a doctor see your mother if she is still struggling."), drawing on context-related concepts such as hospital, doctor, worried, uneasy, disease, medicine, recover, and improve.]</p>
        <p>• Intricate strategy modeling. Dialogue strategy,
as a kind of linguistic pattern, has been reported
to be a highly complex concept encompassing
various intricate linguistic features [12, 13]. Previous
work resorted to a single vector (a category
indicator) for strategy representation, which is
insufficient to fully represent the complex strategy
pattern information. Therefore, how to model
strategy information sufficiently is the third
challenge.</p>
        <p>With the popularity and growing success of dialogue
systems, many researchers have recently
endeavored to empower such systems to reply with a specific and
proper emotion, therefore forming a more human-like
conversation. Particularly, two research directions have attracted</p>
        <p>
          researchers' interest, namely the emotional [15] and
empathetic [16] response generation. The former direction
expects the dialogue agent to respond with a given
emotion [17, 18, 19, 20], while the latter requires the dialogue
system to actively detect and understand the users'
emotions and then respond with an appropriate emotion [21].
For example, Lin et al. utilized multiple decoders as
different listeners to react to different emotions and then softly
combined the output states of the decoders
based on the recognized user's emotion. Nevertheless,
unlike the above directions, the ESConv task concentrates
on alleviating users' negative emotion intensity and
providing supportive instructions to help them overcome
struggling situations.
        </p>
        <p>
          To overcome these challenges, as shown in Figure 2, we
introduce a novel knowledge-enhanced Memory mODEl
for emotional suppoRt coNversation, dubbed MODERN.
In particular, MODERN adopts BART [14] as its
backbone and consists of the knowledge-enriched dialogue
context encoding module, the memory-enhanced
strategy modeling module, and the response decoding module.
To capture the emotional change as the conversation
progresses, the first module detects the emotions for all
utterances and explicitly injects them into the dialogue context
as a kind of emotional knowledge, thus understanding
the user's status coherently. In addition, this module also
introduces the concepts reasoning and selection
component to acquire valid context-related concepts from a
knowledge graph called ConceptNet [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and incorporates
them into the dialogue context to fulfill meaningful and
practical suggestion generation. Moreover, in contrast
to existing studies that depend on simplistic indicators
to represent strategy categories, the memory-enhanced
strategy modeling module learns strategy patterns via a
strategy-specific memory bank. In this way, it can detect
and mimic the intricate patterns in human emotional
support strategies. Finally, the third module aims to generate
the target response with the BART decoder.
        </p>
        <p>2.2. Emotional Support Conversation</p>
        <p>
          As an emerging research task, emotional support
conversation has gradually attracted intense attention in recent
years. Existing works mostly focus on two aspects. The
first is to understand the complicated user emotions and
intentions.
        </p>
        <p>[Figure 2: The architecture of MODERN. The memory-enhanced strategy modeling component stores pattern representations of training-set responses in a strategy-specific memory bank, and a Transformer decoder (self-attention, cross-attention, and feed-forward layers, each followed by Add &amp; Norm) generates the response from [BOS] to [EOS], e.g., "I would suggest talking to the professor…".]</p>
        <p>
          To this end, researchers explored context semantic relationships [
          <xref ref-type="bibr" rid="ref4">23, 4</xref>
          ], commonsense knowledge [
          <xref ref-type="bibr" rid="ref4">24, 4</xref>
          ], or emotion causes [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to better
capture and understand the emotions and intentions of
users. The other trend in addressing the task is to
predict the strategy category accurately so as to respond in
accordance with it [25, 23]. For instance, Tu et al. first
proposed to predict a strategy probability distribution
and generate the response guided by a mixture of
multiple strategies. Despite their remarkable achievements,
existing work still faces three challenges: intricate
strategy modeling, variability of emotions, and practicality of
the responses.
        </p>
        <p>3. Task Formulation</p>
        <p>The goal of the
ESConv task is to learn a model ℱ that can generate a
supportive response R̂ referring to the input context C
and situation S as follows,
R̂ = ℱ(C, S | Θ),
(1)
where Θ refers to the set of to-be-learned parameters of
the model ℱ. Notably, the ground-truth strategy can be utilized
in the model training stage but is not available and needs
to be predicted in the inference phase. For brevity, we
temporarily omit the superscript of the i-th sample in
the rest of this paper.</p>
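        <p>Viewed programmatically, Eqn. (1) is simply a mapping from a dialogue context and a situation description to a response. The following placeholder sketch (all names hypothetical; the body is a stub rather than a real generator) makes the interface concrete:</p>

```python
# Eqn. (1): R_hat = F(C, S | Theta). A stub standing in for the trained model;
# a real implementation would decode a response token by token.

def esconv_model(context, situation, theta=None):
    """Map a dialogue context C and seeker situation S to a response R_hat."""
    return "I am here for you."  # placeholder supportive response

r_hat = esconv_model(context=["I lost my job."], situation="job crisis")
```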
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>In this section, we detail the proposed model MODERN,
which comprises three main components: knowledge-enriched
dialogue context encoding, memory-enhanced
strategy modeling, and response decoding, as demonstrated
in Figure 2. For concise mathematical expression, we first declare
some notations in this paper. We use bold capital letters
(X) and bold lowercase letters (x) to represent matrices
and vectors, respectively. We adopt non-bold letters to
denote scalars, and Greek letters refer to hyperparameters.
All the vectors, if not clarified, are in column form.</p>
      <sec id="sec-4-1">
        <title>Problem Setting</title>
        <p>In the setting of emotional support conversation, the
dialogue is participated in by a help-seeker and a supporter,
and the latter tries to comfort and support the help-seeker
to lower his or her emotional intensity level.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Knowledge-enriched Dialogue Context Encoding</title>
          <p>Suppose we have a training
dataset D = {d_1, d_2, … , d_N} composed of N samples. Each
sample d = {S, C, g, R} includes a seeker's situation S,
a dialogue context C, a support strategy g, and a target
response R. Therein, C contains a sequence of
history utterances between the user and the supporter. The
target is to generate a response based on the dialogue
history. Besides, the supporter is required to select one of
the candidate strategies and respond accordingly.</p>
          <p>In this section, we first utilize an emotion detector to
recognize fine-grained emotions of utterances as
emotional knowledge for capturing emotional change. In
addition, we select related concepts from the ConceptNet
knowledge graph for meaningful and practical
suggestion generation.</p>
          <p>4.1.1. Change-aware Emotion Detection</p>
          <p>Psychiatric and mental health studies have proved that
empathy is essential to emotionally helping
relationships [26, 27, 28], and fine-grained emotional
information is one of the key factors in enhancing
empathetic ability [29]. Apart from static emotional
signals, dynamic emotional changes during the conversation
are also beneficial. Concretely, change-aware
emotion information enables the model to understand
the user's status coherently. Inspired by this, we devise
a component to identify the user's fine-grained emotions and perceive
the dynamic changes of emotions in the dialogue context.</p>
          <p>Specifically, we start by obtaining the fine-grained
emotion via an off-the-shelf pretrained emotion
detector2, which can recognize up to 28 different emotional
categories, and apply the detector to every utterance in
the dialogue context.</p>
          <p>We first identify all the concepts in
ConceptNet that are mentioned in the dialogue context
and remove the top-frequent concepts in the training
set, because these words are usually too general to provide
valid suggestions for a specific situation, such as help,
thing, and feeling. In this way, we derive the anchor
concepts, denoted as {c_1, ⋯ , c_N}.</p>
          <p>Thereafter, we leverage these concepts as
anchors to reason about related concepts. Specifically, for
each anchor concept c, we retrieve all its one-hop
neighboring concept-relation pairs from
ConceptNet.</p>
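          <p>The concept selection steps above can be sketched as follows; the mini-graph, the set of over-frequent words, and all names are hypothetical stand-ins for the real ConceptNet data:</p>

```python
# Toy sketch of concept reasoning and selection: find anchor concepts in the
# context, drop over-generic ones, then expand one-hop neighbors while
# filtering the excluded relations.

EXCLUDED = {"Antonym", "ExternalURL", "NotCapableOf"}
TOP_FREQUENT = {"thing", "help", "feeling"}  # too generic to be useful

# (relation, neighbor) pairs per concept -- a hypothetical mini-graph.
GRAPH = {
    "exam": [("RelatedTo", "stress"), ("Antonym", "holiday")],
    "doctor": [("CapableOf", "treat"), ("ExternalURL", "http://example")],
}

def select_concepts(context_tokens, graph=GRAPH):
    # 1) anchors: graph concepts mentioned in the context, minus generic ones
    anchors = [t for t in context_tokens if t in graph and t not in TOP_FREQUENT]
    # 2) one-hop expansion with excluded relations filtered out
    related = []
    for anchor in anchors:
        related += [h for (r, h) in graph[anchor] if r not in EXCLUDED]
    return anchors, related

anchors, related = select_concepts(["my", "exam", "doctor", "help"])
```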
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>Formally, the emotion recognition can be written as
e_i = Emo(u_i),
0 ≤ i ≤ N_u,
(2)
where e_i is the predicted emotional category word
representing the detected emotion in the utterance u_i. Thereafter, we
directly inject the natural language form of the emotion
category words into the dialogue context as additional
emotion knowledge. This practice aligns closely with
the input format of the pretrained BART model.
Moreover, it also avoids introducing unnecessary parameters
that would interfere with the generative model learning.
Empirically, we concatenate the emotional category words
into the sequence of dialogue context tokens, denoted as
C = [u_1; e_1; SEP; … u_i; e_i; SEP; … u_{N_u}; e_{N_u}]. In this way, the
dynamic emotional changes corresponding to the
dialogue progress can be coherently exploited by the emotional
support model.</p>
        <p>4.1.2. Context-related Concepts Reasoning and Selection</p>
        <p>Considering that ConceptNet involves abundant general
human knowledge, which plays an important role in
understanding human situations and associating them with
practical suggestions, we select potentially useful
context-related concepts to enrich the model to generate
responses with high informativeness. For instance, the
commonsense knowledge database can easily relate
failing an exam to academic stress. In terms of daily human
activities, this knowledge serves as a guide for dealing
with daily affairs and problems. Such knowledge is
useful for providing advice and guidance in the emotional
support system. Therefore, we mine and associate
commonsense knowledge of the data to provide potentially
useful information for generating instructive responses.</p>
        <p>Concretely, the ConceptNet knowledge graph involves
3.1 million concepts and 38 million relations, which can
be used to mine the underlying concept information in
the dialogue context.</p>
        <p>Mathematically, let 𝒩(c) be the set of
neighboring concept-relation pairs of the anchor concept
c. Then the context-related concept-relation pairs
can be represented as {𝒩(c_1), 𝒩(c_2), ⋯ , 𝒩(c_N)}, where
each 𝒩(c_i) is a set of retrieved
&lt;relation, concept&gt; pairs. Following Liu et al. [30], to
avoid introducing extraneous concepts, we filter out
concepts with the excluded relations Antonym, ExternalURL, and
NotCapableOf. Finally, we concatenate the concepts in
the filtered pairs and deem them as the context-related
concepts K.</p>
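        <p>The emotion-knowledge injection of Section 4.1.1 (concatenating each utterance with its detected emotion word and a separator) can be sketched as below. The lexicon-based detector is a stub; the paper uses a pretrained 28-category classifier:</p>

```python
# Sketch of the knowledge-enriched context construction
# C = [u_1; e_1; SEP; ...]; names and the tiny lexicon are illustrative.

SEP = "[SEP]"

def detect_emotion(utterance):
    """Stand-in for the off-the-shelf emotion detector Emo(u)."""
    lexicon = {"worried": "fear", "stressful": "nervousness", "better": "relief"}
    for cue, emotion in lexicon.items():
        if cue in utterance.lower():
            return emotion
    return "neutral"

def build_context(utterances):
    """Inject the predicted emotion word after each utterance so the encoder
    sees the emotional trajectory of the whole conversation."""
    pieces = []
    for u in utterances:
        pieces.extend([u, detect_emotion(u), SEP])
    return " ".join(pieces[:-1])  # drop the trailing separator

context = build_context(["I am worried about my mother.", "She is doing better now."])
```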
      </sec>
      <sec id="sec-4-3">
      <title>4.1.3. Knowledge-enriched Dialogue Context Embedding</title>
        <p>To acquire the representation of the dialogue context
and corresponding knowledge, we first encode the
dialogue context sequence (user situation S, emotion-aware
dialogue context C, and concepts K) with a Transformer-based
encoder as follows,
H = Enc([S; C; K]),
(3)
where H ∈ ℝ^{L×D} denotes the hidden contextual
representation, and L and D refer to the number of tokens in [S; C; K]
and the hidden dimension, respectively.</p>
        <p>2https://huggingface.co/arpanghoshal/EmoRoBERTa.</p>
        <sec id="sec-4-3-1">
          <title>4.2. Memory-enhanced Strategy Modeling</title>
          <p>During the conversation, the supporter adopts different
strategies for different purposes, ultimately achieving
the goal of reducing the intensity of the user's negative
emotions. For example, using the Question strategy helps
the supporter to explore the concrete situation faced by
the user, while the use of the Reflection of Feelings
strategy conveys the supporter's understanding of the user's
current emotions. Existing work constrains the model
to responding under a strategy category by simply
providing a single vector indicating the strategy's name or
description. However, the semantic patterns of strategies
are highly complex, and the name or description is not able to
capture the specific linguistic patterns (expression
manner, wording, and phrasing) of the strategy. Therefore,
inspired by [31], we propose to disentangle the strategy
patterns from same-strategy responses to provide
more specific guidance for the strategy-constrained
generation.</p>
          <p>4.2.1. Strategy Pattern Modeling</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <p>We first acquire the strategy pattern representation of each
response R in the training set via a strategy pattern
extractor Enc as follows,
r = MaxPooling(Enc(R)),
(4)
where r ∈ ℝ^D denotes the strategy pattern representation.
Meanwhile, in order to accurately capture the strategy
pattern information and avoid irrelevant disturbance, we
design an auxiliary strategy classification task to guide
the extractor to map more strategy-related information
into the pattern representation. The auxiliary objective
ℒ_aux is derived by the following loss function,
ℒ_aux = − log P(g | r),
(5)
where g denotes the ground-truth strategy category.</p>
        <p>4.2.2. Strategy-specific Memory Bank</p>
        <p>To utilize detailed and ample strategy pattern information,
instead of using a single representation vector, we
devise a memory bank mechanism to store multiple
pattern representations according to their strategy
categories. In particular, we denote the memory bank as
ℳ = {M_1, … , M_N}, in which M_k ∈ ℝ^{n_k×D} is a matrix of
pattern representations corresponding to the k-th
strategy category, and N is the total number of strategy
categories. n_k is 0 at the initial training step. As the training
progresses, n_k continues to increase until the maximum
threshold n_max is reached. In particular, we store the pattern
representation of the k-th strategy category into the
corresponding M_k by concatenation as follows,
M_k ⟵ [M_k; r],
(6)
where r denotes a representation belonging to the k-th
strategy category and [⋅; ⋅] refers to the concatenation
operation. As the representations are optimized along
with the classification training process, we dynamically
update M_k in a first-in-first-out manner as follows,
M_k = M_k[n_k − n_max ∶ n_k] if n_k &gt; n_max, and M_k otherwise,
(7)
where n_max and n_k denote the maximum storage limit
and the current storage volume of each memory matrix,
respectively. The algorithm of the memory storing and
updating operation is summarized in the appendix.</p>
        <p>In order to use the strategy pattern information in the
memory bank, the model requires selecting a proper
strategy category based on the dialogue context. To achieve
this, we leverage a strategy predictor, which aims to
capture information relevant to strategy decisions in the
context. The strategy predictor is composed of a Transformer-based
encoder and a classification module. The encoder
first encodes the dialogue context into a strategy prediction
representation. It is worth noting that we adopt
independent representations for strategy prediction and response
generation, considering the fact that jointly optimal
solutions may not always exist for different tasks.
Subsequently, the classification module maps the vector to
an N-dimensional vector, which is regarded as the
probability distribution over the N strategy types. Formally, the
strategy prediction can be written as follows,
s = MaxPooling(Enc(C)),
ĝ = argmax(σ(MLP(s))),
(8)
where Enc is a Transformer-based encoder. The strategy
prediction representation is denoted as s ∈ ℝ^D. MLP
and σ(⋅) are a multi-layer perceptron and the Sigmoid
function, respectively. The argmax operation is used to
obtain the predicted strategy category ĝ. We use the
following objective to optimize the strategy prediction
task,
ℒ_str = − log P(g | s).
(9)</p>
        <p>4.2.4. Memory-enhanced Encoding</p>
        <p>After predicting the strategy category ĝ, instead of
directly using ĝ as an indicator to constrain the response
generation, we integrate the corresponding
memory bank matrix M_ĝ and the context
representation, so as to fully exploit the abundant pattern
information of the particular strategy.
Empirically, we fuse the matrix and the context
representation with a multi-head cross-attention module [32]
as follows,
m = MaxPooling(CrossAtt(H, M_ĝ)),
(10)
where H and M_ĝ act as the query and the key-value pair
in the cross-attention, respectively, and m ∈ ℝ^D denotes the
memory-enhanced strategy modeling feature.</p>
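        <p>A minimal sketch of the memory bank updates (Eqns. 6–7) together with a single-head, unprojected version of the cross-attention pooling (Eqn. 10) is given below; Python lists replace tensors, and all sizes are toy values rather than the paper's settings:</p>

```python
# Sketch of memory-enhanced strategy modeling: a per-strategy FIFO memory
# bank (Eqns. 6-7) plus a single-head, unprojected cross-attention pooling
# (a simplification of the multi-head module in Eqn. 10). Names and sizes
# are illustrative; the model stores learned pattern representations r.
import math

class MemoryBank:
    def __init__(self, num_strategies, n_max):
        self.banks = {k: [] for k in range(num_strategies)}
        self.n_max = n_max

    def store(self, k, r):
        bank = self.banks[k]
        bank.append(r)                  # concatenate r into M_k (Eqn. 6)
        if len(bank) > self.n_max:      # FIFO: keep the newest n_max rows (Eqn. 7)
            del bank[: len(bank) - self.n_max]

def _softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def memory_feature(H, M):
    """m = MaxPooling(CrossAtt(H, M)): rows of H are queries, rows of M are
    key-value pairs; max-pool the attended vectors over query positions."""
    attended = []
    for q in H:
        w = _softmax([sum(a * b for a, b in zip(q, k)) for k in M])
        attended.append([sum(wi * ki[d] for wi, ki in zip(w, M))
                         for d in range(len(M[0]))])
    return [max(col) for col in zip(*attended)]

bank = MemoryBank(num_strategies=8, n_max=3)
for step in range(5):                   # overflowing the bank exercises FIFO
    bank.store(0, [float(step), 1.0])
m = memory_feature(H=[[1.0, 0.0]], M=bank.banks[0])
```

        <p>Storing five vectors with n_max = 3 exercises the first-in-first-out rule: only the three newest representations survive in the bank.</p>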
        <sec id="sec-4-4-1">
          <title>4.3. Response Decoding</title>
          <p>In order to generate the emotionally supportive response,
we input the encoded features, i.e., the memory-enhanced
strategy modeling feature m and the knowledge-enriched
dialogue context embedding H, into the Transformer
decoder as follows,
P(ŷ_t ∣ ŷ_&lt;t, E) = Dec(ŷ_&lt;t, E),
(11)
where E = [m; H], and ŷ_&lt;t refers to the previously generated
t − 1 tokens of the target response. Dec(⋅) denotes the
decoder module. Notably, to avoid accumulated
error, we replace ŷ_&lt;t by y_&lt;t in the training phase. For
optimization, we introduce the standard cross-entropy loss
function for response generation as follows,
ℒ_gen = −(1/T) ∑_{t=1}^{T} log P(y_t ∣ y_&lt;t, E),
(12)
where T denotes the length of the target response.</p>
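          <p>The auto-regressive decoding loop can be sketched as follows, with a deterministic stub standing in for the decoder Dec(⋅) conditioned on E; all names are illustrative:</p>

```python
# Greedy auto-regressive generation sketch. stub_decoder is hypothetical;
# the real decoder conditions on the encoded representation E = [m; H].

BOS, EOS = "[BOS]", "[EOS]"

def stub_decoder(prefix, encoded):
    """Stand-in for Dec(prefix, E): returns the next token."""
    script = ["I", "would", "suggest", EOS]
    return script[min(len(prefix) - 1, len(script) - 1)]

def generate(encoded, max_steps=64):
    tokens = [BOS]
    for _ in range(max_steps):
        nxt = stub_decoder(tokens, encoded)
        tokens.append(nxt)
        if nxt == EOS:
            break
    if tokens[-1] == EOS:
        tokens = tokens[:-1]
    return tokens[1:]  # strip the BOS marker

out = generate(encoded=None)
```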
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Training Procedure</title>
        <p>Input: training set D for optimizing the model ℱ;
hyperparameters {λ_1, λ_2}. Output: parameters Θ.
Initialize the parameters Θ and the memory bank ℳ.
Randomly sample a batch from D; for each sample
(S, C, g, R), add the strategy pattern representation into the
memory bank ℳ by Eqn. (6) and update the memory bank
ℳ by Eqn. (7). Update Θ by optimizing the loss function
in Eqn. (13). Repeat until ℱ converges.</p>
        <p>The generation process aims to predict the
conditional probability distribution P(ŷ_t | ŷ_&lt;t, m, H) in an
auto-regressive manner, which means the decoder
generates the t-th word conditioned on all previously generated
words as well as the encoded representation.</p>
        <p>
          The ESConv dataset contains 1,300 long conversations and overall
38,350 utterances, with an average of 29.5 utterances in
each dialogue. For the data split, we followed the same
setting of previous work [
          <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
          ].
        </p>
        <sec id="sec-4-5-1">
          <title>5.2. Implementation Details</title>
          <p>
            Following the setting of the previous study [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], we
utilized the encoder and decoder of the pretrained BART3
provided by HuggingFace [33] to initialize the
parameters of the context encoder, the strategy predictor, and
the decoder, respectively. The number of layers in the
encoder and decoder are both 6. The dimension of the
hidden feature equals 512. To form a mini-batch, the
input sequence length is unified to 512. The hidden
dimension is 768. The category number N equals 8.
          </p>
          <p>The maximum memory storage n_max is set as 64, and the
concept filtering threshold equals 20. λ_1 and λ_2 are 0.3 and 0.1,
respectively. The batch
size is 16. We use the PPL metric on the validation set
to monitor the training progress. Empirically, it takes
around 15 epochs to reach peak performance. During
the generation stage, we use a maximum of 64 steps to
decode the responses. We adopt the AdamW [34] optimizer
with β_1 = 0.9 and β_2 = 0.999. The initial learning rate is
2e-5, and a linear learning rate scheduler with 100
warm-up steps is used to reduce the learning rate progressively.
For the framework, we use PyTorch [35] version 1.8.1
to implement the experiment code. All experiments
were conducted on an NVIDIA Tesla V100 32GB.</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>5.3. Evaluation Metrics</title>
          <p>For the comprehensive evaluation, we conducted both
automatic and human evaluations.</p>
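          <p>To make the n-gram metrics concrete, here is a toy BLEU-1 (unigram precision with a brevity penalty); actual evaluation should rely on standard toolkit implementations rather than this illustrative version:</p>

```python
# Toy BLEU-1: clipped unigram precision scaled by a brevity penalty.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped matches
    precision = overlap / len(cand)
    # the brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("i am sorry to hear that", "i am so sorry to hear about that")
```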
        </sec>
        <sec id="sec-4-5-3">
          <title>4.4. Model Training</title>
          <p>Ultimately, we combine all the loss functions for
optimizing our model as follows,
ℒ = ℒ_gen + λ_1ℒ_str + λ_2ℒ_aux,
(13)
where ℒ is the final training objective and λ_1 and λ_2 are
the non-negative hyperparameters used for balancing
the effect of each loss function on the entire training
process. The overall algorithm of the optimization is
briefly summarized in the training procedure.</p>
          <p>Automatic Evaluation. For the automatic evaluation,
we adopt several mainstream metrics commonly used
in dialogue response tasks: PPL (Perplexity), BLEU-{1,2,3,4}
(B-{1,2,3,4}) [36], ROUGE-L (R-L) [37], METEOR
(MT) [38], and CIDEr [39].</p>
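          <p>Under our reading of Eqn. (13), with λ_1 and λ_2 weighting the strategy prediction and auxiliary classification losses (0.3 and 0.1 per the implementation details), the combination is a one-liner; the loss values below are made-up floats:</p>

```python
# Joint objective (Eqn. 13): generation loss plus weighted strategy losses.
# The mapping of lambda_1/lambda_2 to the two auxiliary terms is our reading.

def total_loss(l_gen, l_str, l_aux, lam1=0.3, lam2=0.1):
    return l_gen + lam1 * l_str + lam2 * l_aux

loss = total_loss(l_gen=2.0, l_str=1.0, l_aux=0.5)
```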
          <p>5. Experiments</p>
          <p>5.1. Dataset</p>
          <p>
            We conducted experiments on the ESConv dataset [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Each sample in the dataset is a dialogue between a
help-seeker and a supporter. In addition to the context, it also
contains rich information, e.g., the situation that the
help-seeker is facing. It also provides the annotation of the
strategy category used in every supporter's response.
          </p>
          <p>Human Evaluation. Apart from the automatic
evaluation, we also consider the human evaluation, as it has
been reported [40] that automatic evaluation is
sometimes unreliable in generation tasks. We first randomly
select 70 testing samples for evaluation. Then, we
employ 2 volunteers to manually choose which response
outperforms the other one. Every sample is annotated
twice. For each case, the volunteers need to compare
the generated texts of different models according to the
following four dimensions:</p>
          <p>3https://huggingface.co/facebook/bart-base.</p>
          <p>[Case study examples. (a) Ground Truth (Self-disclosure): "I've dealt with the same problem
with my partner more than once. I love him very much and I found
him to be a 9 to my 5 as well." w/o-Mem (Reflection of Feelings): "I am sorry to hear that. I am sorry
you are feeling that way. I know how you feel." MODERN (Self-disclosure): "I can understand how that would be
difficult. I had a similar situation with my ex-boyfriend. I know
how hard it can be to let go of someone." (b) Ground Truth (Self-disclosure): "I believe there are special lamps
that you could get to help. I personally have been trying to get up a
little earlier in the morning and enjoy my coffee with the sunrise..." w/o-KG (Providing Suggestions): "I've been in that situation myself,
and I've found that it's very easy to get depressed." MODERN (Providing Suggestions): "I've found that taking a walk or
sitting down to write out a list of things that you'd like to do helps to
clear your mind."]</p>
          <p>(a)
is more fluent, correct, and coherent in grammar and
syntax; 2) Relevance: which response talks more relevantly
regarding current dialogue context; 3) Empathy: which
response is better to react with appropriate emotion
according to the user’s emotional state; 4) Information:
which response provides more suggestive information to
help solve the problem. To further control the quality of
the evaluation, we also invite an inspector to randomly
sample and double-check 10% rating results.</p>
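<p>The pairwise protocol above can be tallied mechanically. A small, hypothetical sketch (the data and helper names are illustrative, not from the paper) counting "Win"/"Tie"/"Lose" verdicts per dimension:</p>

```python
# Illustrative tally for the pairwise human evaluation: volunteers
# compare each response pair along four dimensions, and we count
# Win/Tie/Lose outcomes per dimension.

from collections import Counter

DIMENSIONS = ("Fluency", "Relevance", "Empathy", "Information")

def tally(judgments):
    """judgments: list of (dimension, verdict) pairs, verdict in Win/Tie/Lose."""
    counts = {d: Counter() for d in DIMENSIONS}
    for dim, verdict in judgments:
        counts[dim][verdict] += 1
    return counts

sample = [("Fluency", "Win"), ("Fluency", "Tie"),
          ("Information", "Win"), ("Information", "Win")]
result = tally(sample)
print(result["Information"]["Win"])  # 2
```

<p>Aggregating per dimension, rather than per sample, is what allows the later comparison of "Win" counts across Fluency, Relevance, Empathy, and Information.</p>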
        </sec>
        <sec id="sec-4-5-4">
          <title>5.4. Model Comparison &amp; Analysis</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <p>To validate the effectiveness of our model, we compare it with several state-of-the-art baselines:
• MoEL [22]. This method adopts multiple decoders as different listeners for different emotions.</p>
        <p>
          The outputs of decoders are softly combined to
generate the response.
• MIME [21]. This model shares the same
architecture as MoEL and extends it to mimic the
speaker’s emotion.
• DialoGPT-Joint [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This model is built on a
pre-trained dialog agent DialoGPT [41]. It first
predicts a strategy and prepends a special token
before the response sentence to control the
generation under that strategy.
• BlenderBot-Joint [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This model adopts the
same strategy prediction and generation scheme
as DialoGPT. Different from the former one, it is
built on a pre-trained conversational response
generation model named BlenderBot [42].
• MISC [24]. This model also adopts BlenderBot as the backbone. It leverages common sense knowledge to enhance the understanding of the speaker's emotional state, and the response is generated conditioned on a mixture of strategy distribution.
        </p>
        <p>
          • GLHG [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This model has a global-to-local hierarchical graph structure. It models the global cause and the local intention of the speaker to provide more supportive responses.
• PoKE [23]. This work utilizes a Conditional Variational Autoencoder [43] to model the mixed strategy.
• FADO [25]. This work devises a dual-level feedback strategy selector to encourage or penalize strategies during the strategy selection process.
• MultiESC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This work proposes lookahead heuristics to estimate the future strategy and captures users' subtle emotional expressions with the NRC VAD lexicon [44] for user state modeling.
        </p>
        <p>
          5.5. Ablation Study We compare the original MODERN model with the following derivatives to demonstrate that all the designed modules are essential for the ESConv task. 1) w/o-ℒ_str. To show the benefit of the constraint for strategy prediction, we removed the corresponding loss function by setting λ1 = 0 in Eqn. (13). 2) w/o-ℒ_cls. To show the effect of the auxiliary strategy classification task, we removed the corresponding loss function by setting λ2 = 0 in Eqn. (13). 3) w/o-Mem. In this derivative, we disabled the memory module, which stores pattern representations of different strategies. 4) w/o-Emo. In this derivative, we removed the change-aware emotion detection module. And 5) w/o-KG. We discarded the context-related concept reasoning and selection component.
        </p>
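<p>The five derivatives differ from the full model by one switch each. A schematic, hypothetical configuration (the flag names are illustrative; the paper does not publish such a config) could look like:</p>

```python
# Schematic view of how the ablated derivatives might be configured:
# each variant zeroes one loss weight or disables one module.

BASE = {"lambda1": 1.0, "lambda2": 1.0,
        "use_memory": True, "use_emotion": True, "use_kg": True}

ABLATIONS = {
    "w/o-L_str": dict(BASE, lambda1=0.0),     # drop strategy-prediction loss
    "w/o-L_cls": dict(BASE, lambda2=0.0),     # drop strategy-classification loss
    "w/o-Mem":   dict(BASE, use_memory=False),  # disable memory module
    "w/o-Emo":   dict(BASE, use_emotion=False), # disable emotion detection
    "w/o-KG":    dict(BASE, use_kg=False),      # disable concept reasoning
}

print(ABLATIONS["w/o-Mem"]["use_memory"])  # False
```

<p>Expressing each derivative as a single-flag change from the base configuration keeps every ablation comparable to the full model.</p>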
        <p>Automatic Evaluation We compared our model with the above baselines using automatic metrics, and the results are reported in Table 1. As we can see, 1) MODERN outperforms the baselines in most metrics, which is a powerful proof of the effectiveness of the proposed method. 2) The models with the BART backbone (MultiESC and MODERN) surpass the baselines with the BlenderBot [42] backbone (BlenderBot-Joint, MISC, GLHG, FADO, and PoKE) across most of the metrics, despite the latter being pre-trained on empathy-related data. One possible explanation is that the ESConv task requires sophisticated linguistic knowledge (e.g., correct grammar and wording appropriate to the current situation) in addition to empathetic ability. 3) Our MODERN consistently exceeds MultiESC with the same BART backbone. This suggests that the BART model with large-scale pretraining still requires strategy pattern information and knowledge (emotion and concepts) to further facilitate supportive response generation.</p>
        <p>We provide the ablation study results on the ESConv dataset in Table 3 in terms of all metrics. From this table, we have the following observations: 1) Our MODERN consistently outperforms w/o-Mem, especially on the BLEU metrics (B-{1,2,3,4}), which suggests that the memory-enhanced strategy modeling module can provide sufficient linguistic pattern references and hence boost the performance of generating responses in accordance with specific strategy categories. 2) MODERN exceeds w/o-ℒ_str across all metrics. This verifies that it is indispensable to constrain the model to predict and respond under proper strategies. 3) w/o-ℒ_cls obtains a slightly worse result than MODERN, which demonstrates that the strategy classification auxiliary task indeed helps guide the pattern representation learning. And 4) w/o-Emo and w/o-KG both perform worse than MODERN, which demonstrates the importance of change-aware emotion and context-related concepts. Notably, w/o-KG surpasses w/o-Emo. One possible explanation is that being aware of the dynamic emotional changes during the conversation facilitates the model to provide empathy and emotional support accordingly.</p>
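<p>Since the ablation discussion leans on the BLEU metrics [36], the unigram core of BLEU-1 can be sketched as follows. This is a minimal illustration of clipped unigram precision only; the full metric additionally clips over multiple references, uses higher-order n-grams, and applies a brevity penalty:</p>

```python
# Minimal sketch of the clipped unigram precision at the core of BLEU-1.

from collections import Counter

def bleu1_precision(candidate, reference):
    """Clipped unigram precision of a candidate against one reference."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Each candidate word counts at most as often as it occurs in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

p = bleu1_precision("i am sorry to hear that", "i am so sorry to hear that")
print(round(p, 2))  # 1.0
```

<p>Clipping prevents a candidate from inflating its score by repeating a reference word more often than the reference contains it.</p>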
        <p>Human Evaluation For human evaluation, we report the comparison results between our model and the two best baselines (FADO and MultiESC) in Table 2. In particular, for each pair of model comparisons and each metric, we show the number of samples where our model achieves better (denoted as "Win"), equal (denoted as "Tie"), and worse (denoted as "Lose") performance compared with the baselines. As seen, MODERN outperforms all baselines across the different evaluation metrics, as the number of "Win" cases is always significantly larger than that of "Lose" cases in each pair of model comparisons, which is consistent with the results in Table 1. In addition, the number of "Win" cases is the largest for the Information metric compared with the other metrics, which demonstrates that integrating context-related concepts can supply meaningful information for emotional support.</p>
        <sec id="sec-4-6-1">
          <title>5.6. Case Study</title>
          <p>We illustrate several conversations from the test set in Figure 3 to give an intuitive understanding of our model. We show two samples and compare the responses generated by MODERN and two derivatives, w/o-Mem and w/o-KG. As can be seen in case (a), MODERN manages to respond with the strategy of Self-disclosure and generates a contextually appropriate response. Being equipped with a memory-enhanced strategy modeling module, MODERN shares a similar experience closely related to the seeker's problem, a relationship issue, whereas the w/o-Mem model generates a plain and monotonous response that is not very relevant to the user's current issue. The other case (b) demonstrates how MODERN benefits from external knowledge. Based on the situation that the seeker mentions feeling depressed, MODERN leverages the context-related concepts and effectively associates this emotional status with practical suggestions (taking a walk or sitting down to write...). Without the relevant knowledge, the response generated by the w/o-KG derivative is relatively vague and less specific, and thus insufficient to benefit the user's situation.</p>
          <p>Ethical Considerations The dataset used in our work is a publicly available dataset that has been widely used in the field of emotional support conversation. Sensitive and personally identifiable information was filtered during the construction of the dataset. Our model focuses on the informal emotional support provided in friends' daily chats and does not provide professional mental health diagnosis or counseling services. The use of this model should be avoided for patients with serious mental illnesses, such as in self-harm-related conversations, in order to prevent triggering serious consequences.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>In this paper, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation, dubbed MODERN, which can perceive fine-grained emotional changes in the conversation, utilize concepts from the knowledge graph to facilitate generating responses with practical suggestions, and model concrete strategy semantic patterns with a memory bank mechanism. Both automatic and human evaluation results show that our model surpasses the state-of-the-art methods in emotional support conversation. In addition, the ablation study demonstrates the effectiveness of each component of our model.</p>
    </sec>
    <sec id="sec-6">
      <title>Limitations</title>
      <p>The ESConv task requires the dialogue agent to reveal some information about itself. For example, one of the strategies, called Self-disclosure, expects the agent to cite its own experience. However, in our experiments, we observed that the current model often struggles to maintain a consistent personality. We speculate that this may be because the supporter role in the full training sample is played by multiple individuals, so there is no uniform character experience and story, which leads to inconsistent personal experiences during the conversation. We believe that how to make the dialogue agent show coherent and unified personal information and experiences in the ESConv task deserves the attention of future work.
</p>
      <p>References
[7] Y. Ding, J. Liu, X. Zhang, Z. Yang, Dynamic tracking of state anxiety via multi-modal data and machine learning, Frontiers in Psychiatry 13 (2022).
[8] J. Greene, B. Burleson, Handbook of Communication and Social Interaction Skills, American Psychological Association, 2003.
[9] A. C. High, K. R. Steuber, An examination of support (in)adequacy: Types, sources, and consequences of social support among infertile women, Communication Monographs 81 (2014).
[10] B. Wei, S. Lu, L. Mou, H. Zhou, P. Poupart, G. Li, Z. Jin, Why do neural dialog systems generate short and meaningless replies? A comparison between dialog and translation, in: International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 7290–7294.
[11] Y. Liu, W. Bi, J. Gao, X. Liu, J. Yao, S. Shi, Towards less generic responses in neural conversation models: A statistical re-weighting method, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 2769–2774. URL: https://aclanthology.org/D18-1297. doi:10.18653/v1/D18-1297.
[12] C. E. Hill, Helping skills: Facilitating exploration, insight, and action, American Psychological Association, 2009.
[13] C. Zheng, Y. Liu, W. Chen, Y. Leng, M. Huang, CoMAE: A multi-factor hierarchical framework for empathetic response generation, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, Association for Computational Linguistics, 2021, pp. 813–824.
[14] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2020, pp. 7871–7880.
[15] H. Zhou, M. Huang, T. Zhang, X. Zhu, B. Liu, Emotional chatting machine: Emotional conversation generation with internal and external memory, in: Proceedings of the AAAI Conference on Artificial Intelligence, the Innovative Applications of Artificial Intelligence, and the AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI Press, 2018, pp. 730–739.
[16] H. Rashkin, E. M. Smith, M. Li, Y. Boureau, Towards empathetic open-domain conversation models: A new benchmark and dataset, in: Proceedings of the Conference of the Association for Computational Linguistics, ACL, 2019, pp. 5370–5381.
[17] L. Shen, Y. Feng, CDL: Curriculum dual learning for emotion-controllable response generation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2020, pp. 556–566.
[18] M. Y. Chen, S. Li, Y. Yang, EmpHi: Generating empathetic responses with human-like intents, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2022, pp. 1063–1074.
[19] W. Kim, Y. Ahn, D. Kim, K. Lee, Emp-RFT: Empathetic response generation via recognizing feature transitions between utterances, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2022, pp. 4118–4128.
[20] Q. Li, H. Chen, Z. Ren, P. Ren, Z. Tu, Z. Chen, EmpDG: Multi-resolution interactive empathetic dialogue generation, in: Proceedings of the International Conference on Computational Linguistics, International Committee on Computational Linguistics, 2020, pp. 4454–4466.
[21] N. Majumder, P. Hong, S. Peng, J. Lu, D. Ghosal, A. F. Gelbukh, R. Mihalcea, S. Poria, MIME: Mimicking emotions for empathetic response generation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2020, pp. 8968–8979.
[22] Z. Lin, A. Madotto, J. Shin, P. Xu, P. Fung, MoEL: Mixture of empathetic listeners, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, ACL, 2019, pp. 121–132.
[23] X. Xu, X. Meng, Y. Wang, PoKE: Prior knowledge enhanced emotional support conversation with latent variable, CoRR abs/2210.12640 (2022).
[24] Q. Tu, Y. Li, J. Cui, B. Wang, J. Wen, R. Yan, MISC: A mixed strategy-aware model integrating COMET for emotional support conversation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2022, pp. 308–319.
[25] W. Peng, Z. Qin, Y. Hu, Y. Xie, Y. Li, FADO: Feedback-aware double controlling network for emotional support conversation, Knowledge-Based Systems 264 (2023) 110340.
[26] W. J. Reynolds, B. Scott, Empathy: A crucial component of the helping relationship, Journal of Psychiatric and Mental Health Nursing 6 (1999) 363–370.
[27] B. Liu, S. S. Sundar, Should machines express sympathy and empathy? Experiments with a health advice chatbot, Cyberpsychology, Behavior, and Social Networking 21 (2018) 625–636.
[28] T. Parkin, A. de Looy, P. Farrand, Greater professional empathy leads to higher agreement about decisions made in the consultation, Patient Education and Counseling 96 (2014) 144–150.
[29] Q. Li, P. Li, Z. Ren, P. Ren, Z. Chen, Knowledge bridging for empathetic dialogue generation, in: The Conference on Artificial Intelligence, Conference on Innovative Applications of Artificial Intelligence, The Symposium on Educational Advances in Artificial Intelligence, AAAI Press, 2022, pp. 10993–11001.
[30] Y. Liu, W. Maier, W. Minker, S. Ultes, Empathetic dialogue generation with pre-trained RoBERTa-GPT2 and external knowledge, in: Conversational AI for Natural Human-Centric Interaction: International Workshop on Spoken Dialogue System Technology, volume 943 of Lecture Notes in Electrical Engineering, Springer, 2021, pp. 67–81.
[31] L. Jing, X. Song, X. Lin, Z. Zhao, W. Zhou, L. Nie, Stylized data-to-text generation: A case study in the e-commerce domain, ACM Trans. Inf. Syst. (2023).
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, 2017, pp. 5998–6008.
[33] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019).
[34] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, CoRR abs/1711.05101 (2017).
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems, 2019, pp. 8024–8035.
[36] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2002, pp. 311–318.
[37] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[38] A. Lavie, A. Agarwal, METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 228–231.
[39] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 4566–4575.
[40] N. Schluter, The limits of automatic summarisation according to ROUGE, in: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, ACL, 2017, pp. 41–45.
[41] Y. Zhang, S. Sun, M. Galley, Y.-C. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, B. Dolan, DialoGPT: Large-scale generative pre-training for conversational response generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, 2020, pp. 270–278.
[42] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, E. M. Smith, Y.-L. Boureau, J. Weston, Recipes for building an open-domain chatbot, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, 2021, pp. 300–325.
[43] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, 2015, pp. 3483–3491.
[44] S. M. Mohammad, Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2018, pp. 174–184.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Faizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zadran</surname>
          </string-name>
          ,
          <article-title>Emotional support received moderates academic stress and mental well-being in a sample of afghan university students amid covid-</article-title>
          19,
          <source>International Journal of Social Psychiatry</source>
          <volume>68</volume>
          (
          <year>2022</year>
          )
          <fpage>1748</fpage>
          -
          <lpage>1755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , F.-p.
          <article-title>Chen, Relationships of family emotional support and negative family interactions with the quality of life among chinese people with mental illness and the mediating efect of internalized stigma</article-title>
          ,
          <source>Psychiatric Quarterly</source>
          <volume>92</volume>
          (
          <year>2021</year>
          )
          <fpage>375</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Demasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sabour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Towards emotional support dialog systems</article-title>
          ,
          <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing</source>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>3469</fpage>
          -
          <lpage>3483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation</article-title>
          ,
          <source>in: Proceedings of the International Joint Conference on Artificial Intelligence, ijcai.org</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4324</fpage>
          -
          <lpage>4330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Improving multi-turn emotional support dialogue generation with lookahead strategy planning</article-title>
          ,
          <source>in: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>3014</fpage>
          -
          <lpage>3026</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <article-title>Conceptnet 5.5: An open multilingual graph of general knowledge</article-title>
          ,
          <source>in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          , AAAI Press,
          <year>2017</year>
          , pp.
          <fpage>4444</fpage>
          -
          <lpage>4451</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>