=Paper=
{{Paper
|id=Vol-3649/paper2
|storemode=property
|title=Knowledge-enhanced Memory Model for Emotional Support Conversation
|pdfUrl=https://ceur-ws.org/Vol-3649/Paper2.pdf
|volume=Vol-3649
|authors=Mengzhao Jia,Qianglong Chen,Liqiang Jing,Dawei Fu,Renyu Li
|dblpUrl=https://dblp.org/rec/conf/aaai/JiaCJFL24
}}
==Knowledge-enhanced Memory Model for Emotional Support Conversation==
Mengzhao Jia¹, Qianglong Chen², Liqiang Jing³, Dawei Fu⁴ and Renyu Li⁴,∗
¹ Shandong University, China
² Zhejiang University, China
³ University of Texas at Dallas, TX, USA
⁴ Alibaba Group, China
Abstract
The prevalence of mental disorders has become a significant issue, leading to the increased focus on Emotional Support
Conversation as an effective supplement for mental health support. Existing methods have achieved compelling results;
however, they still face three challenges: 1) variability of emotions, 2) practicality of the response, and 3) intricate strategy
modeling. To address these challenges, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt
coNversation (MODERN). Specifically, we first devise a knowledge-enriched dialogue context encoding to perceive the
dynamic emotion change of different periods of the conversation for coherent user state modeling and select context-related
concepts from ConceptNet for practical response generation. Thereafter, we implement a novel memory-enhanced strategy
modeling module to model the semantic patterns behind the strategy categories. Extensive experiments on a widely used
large-scale dataset verify the superiority of our model over cutting-edge baselines.
Keywords
Mental disorders, Emotional Support Conversation, Mental health support, ConceptNet
Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
∗ Corresponding author.
jiamengzhao98@gmail.com (M. Jia); chenqianglong@zju.edu.cn (Q. Chen); jingliqiang6@gmail.com (L. Jing); dawei.fdw@alibaba-inc.com (D. Fu); renyu.rl@alibaba-inc.com (R. Li)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Mental disorders are known for their high burden: more than 50% of adults experience a mental illness or disorder at some point in their lives, yet despite this high prevalence, only one in five patients receives professional treatment¹. Recent studies have shown that an effective mental health intervention method is the provision of emotional support conversations [1, 2]. As such, Emotional Support Conversation (ESConv), as defined by Liu et al. [3], has garnered substantial attention in recent years. It has emerged as a promising alternative strategy for mental health intervention, paving the way for the development of neural dialogue systems designed to provide support for those in need.

¹ https://tinyurl.com/4r8svsst.

An ESConv takes place between a help-seeker (user) and a supporter (dialogue model) in a multi-turn manner. It requires the dialogue model to employ a range of supportive strategies effectively, easing the emotional distress of the users and helping them overcome the challenges they face. Prior research primarily concentrated on two aspects. The first aimed to enhance the model's comprehension of the contextual semantics in the conversations, such as the user's situation, emotions, and intentions. An example of these efforts is the work of Peng et al. [4], who designed a hierarchical graph network to capture the overall emotional problem cause and specific user intentions. The second aspect focused on predicting the dialogue strategy accurately and responding based on the predicted strategy category. For example, Cheng et al. [5] employed lookahead heuristics for dialogue strategy planning and selection.

Despite the success of existing studies, this task is non-trivial due to the following three challenges.

• Variability of emotions. As the conversation progresses, the user's emotional state evolves subtly and constantly. Accurately recognizing the emotional change is indispensable to understanding the user's real-time state and thus responding empathically [7, 8, 9]. How to model the dynamic emotional change during the dialogue process is the first challenge.

• Practicality of the response. In the absence of explicit cues, neural dialogue systems are inclined to make generic responses [10, 11]. As shown in Figure 1, generic responses are insufficient to provide personalized and suitable suggestions for the user's specific concerns. To resolve this issue, introducing context-related concepts (doctor and recover) can promote generating more meaningful and actionable suggestions for specific situations. As a result, the integration of appropriate concepts poses a non-trivial challenge in generating practical responses.
Figure 1: Illustration of an emotional support conversation example. The words with a green background are the retrieved concepts from ConceptNet [6]. (The example contrasts a practical response using the retrieved concepts — "From my knowledge, people usually recover from COVID after 2 to 3 weeks, you might consider having a doctor see your mother if she is still struggling." — with generic responses such as "Don't worry, she will be fine." and "I'm sorry to hear about it.")

• Intricate strategy modeling. Dialogue strategy, as a kind of linguistic pattern, has been reported to be a highly complex concept encompassing various intricate linguistic features [12, 13]. Previous work resorted to a single vector (a category indicator) for strategy representation, which is insufficient to fully represent the complex strategy pattern information. Therefore, how to model strategy information sufficiently is the third challenge.

To overcome these challenges, as shown in Figure 2, we introduce a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation, dubbed MODERN. In particular, MODERN adopts BART [14] as its backbone and consists of the knowledge-enriched dialogue context encoding module, the memory-enhanced strategy modeling module, and the response decoding module. To capture the emotional change as the conversation progresses, the first module detects the emotions of all utterances and explicitly injects them into the dialogue context as a kind of emotional knowledge, thus understanding the user's status coherently. In addition, this module also introduces a concepts reasoning and selection component to acquire valid context-related concepts from a knowledge graph called ConceptNet [6] and incorporate them into the dialogue context to fulfill meaningful and practical suggestion generation. Moreover, in contrast to existing studies that depend on simplistic indicators to represent strategy categories, the memory-enhanced strategy modeling module learns strategy patterns through a strategy-specific memory bank. In this way, it can detect and mimic the intricate patterns in human emotional support strategies. Finally, the third module aims to generate the target response with the BART decoder.

Our contributions can be summarized as follows: 1) We first analyze the current challenges of the ESConv task and accordingly propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation, named MODERN, which can model complex supportive strategies as well as utilize emotional knowledge and context-related concepts to perceive the variability of emotions and provide practical support advice. 2) We propose a memory-enhanced strategy modeling module, where a unique memory bank is designed to model intricate strategy patterns, and an auxiliary strategy classification task is introduced to distill the strategy pattern information. 3) We present a thorough validation and evaluation of our model, providing an in-depth analysis of the results and a comparison with other models. The extensive experiments on the ESConv dataset [3] demonstrate that MODERN achieves state-of-the-art performance under both automatic and human evaluations. The code is available at https://projs2release.wixsite.com/modern.

2. Related Work

2.1. Emotional and Empathetic Dialogue Systems

With the popularity and growing success of dialogue systems, much research has recently endeavored to empower the system to reply with a specific and proper emotion, thereby forming a more human-like conversation. In particular, two research directions have attracted researchers' interest, namely emotional [15] and empathetic [16] response generation. The former direction expects the dialogue agent to respond with a given emotion [17, 18, 19, 20], while the latter requires the dialogue system to actively detect and understand the user's emotions and then respond with an appropriate emotion [21]. For example, Lin et al. utilized multiple decoders as different listeners to react to different emotions and then softly combined the output states of the decoders based on the recognized user's emotion. Nevertheless, unlike the above directions, the ESConv task concentrates on alleviating users' negative emotion intensity and providing supportive instructions to help them overcome struggling situations.

2.2. Emotional Support Conversation

As an emerging research task, emotional support conversation has gradually attracted intense attention in recent years. Existing works mostly focus on two aspects. The first is to understand the complicated user emotions and
intentions in the dialogue context. Specifically, they explored context semantic relationships [23, 4], commonsense knowledge [24, 4], or emotion causes [5] to better capture and understand the emotions and intentions of users. The other trend in addressing the task is to predict the strategy category accurately so as to respond in accordance with it [25, 23]. For instance, Tu et al. first proposed to predict a strategy probability distribution and generate the response guided by a mixture of multiple strategies. Despite their remarkable achievements, existing work still faces three challenges: intricate strategy modeling, variability of emotions, and practicality of the responses.

Figure 2: Illustration of the proposed MODERN framework, which consists of three key components: Strategy Memory-enhanced Dialogue Context Encoding, Multi-source Knowledge Injection, and Response Decoding.

3. Task Formulation

For concise mathematical expression, we first declare some notations used in this paper. We use bold capital letters (X) and bold lowercase letters (x) to represent matrices and vectors, respectively. We adopt non-bold letters (𝑁) to denote scalars, and Greek letters (𝜆) refer to hyperparameters. All vectors, if not clarified, are in column form.

In the setting of emotional support conversation, the dialogue involves a help-seeker and a supporter, where the latter tries to comfort and support the help-seeker to lower his/her emotional intensity level. The target is to generate a response based on the dialogue history. Besides, the supporter is required to select one of 𝐺 strategies and respond accordingly. Suppose we have a training dataset 𝒫 = {𝑝_1, 𝑝_2, …, 𝑝_𝑉} composed of 𝑉 samples. Each sample 𝑝_𝑖 = {𝑡_𝑖, 𝒟_𝑖, 𝑔_𝑖, 𝑅_𝑖} includes a seeker's situation 𝑡_𝑖, a dialogue context 𝒟_𝑖, a support strategy 𝑔_𝑖, and a target response 𝑅_𝑖. Therein, 𝒟_𝑖 contains a sequence of 𝑁_𝑢^𝑖 history utterances between the user and the supporter, denoted as 𝒟_𝑖 = (𝑢_1^𝑖, 𝑢_2^𝑖, …, 𝑢_{𝑁_𝑢^𝑖}^𝑖). 𝑅_𝑖 = (𝑟_1^𝑖, 𝑟_2^𝑖, …, 𝑟_{𝑁_𝑟^𝑖}^𝑖) is the supportive response with 𝑁_𝑟^𝑖 tokens. The goal of the ESConv task is to learn a model ℱ that can generate a supportive response 𝑅̂_𝑖 referring to the input context 𝒟_𝑖 and situation 𝑡_𝑖 as follows,

𝑅̂_𝑖 = ℱ(𝒟_𝑖, 𝑡_𝑖 | Θ),   (1)

where Θ refers to the set of to-be-learned parameters of the model ℱ. Notably, the ground-truth strategy 𝑔_𝑖 can be utilized in the model training stage but is not available and needs to be predicted in the inference phase. For brevity, we temporarily omit the superscript 𝑖 of the 𝑖-th sample in the rest of this paper.

4. Method

In this section, we detail the proposed model MODERN, which comprises three main components: knowledge-enriched dialogue context encoding, memory-enhanced strategy modeling, and response decoding, as demonstrated in Figure 2.

4.1. Knowledge-enriched Dialogue Context Encoding

In this section, we first utilize an emotion detector to recognize fine-grained emotions of utterances as emotional knowledge for capturing emotional change. In addition, we select related concepts from the ConceptNet knowledge graph for meaningful and practical suggestion generation.
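As a rough illustration of the two steps just outlined, the knowledge-enriched input sequence [𝑡; 𝐼; 𝐶] can be assembled as below. This is a sketch, not the authors' code: the separator token, the emotion labels, and the concept list are stand-ins for the pretrained detector outputs and the ConceptNet selection described in the following subsections.

```python
# Sketch: build the knowledge-enriched input [t; I; C], where I
# interleaves each utterance with its detected emotion word (Sec. 4.1).

SEP = "</s>"  # BART-style separator token (assumed)

def build_input(situation, utterances, emotions, concepts):
    """situation: str; utterances/emotions: parallel lists of str;
    concepts: context-related concepts, assumed already filtered."""
    assert len(utterances) == len(emotions)
    # I = [u_1; e_1; SEP; ...; u_N; e_N; SEP]
    tagged = [f"{u} {e} {SEP}" for u, e in zip(utterances, emotions)]
    # Concatenate situation t, emotion-aware context I, and concepts C.
    return f"{situation} {SEP} {' '.join(tagged)} {' '.join(concepts)}"

example = build_input(
    "My mother in law is sick with COVID.",
    ["Yeah, it's just really stressful.", "She is doing quite a bit better..."],
    ["anxious", "relief"],   # hypothetical emotion-detector outputs
    ["doctor", "recover"],   # hypothetical ConceptNet selections
)
```

The resulting flat token sequence is what the Transformer-based encoder of Eqn. (3) would consume; injecting emotions as plain words keeps the input format compatible with the pretrained BART vocabulary, as the paper argues.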
4.1.1. Change-aware Emotion Detection

Psychiatric and mental health studies have proved that empathy is essential to emotionally helping relationships [26, 27, 28], and fine-grained emotional information is one of the key factors for enhancing empathetic ability [29]. Apart from static emotional signals, dynamic emotional changes during the conversation progress are also beneficial. Concretely, change-aware emotion information enables the model to understand the user's status coherently. Inspired by this, we devise to identify the user's fine-grained emotions and perceive the dynamic changes of emotions in the dialogue context.

Specifically, we start by obtaining the fine-grained emotion via an off-the-shelf pretrained emotion detector², which can recognize up to 28 different emotional categories. We apply the detector to every utterance in the dialogue context for emotion recognition as follows,

𝑒_𝑗 = Emo(𝑢_𝑗), 0 ≤ 𝑗 ≤ 𝑁_𝑢,   (2)

where 𝑒_𝑗 is the predicted emotional category word representing the detected emotion in 𝑢_𝑗. Thereafter, we directly inject the natural language form of the emotion category words into the dialogue context as additional emotion knowledge. This practice aligns closely with the input format of the pretrained BART model. Moreover, it also avoids introducing unnecessary parameters that would interfere with the generative model learning. Empirically, we concatenate the emotional category into the sequence of dialogue context tokens, denoted as 𝐼 = [𝑢_1; 𝑒_1; SEP; …; 𝑢_𝑗; 𝑒_𝑗; SEP; …; 𝑢_{𝑁_𝑢}; 𝑒_{𝑁_𝑢}]. In this way, the dynamic emotional changes corresponding to the dialogue progress can be coherently exploited by the emotional support model.

² https://huggingface.co/arpanghoshal/EmoRoBERTa.

4.1.2. Context-related Concepts Reasoning and Selection

Considering that ConceptNet involves abundant general human knowledge, which plays an important role in understanding human situations and associating them with practical suggestions, we select potentially useful context-related concepts to enrich the model to generate responses with high informativeness. For instance, the commonsense knowledge base can easily relate failing an exam to academic stress. In terms of daily human activities, this knowledge serves as a guide for dealing with daily affairs and problems. Such knowledge is useful for providing advice and guidance in the emotional support system. Therefore, we mine and associate commonsense knowledge of the data to provide potentially useful information for generating instructive responses.

Concretely, the ConceptNet knowledge graph involves 3.1 million concepts and 38 million relations, which can be used to mine the underlying concept information in the dialogue context. We first identify all the concepts in ConceptNet that are mentioned in the dialogue context and remove the top-𝐾 most frequent concepts in the training set, because these words are usually too general to provide valid suggestions for a specific situation, such as help, thing, and feeling. In this way, we derive 𝑁_𝑐 concepts, denoted as {𝑐_1, ⋯, 𝑐_{𝑁_𝑐}}.

Thereafter, we leverage the 𝑁_𝑐 concepts as anchors to reason about related concepts. Specifically, for each anchor concept 𝑐, we retrieve all its one-hop neighboring concept-relation pairs from ConceptNet. Mathematically, let 𝒩(𝑐) be the set of neighboring concept-relation pairs of the anchor concept 𝑐. Then the context-related concept-relation pairs can be represented as {𝒩(𝑐_1), 𝒩(𝑐_2), ⋯, 𝒩(𝑐_{𝑁_𝑐})}, where 𝒩(𝑐_𝑖) = {(𝑎_1^{𝑐_𝑖}, ℎ_1^{𝑐_𝑖}), ⋯, (𝑎_{𝑁_{𝑐_𝑖}}^{𝑐_𝑖}, ℎ_{𝑁_{𝑐_𝑖}}^{𝑐_𝑖})} is a set of 𝑁_{𝑐_𝑖} retrieved <concept, relation> pairs. Following Liu et al. [30], to avoid introducing extraneous concepts, we filter out concepts with the excluded relations Antonym, ExternalURL, and NotCapableOf. Finally, we concatenate the concepts in the filtered pairs and deem them the context-related concepts 𝐶.

4.1.3. Knowledge-enriched Dialogue Context Embedding

To acquire the representation of the dialogue context and the corresponding knowledge, we encode the dialogue context sequence (user situation 𝑡, emotion-aware dialogue context 𝐼, and concepts 𝐶) with a Transformer-based encoder as follows,

H = Enc_𝑐([𝑡; 𝐼; 𝐶]),   (3)

where H ∈ ℝ^{𝐿×𝑑} denotes the hidden contextual representation, and 𝐿 and 𝑑 refer to the number of tokens in [𝑡; 𝐼; 𝐶] and the hidden dimension of the representation, respectively.

4.2. Memory-enhanced Strategy Modeling

During the conversation, the supporter adopts different strategies for different purposes, ultimately achieving the goal of reducing the intensity of the user's negative emotions. For example, using the Question strategy helps the supporter to explore the concrete situation faced by the user, while the Reflection of Feelings strategy conveys the supporter's understanding of the user's current emotions. Existing work constrains the model to responding under a strategy category by simply providing a single vector indicating the strategy's name or description. However, the semantic patterns of strategies are highly complex, and the name or description alone cannot capture the specific linguistic patterns (expression manner, wording, and phrasing) of the strategy. Therefore,
inspired by [31], we propose to disentangle the strategy patterns from same-strategy responses to provide more specific guidance for the strategy-constrained generation.

4.2.1. Strategy Pattern Modeling

We first acquire the strategy pattern representation of each response in the training set via a strategy pattern extractor Enc_𝑟 as follows,

r = MaxPooling(Enc_𝑟(𝑅)),   (4)

where r ∈ ℝ^𝑑 denotes the strategy pattern representation. Meanwhile, in order to accurately capture the strategy pattern information and avoid irrelevant disturbance, we design an auxiliary strategy classification task to guide the extractor to map more strategy-related information into the pattern representation. The auxiliary objective ℒ_𝑟 is derived by the following loss function,

ℒ_𝑟 = −log 𝑝(𝑔 | r).   (5)

4.2.2. Strategy-specific Memory Bank

To utilize detailed and ample strategy pattern information, instead of using a single representation vector, we devise a memory bank mechanism to store multiple pattern representations according to their strategy categories. In particular, we denote the memory bank as ℳ = {M_1, …, M_𝐺}, in which M_𝑔 ∈ ℝ^{𝑁_𝑠^𝑔 × 𝑑} is a matrix of 𝑁_𝑠^𝑔 pattern representations corresponding to the 𝑔-th strategy category, and 𝐺 is the total number of strategy categories. 𝑁_𝑠^𝑔 is 0 at the initial training step. As the training progresses, 𝑁_𝑠^𝑔 continues to increase until the maximum threshold 𝑁_𝑚 is reached. In particular, we store the pattern representation of the 𝑔-th strategy category into the corresponding M_𝑔 by concatenation as follows,

M_𝑔 ⟵ [M_𝑔; r_𝑔],   (6)

where r_𝑔 denotes a representation belonging to the 𝑔-th strategy category, and [⋅; ⋅] refers to the concatenation operation. As the representations are optimized along with the classification training process, we dynamically update M_𝑔 in a first-in-first-out manner as follows,

M_𝑔 = { M_𝑔[𝑁_𝑠^𝑔 − 𝑁_𝑚 : 𝑁_𝑠^𝑔],  if 𝑁_𝑠^𝑔 > 𝑁_𝑚
      { M_𝑔,                      otherwise,   (7)

where 𝑁_𝑚 and 𝑁_𝑠^𝑔 denote the maximum storage limit and the current storage volume of each memory matrix, respectively. The algorithm of the memory storing and updating operation is summarized in the appendix.

4.2.3. Strategy Prediction

In order to use the strategy pattern information in the memory bank, the model needs to select a proper strategy category based on the dialogue context. To achieve this, we leverage a strategy predictor, which aims to capture information relevant to strategy decisions in the context.

The strategy predictor is composed of a Transformer-based encoder and a classification module. The encoder first encodes the dialogue context into a strategy prediction representation. It is worth noting that we adopt independent representations for the strategy prediction and response generation tasks, considering the fact that jointly optimal solutions may not always exist for different tasks. Subsequently, the classification module maps this vector to a 𝐺-dimensional vector, which is regarded as the probability distribution over the 𝐺 strategy types. Formally, the strategy prediction can be written as follows,

s = MaxPooling(Enc_𝑠(𝐼)),
𝑔̂ = argmax(𝜎(MLP(s))),   (8)

where Enc_𝑠 is a Transformer-based encoder, the strategy prediction representation is denoted as s ∈ ℝ^𝑑, and MLP and 𝜎(⋅) are a multi-layer perceptron and the Sigmoid function, respectively. The argmax operation is used to obtain the predicted strategy category 𝑔̂. We use the following objective to optimize the strategy prediction task,

ℒ_𝑠 = −log 𝑝(𝑔 | s).   (9)

4.2.4. Memory-enhanced Encoding

After predicting the strategy category 𝑔̂, instead of directly using 𝑔̂ as an indicator to constrain the response generation, we integrate the corresponding memory bank matrix M_𝑔 and the context representation, so as to fully exploit the abundant pattern information of the particular strategy.

Empirically, we fuse the matrix and the context representation with a multi-head cross-attention module [32] as follows,

m = MaxPooling(CrossAtt(H, M_𝑔)),   (10)

where H and M_𝑔 act as the query and the key-value pair in the cross-attention, respectively, and m ∈ ℝ^𝑑 denotes the memory-enhanced strategy modeling feature.

4.3. Response Decoding

In order to generate the emotionally supportive response, we input the encoded features, i.e., the memory-enhanced strategy modeling feature m and the knowledge-enriched dialogue context embedding H, into the Transformer
decoder. The generation process aims to predict the conditional probability distribution 𝑝(𝑟̂_𝑙 | 𝑟̂_{<𝑙}, m, H) in an auto-regressive manner, which means the decoder generates the 𝑙-th word conditioned on all previously generated words as well as the encoded representation. Formally, we deploy the decoding process, which predicts the conditional probability distribution over each token in the target response, as follows,

𝑝(𝑟̂_𝑙 ∣ 𝑟̂_{<𝑙}, E) = Dec(𝑟̂_{<𝑙}, E),   (11)

where E = [m; H], 𝑟̂_{<𝑙} refers to the previously generated 𝑙−1 tokens of the target response, and Dec(⋅) denotes the decoder module. Notably, to avoid accumulated errors, we replace 𝑟̂_{<𝑙} with the ground-truth prefix 𝑟_{<𝑙} in the training phase. For optimization, we introduce the standard cross-entropy loss function for response generation as follows,

ℒ_𝑔 = −(1/𝑁_𝑟) ∑_{𝑙=1}^{𝑁_𝑟} log 𝑝(𝑟_𝑙 ∣ 𝑟_{<𝑙}, E),   (12)

where 𝑁_𝑟 denotes the length of the target response.

Training Procedure.
Input: training set 𝒫 for optimizing the model ℱ, hyperparameters {𝜆_1, 𝜆_2}.
Output: Parameters Θ.
1: Initialize parameters Θ and the memory bank ℳ.
2: repeat
3:   Randomly sample a batch from 𝒫.
4:   for each sample (𝒟, 𝑅, 𝑔, 𝑠) do
5:     Add the strategy pattern representation to the memory bank ℳ by Eqn. (6).
6:     Update the memory bank ℳ by Eqn. (7).
7:   end for
8:   Update Θ by optimizing the loss function in Eqn. (13).
9: until ℱ converges.

4.4. Model Training

Ultimately, we combine all the loss functions for optimizing our model as follows,

ℒ = ℒ_𝑔 + 𝜆_1 ℒ_𝑠 + 𝜆_2 ℒ_𝑟,   (13)

where ℒ is the final training objective, and 𝜆_1 and 𝜆_2 are non-negative hyperparameters used for balancing the effect of each loss function on the entire training process. The overall optimization procedure is briefly summarized in the Training Procedure above.

5. Experiments

5.1. Dataset

We conducted experiments on the ESConv dataset [3]. Each sample in the dataset is a dialogue between a help-seeker and a supporter. In addition to the context, it also contains rich information, such as the situation that the help-seeker is facing. It also provides the annotation of the strategy category used in every supporter's response. The dataset contains 1,300 long conversations and 38,350 utterances in total, with an average of 29.5 utterances per dialogue. For the data split, we followed the same setting as previous work [3, 5].

5.2. Implementation Details

Following the setting of the previous study [5], we utilized the encoder and decoder of the pretrained BART³ provided by HuggingFace [33] to initialize the parameters of the context encoder, the strategy predictor, and the decoder, respectively. The number of layers in both the encoder and the decoder is 6. To form a mini-batch, the input sequence length 𝐿 is unified to 512. The hidden dimension 𝑑 is 768. The category number 𝐺 equals 8. The maximum memory storage 𝑁_𝑚 is set to 64. 𝐾 equals 20. 𝜆_1 and 𝜆_2 are 0.3 and 0.1, respectively. The batch size is 16. We use the PPL metric on the validation set to monitor the training progress; empirically, it takes around 15 epochs to reach peak performance. During the generation stage, we use a maximum of 64 steps to decode the responses. We adopt the AdamW [34] optimizer with 𝛽_1 = 0.9 and 𝛽_2 = 0.999. The initial learning rate is 2e-5, and a linear learning rate scheduler with 100 warm-up steps is used to reduce the learning rate progressively. We implemented the experiment code with PyTorch [35] version 1.8.1. All experiments were conducted on an NVIDIA Tesla V100 32GB.

³ https://huggingface.co/facebook/bart-base.

5.3. Evaluation Metrics

For a comprehensive evaluation, we conducted both automatic and human evaluations.

Automatic Evaluation. For the automatic evaluation, we adopt several mainstream metrics commonly used in dialogue response tasks: PPL (Perplexity), BLEU-{1,2,3,4} (B-{1,2,3,4}) [36], ROUGE-L (R-L) [37], METEOR (MT) [38], and CIDEr [39].

Human Evaluation. Apart from the automatic evaluation, we also consider human evaluation, as it has been reported [40] that automatic evaluation is sometimes unreliable for generation tasks. We first randomly select 70 testing samples for evaluation. Then, we employ 2 volunteers to manually choose which response outperforms the other. Every sample is annotated twice. For each case, the volunteers need to compare the generated texts of different models along the following four dimensions: 1) Fluency: which response
Table 1
Performance comparison under automatic evaluations. The best results are highlighted in bold.
Model PPL B-1 B-2 B-3 B-4 R-L MT CIDEr
MoEL 264.11 19.04 6.47 2.91 1.51 15.95 7.96 10.95
MIME 69.28 15.24 5.56 2.64 1.50 16.12 6.43 10.66
DialoGPT-Joint 15.71 17.39 5.59 2.03 1.18 16.93 7.55 11.86
BlenderBot-Joint 16.79 17.62 6.91 2.81 1.66 17.94 7.54 18.04
MISC 16.16 - 7.31 - 2.20 17.91 11.05 -
GLHG 15.67 19.66 7.57 3.74 2.13 16.37 - -
FADO 15.72 - 8.00 4.00 2.32 17.53 - -
PoKE 15.84 18.41 6.79 3.24 1.78 15.84 - -
MultiESC 15.41 21.65 9.18 4.99 3.09 20.41 8.84 29.98
MODERN 14.99 23.19 10.13 5.53 3.39 20.86 9.26 30.08
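The B-{1,2,3,4} columns in Table 1 follow the standard BLEU definition: clipped n-gram precision combined with a brevity penalty. As a rough, self-contained illustration (not the evaluation script used by the authors, which per Section 5.3 follows standard toolkits), sentence-level BLEU-N can be sketched as:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Minimal BLEU-N sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty. Real evaluations typically use
    corpus-level BLEU with smoothing (e.g., sacrebleu or NLTK)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0, fully disjoint strings score 0.0, and a correct but truncated candidate is discounted by the brevity penalty.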
(a) Dialogue Context:
Supporter: Good Afternoon. How are you doing today?
User: I am Ok thanks but have an unusual issue. I think my girlfriend may be cheating on me but I'm too scared to do anything about it as I don't want her to leave me. I feel ashamed of myself but I can't help it.
...
User: Yes it is but I should be leaving her and moving on...
Ground Truth: (Self-disclosure) I've dealt with the same problem with my partner more than once. I love him very much and I found him to be a 9 to my 5 as well.
w/o-Mem: (Reflection of Feelings) I am sorry to hear that. I am sorry you are feeling that way. I know how you feel.
MODERN: (Self-disclosure) I can understand how that would be difficult. I had a similar situation with my ex-boyfriend. I know how hard it can be to let go of someone.

(b) Dialogue Context:
Supporter: Hello, what can I help you with this evening?
User: I am feeling very depressed lately. Like a constant pressure that i don't know exactly what it is.
Supporter: Depression is extremely rough to deal with, I'm very sorry to hear you're going through that.
...
User: Do you have any tips on how to overcome it?
Ground Truth: (Self-disclosure) I believe there are special lamps that you could get to help. I personally have been trying to get up a little earlier in the morning and enjoy my coffee with the sunrise...
w/o-KG: (Providing Suggestions) I've been in that situation myself, and I've found that it's very easy to get depressed.
MODERN: (Providing Suggestions) I've found that taking a walk or sitting down to write out a list of things that you'd like to do helps to clear your mind.

Figure 3: Intuitive comparison of MODERN and two derivatives.
is more fluent, correct, and coherent in grammar and syntax; 2) Relevance: which response talks more relevantly regarding the current dialogue context; 3) Empathy: which response reacts with more appropriate emotion according to the user's emotional state; and 4) Information: which response provides more suggestive information to help solve the problem. To further control the quality of the evaluation, we also invite an inspector to randomly sample and double-check 10% of the rating results.

5.4. Model Comparison & Analysis

To validate the effectiveness of our model, we compare it with several state-of-the-art baselines:

• MoEL [22]. This method adopts multiple decoders as different listeners for different emotions. The outputs of the decoders are softly combined to generate the response.
• MIME [21]. This model shares the same architecture as MoEL and extends it to mimic the speaker's emotion.
• DialoGPT-Joint [3]. This model is built on the pre-trained dialog agent DialoGPT [41]. It first predicts a strategy and prepends a special token before the response sentence to control the generation under that strategy.
• BlenderBot-Joint [3]. This model adopts the same strategy prediction and generation scheme as DialoGPT-Joint. Unlike the former, it is built on the pre-trained conversational response generation model BlenderBot [42].
• MISC [24]. This model also adopts BlenderBot as the backbone. It leverages commonsense knowledge to enhance the understanding of the speaker's emotional state, and the response is generated conditioned on a mixture of the strategy distribution.
• GLHG [4]. This model has a global-to-local hierarchical graph structure. It models the global cause and the local intention of the speaker to provide more supportive responses.
• PoKE [23]. This work utilizes a Conditional Variational Autoencoder [43] to model the mixed strategy.
• FADO [25]. This work devises a dual-level feedback strategy selector to encourage or penalize strategies during the strategy selection process.
• MultiESC [5]. This work proposes lookahead heuristics to estimate the future strategy and captures users' subtle emotional expressions with the NRC VAD lexicon [44] for user state modeling.

Automatic Evaluation We compared our model with the above baselines using automatic metrics; the results are reported in Table 1. As we can see, 1) MODERN outperforms the baselines on most metrics, which is powerful proof of the effectiveness of the proposed method. 2) The models with a BART backbone (MultiESC and MODERN) surpass the baselines with a BlenderBot [42] backbone (BlenderBot-Joint, MISC, GLHG, FADO, and PoKE) across most of the metrics, despite the latter being pretrained on empathy-related data. One possible explanation is that the ESConv task requires sophisticated linguistic knowledge (e.g., correct grammar and wording appropriate to the current situation) in addition to empathetic ability. 3) Our MODERN consistently exceeds MultiESC with the same BART backbone. This suggests that a BART model with large-scale pretraining still requires strategy pattern information and knowledge (emotion and concepts) to further facilitate supportive response generation.

Human Evaluation For human evaluation, we report the comparison results between our model and the two best baselines (FADO and MultiESC) in Table 2. In particular, for each pair of model comparisons and each metric, we show the number of samples where our model achieves better (denoted as "Win"), equal (denoted as "Tie"), and worse (denoted as "Lose") performance compared with the baselines. As seen, MODERN outperforms all baselines across the different evaluation metrics, as the number of "Win" cases is always significantly larger than that of "Lose" cases in each pair of model comparisons, which is consistent with the results in Table 1. In addition, the number of "Win" cases is the largest for the Information metric compared with the other metrics, which demonstrates that integrating context-related concepts can supply meaningful information for emotional support.

Table 2
The human evaluation results in four dimensions.

Comparisons    Aspects      Win    Tie    Lose
vs. FADO       Fluency      37.0   24.0    9.0
               Relevance    35.5   20.5   14.0
               Empathy      34.0   16.5   19.5
               Information  37.5   17.0   15.5
vs. MultiESC   Fluency      30.5   24.5   15.0
               Relevance    32.5   20.5   17.0
               Empathy      35.0   19.5   15.5
               Information  37.5   21.0   11.5

5.5. Ablation Study

We compare the original MODERN model with the following derivatives to demonstrate that all the designed modules are essential for the ESConv task. 1) w/o-ℒ𝑠. To show the benefit of the constraint for strategy prediction, we removed the corresponding loss function by setting 𝜆1 = 0 in Eqn. (13). 2) w/o-ℒ𝑟. To show the effect of the auxiliary strategy classification task, we removed the corresponding loss function by setting 𝜆2 = 0 in Eqn. (13). 3) w/o-Mem. In this derivative, we disabled the memory module, which stores pattern representations of different strategies. 4) w/o-Emo. In this derivative, we removed the change-aware emotion detection module. And 5) w/o-KG. We discarded the context-related concepts reasoning and selection component.

Table 3
Experimental results of the ablation study.

Model     PPL    B-1    B-2    B-3   B-4   R-L    MT    CIDEr
w/o-ℒ𝑠    15.88  20.25   8.61  4.68  2.91  20.11  8.44  24.32
w/o-ℒ𝑟    15.32  21.69   9.31  5.06  3.16  20.57  8.87  28.91
w/o-Mem   15.84  21.24   8.90  4.70  2.87  20.21  8.63  27.55
w/o-Emo   15.91  20.35   8.46  4.48  2.72  19.79  8.27  24.01
w/o-KG    15.58  21.56   9.08  4.82  2.96  19.98  8.74  26.83
MODERN    14.99  23.19  10.13  5.53  3.39  20.86  9.26  30.08

We provide the ablation study results on the ESConv dataset in Table 3 in terms of all metrics. From this table, we have the following observations: 1) Our MODERN consistently outperforms w/o-Mem, especially on the BLEU metrics (B-{1,2,3,4}), which suggests that the memory-enhanced strategy modeling module can provide sufficient linguistic pattern references and hence boost the performance of generating responses in accordance with specific strategy categories. 2) MODERN exceeds w/o-ℒ𝑠 across all metrics. This verifies that it is indispensable to constrain the model to predict and respond under proper strategies. 3) w/o-ℒ𝑟 obtains a slightly worse result than MODERN, which demonstrates that the strategy classification auxiliary task indeed helps guide the pattern representation learning. And 4) w/o-Emo and w/o-KG both perform worse than MODERN, which demonstrates the importance of change-aware emotion and context-related concepts. Notably, w/o-KG surpasses w/o-Emo. One possible explanation is that being aware of the dynamic emotional changes during the conversation facilitates the model in providing empathy and emotional support accordingly.
5.6. Case Study

We illustrate several conversations in the test set to get an intuitive understanding of our model in Figure 3. We show two samples and compare the responses generated by MODERN and two derivatives, w/o-Mem and w/o-KG. As can be seen in case (a), MODERN manages to respond with the strategy of Self-disclosure and generates a contextually appropriate response. Equipped with the memory-enhanced strategy modeling module, MODERN shares a similar experience closely related to the seeker's problem, a relationship issue. In contrast, the w/o-Mem model generates a plain and monotonous response that is not very relevant to the user's current issue. The other case (b) demonstrates how MODERN benefits from external knowledge. Based on the situation that the seeker mentions feeling depressed, MODERN leverages the context-related concepts and effectively associates this emotional status with practical suggestions (taking a walk or sitting down to write...). Without relevant knowledge, the response generated by the w/o-KG derivative is relatively vague and less specific, which is insufficient to benefit the user's situation.

6. Conclusions

In this paper, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation, dubbed MODERN, which can perceive fine-grained emotional changes in the conversation, utilize concepts from a knowledge graph to facilitate generating responses with practical suggestions, and model concrete strategy semantic patterns with a memory bank mechanism. Both automatic and human evaluation results show that our model surpasses the state-of-the-art methods in emotional support conversation. In addition, the ablation study demonstrates the effectiveness of each component of our model.

Limitations

The ESConv task requires the dialogue agent to reveal some information about itself. For example, one of the strategies, called Self-disclosure, expects the agent to cite its own experience. However, in our experiments, we observed that the current model often struggles to maintain a consistent personality. We speculate that this may be because the supporter role in the full training sample is provided by multiple individuals, so there is no uniform character experience and story, which leads to inconsistent personal experiences during the conversation. We believe that how to make the dialogue agent show coherent and unified personal information and experiences in the ESConv task deserves the attention of future work.

Ethical Considerations

The dataset used in our work is a publicly available dataset that has been widely used in the field of emotional support conversation. Sensitive and personally identifiable information was filtered during the construction of the dataset. In this paper, our model focuses on the informal emotional support provided in friends' daily chats and does not provide professional mental health diagnosis or counseling services. The use of this model should be avoided for patients with serious mental illnesses, such as in self-harm-related conversations, in order to prevent triggering serious consequences.

References

[1] Z. A. Green, F. Faizi, R. Jalal, Z. Zadran, Emotional support received moderates academic stress and mental well-being in a sample of Afghan university students amid COVID-19, International Journal of Social Psychiatry 68 (2022) 1748–1755.
[2] C.-W. Chang, F.-p. Chen, Relationships of family emotional support and negative family interactions with the quality of life among Chinese people with mental illness and the mediating effect of internalized stigma, Psychiatric Quarterly 92 (2021) 375–387.
[3] S. Liu, C. Zheng, O. Demasi, S. Sabour, Y. Li, Z. Yu, Y. Jiang, M. Huang, Towards emotional support dialog systems, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, ACL, 2021, pp. 3469–3483.
[4] W. Peng, Y. Hu, L. Xing, Y. Xie, Y. Sun, Y. Li, Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation, in: Proceedings of the International Joint Conference on Artificial Intelligence, ijcai.org, 2022, pp. 4324–4330.
[5] Y. Cheng, W. Liu, W. Li, J. Wang, R. Zhao, B. Liu, X. Liang, Y. Zheng, Improving multi-turn emotional support dialogue generation with lookahead strategy planning, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2022, pp. 3014–3026.
[6] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Press, 2017, pp. 4444–4451.
[7] Y. Ding, J. Liu, X. Zhang, Z. Yang, Dynamic tracking of state anxiety via multi-modal data and machine learning, Frontiers in Psychiatry 13 (2022).
[8] J. Greene, B. Burleson, Handbook of Communication and Social Interaction Skills, American Psychological Association, 2003.
[9] A. C. High, K. R. Steuber, An examination of support (in)adequacy: Types, sources, and consequences of social support among infertile women, Communication Monographs 81 (2014).
[10] B. Wei, S. Lu, L. Mou, H. Zhou, P. Poupart, G. Li, Z. Jin, Why do neural dialog systems generate short and meaningless replies? A comparison between dialog and translation, in: International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 7290–7294.
[11] Y. Liu, W. Bi, J. Gao, X. Liu, J. Yao, S. Shi, Towards less generic responses in neural conversation models: A statistical re-weighting method, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2018, pp. 2769–2774.
[12] C. E. Hill, Helping skills: Facilitating, exploration, insight, and action, American Psychological Association, 2009.
[13] C. Zheng, Y. Liu, W. Chen, Y. Leng, M. Huang, CoMAE: A multi-factor hierarchical framework for empathetic response generation, in: Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, ACL, 2021, pp. 813–824.
[14] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2020, pp. 7871–7880.
[15] H. Zhou, M. Huang, T. Zhang, X. Zhu, B. Liu, Emotional chatting machine: Emotional conversation generation with internal and external memory, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 2018, pp. 730–739.
[16] H. Rashkin, E. M. Smith, M. Li, Y. Boureau, Towards empathetic open-domain conversation models: A new benchmark and dataset, in: Proceedings of the Conference of the Association for Computational Linguistics, ACL, 2019, pp. 5370–5381.
[17] L. Shen, Y. Feng, CDL: Curriculum dual learning for emotion-controllable response generation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2020, pp. 556–566.
[18] M. Y. Chen, S. Li, Y. Yang, EmpHi: Generating empathetic responses with human-like intents, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2022, pp. 1063–1074.
[19] W. Kim, Y. Ahn, D. Kim, K. Lee, Emp-RFT: Empathetic response generation via recognizing feature transitions between utterances, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2022, pp. 4118–4128.
[20] Q. Li, H. Chen, Z. Ren, P. Ren, Z. Tu, Z. Chen, EmpDG: Multi-resolution interactive empathetic dialogue generation, in: Proceedings of the International Conference on Computational Linguistics, International Committee on Computational Linguistics, 2020, pp. 4454–4466.
[21] N. Majumder, P. Hong, S. Peng, J. Lu, D. Ghosal, A. F. Gelbukh, R. Mihalcea, S. Poria, MIME: Mimicking emotions for empathetic response generation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2020, pp. 8968–8979.
[22] Z. Lin, A. Madotto, J. Shin, P. Xu, P. Fung, MoEL: Mixture of empathetic listeners, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, ACL, 2019, pp. 121–132.
[23] X. Xu, X. Meng, Y. Wang, PoKE: Prior knowledge enhanced emotional support conversation with latent variable, CoRR abs/2210.12640 (2022).
[24] Q. Tu, Y. Li, J. Cui, B. Wang, J. Wen, R. Yan, MISC: A mixed strategy-aware model integrating COMET for emotional support conversation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2022, pp. 308–319.
[25] W. Peng, Z. Qin, Y. Hu, Y. Xie, Y. Li, FADO: Feedback-aware double controlling network for emotional support conversation, Knowledge-Based Systems 264 (2023) 110340.
[26] W. J. Reynolds, B. Scott, Empathy: A crucial component of the helping relationship, Journal of Psychiatric and Mental Health Nursing 6 (1999) 363–370.
[27] B. Liu, S. S. Sundar, Should machines express sympathy and empathy? Experiments with a health advice chatbot, Cyberpsychology, Behavior, and Social Networking 21 (2018) 625–636.
[28] T. Parkin, A. de Looy, P. Farrand, Greater professional empathy leads to higher agreement about decisions made in the consultation, Patient Education and Counseling 96 (2014) 144–150.
[29] Q. Li, P. Li, Z. Ren, P. Ren, Z. Chen, Knowledge bridging for empathetic dialogue generation, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 2022, pp. 10993–11001.
[30] Y. Liu, W. Maier, W. Minker, S. Ultes, Empathetic dialogue generation with pre-trained RoBERTa-GPT2 and external knowledge, in: Conversational AI for Natural Human-Centric Interaction: International Workshop on Spoken Dialogue System Technology, volume 943 of Lecture Notes in Electrical Engineering, Springer, 2021, pp. 67–81.
[31] L. Jing, X. Song, X. Lin, Z. Zhao, W. Zhou, L. Nie, Stylized data-to-text generation: A case study in the e-commerce domain, ACM Trans. Inf. Syst. (2023).
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[33] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019).
[34] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, CoRR abs/1711.05101 (2017).
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[36] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2002, pp. 311–318.
[37] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[38] A. Lavie, A. Agarwal, METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation, ACL, 2007, pp. 228–231.
[39] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 4566–4575.
[40] N. Schluter, The limits of automatic summarisation according to ROUGE, in: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, ACL, 2017, pp. 41–45.
[41] Y. Zhang, S. Sun, M. Galley, Y.-C. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, B. Dolan, DialoGPT: Large-scale generative pre-training for conversational response generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL, 2020, pp. 270–278.
[42] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, E. M. Smith, Y.-L. Boureau, J. Weston, Recipes for building an open-domain chatbot, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, ACL, 2021, pp. 300–325.
[43] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[44] S. M. Mohammad, Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2018, pp. 174–184.