<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analogies for Evaluating Emotion in LLM-Generated Utterances</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sadegh Jafari</string-name>
          <email>sadegh.jafari@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Els Lefever</string-name>
          <email>els.lefever@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Véronique Hoste</string-name>
          <email>veronique.hoste@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LT3, Language and Translation Technology Team, Ghent University</institution>
          ,
          <addr-line>Groot-Brittanniëlaan 45, 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Emotion plays a vital role in human communication, shaping not only language but also vocal tone, facial expression, and body posture. In the context of emotionally expressive text generation, the lack of reliable evaluation metrics remains a key challenge. This paper introduces a two-step evaluation framework using embedding analogy-based metrics to assess the emotional expressiveness of large language models (LLMs). In the first step, we evaluate the model's ability to neutralize emotional content from a given text while preserving its semantic meaning. In the second step, we test the model's capacity to reinject the intended emotion back into the neutralized text. Our experiments demonstrate that GPT-4.1 outperforms other models in both semantic retention and emotional reconstruction, while llama-3.3-70b-instruct performs best among open-source models. This work lays the foundation for future research on cross-modal affective computing, aiming to build emotionally intelligent agents capable of nuanced and empathetic communication across text, speech, and video.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotionally expressive text generation</kwd>
        <kwd>Evaluation metrics</kwd>
        <kwd>Large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Understanding and responding to human emotions is critical for AI systems operating in professional
settings, particularly in education, where teachers and students engage in complex emotional
interactions. In second language (L2) learning environments, emotionally supportive conversational agents
can help teachers foster a safe and motivating atmosphere, alleviating workload and enhancing the
student learning experience. Such systems require robust emotional understanding and generation
capabilities, which are still underdeveloped due to fundamental challenges in emotion evaluation.</p>
      <p>
        To function effectively in such roles, these systems must be capable of detecting and generating
emotional content in real-life, unscripted scenarios. This ability is especially important in high-stakes
domains such as healthcare, education, and crisis management. In such contexts, the ability to recognize
and respond to genuine human emotions, rather than acted or exaggerated affect, is crucial for building
trust, ensuring user well-being, and improving decision-making [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recent efforts to build empathically
aware AI systems rely heavily on the generation and interpretation of affective content [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. However,
evaluating the emotional quality of text generated by LLMs remains a fundamental challenge. Current
evaluation methods for emotionally expressive text are either expensive, when relying on human
annotations, or inadequate in quality and generalization when using existing automatic metrics [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ].
This limits their usefulness for scalable and robust assessment of emotion generation models.
      </p>
      <p>
        In this paper, we address the gap in effective and efficient evaluation of emotional text generation.
We propose an embedding-based evaluation pipeline that measures emotional alignment in
LLM-generated text without requiring human labels. Our method builds on analogical reasoning in emotion
embedding spaces, incorporating steps of emotion neutralization and re-injection to isolate and assess
the emotional expressiveness of different LLMs. We apply our evaluation framework to a range of
state-of-the-art LLMs and find that GPT-4.1 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] consistently produces the most emotionally aligned outputs.
Among open-source models, LLaMA-3.3-70B-Instruct [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] performs best. Our results demonstrate
that embedding-based emotion evaluation is a practical and scalable alternative to existing methods,
providing a reliable benchmark for future emotion generation tasks.
      </p>
      <p>2025 Workshop on ’AI for understanding human behavior in professional settings’ (BEHAIV), ECAI-2025, Bologna, Italy. CEUR Workshop Proceedings (ceur-ws.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent research has explored emotional text generation using LLMs, with a growing interest in
evaluating their ability to generate affectively aligned content. In this section, we review state-of-the-art
models and evaluation strategies for emotional control in LLMs. Dong et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced continuous
emotion vectors to steer LLM outputs toward target afective states. For evaluation, they generated two
synthetic datasets using GPT-4o-mini [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and assessed performance using perplexity, topic adherence
(via prompt engineering), emotion probability score (using the zero-shot classifier
facebook/bart-large-mnli [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), and an emotion absolute score derived from prompt-based heuristics. However, the
prompt-based scores were not evaluated or validated, as they simply relied on the LLM’s own response
to a scoring prompt. Ishikawa and Yoshino [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] explored emotional expression in LLMs using the
circumplex model of affect. They fine-tuned a model on the GoEmotions dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], but the resulting
classifier, sentimentmodel-sample-27go-emotion [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], achieved 58.9% accuracy, which was deemed
insufficient for further use in evaluation. To circumvent the limitations of discrete emotion classification,
they instead projected the generated outputs into the arousal–valence space. This alternative approach
was implemented to simplify the evaluation task, though it did not aim primarily at improving reliability.
      </p>
      <p>
        To improve emotional appropriateness in generation, Li et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed emotional
chain-of-thought prompting, grounded in Goleman’s emotional intelligence framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. They argued that
current emotion recognizers are inadequate for evaluation and introduced the Emotional Generation
Score (EGS), a prompt-based metric evaluated via GPT-3.5 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], supplemented by a small-scale human
study with three annotators. Wang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] incorporated commonsense reasoning to enhance
empathetic dialogue generation in LLMs. Using the EmpatheticDialogues [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and Emotional Support
Conversation datasets [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], they employed traditional metrics, BLEU [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], ROUGE-L [20], METEOR [21],
Distinct-n [22], and CIDEr [23], along with cosine similarity and human evaluation. Human evaluation
is valuable but costly and lacks repeatability. A disadvantage of existing automatic metrics is that they
often fall short, as lexical overlap between gold-standard and generated emotional expressions remains
high regardless of the actual emotional effectiveness. Janssens et al. [24] show that even advanced
models struggle to detect miscommunications from facial expressions in natural human-robot dialogue,
performing no better than chance. Their findings reveal that users often do not express confusion
in visibly detectable ways. This highlights the limitations of current affect recognition tools, which are
predominantly trained or fine-tuned on corpora of acted, non-naturalistic emotions, and reinforces the
need for more robust, context-aware emotion evaluation strategies.
      </p>
      <p>While these studies propose creative methods for controlling and evaluating emotional content, their
reliance on unstable, non-repeatable, or costly approaches leaves the quality assessment of generated
emotions an open challenge. Popular metrics like BLEU and ROUGE-L are often inadequate, as lexical
overlap between gold-standard and generated emotion expressions remains high regardless of emotional
success, rendering these metrics non-discriminative. Prompt-based LLM evaluation (e.g., using
GPT-4 to judge GPT-3) also suffers from bias and circularity, especially when assessing commercial or
closed-source systems. Lastly, human evaluation, while insightful, is costly and non-repeatable.</p>
      <p>Our study addresses these gaps by highlighting the urgent need for robust, repeatable, and
model-agnostic emotion evaluation strategies that can generalize across diverse generation setups. Unlike
prior works, we initiate a neutralization–reinjection process: first stripping emotions from the original
dataset, then prompting models to regenerate emotional variants. This setup enables us to evaluate
models based on their capacity to reintroduce appropriate emotions while preserving semantic content.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>In recent years, a growing number of multimodal emotion recognition datasets have been introduced to
support research in affective computing and emotionally intelligent systems. Notable among these is
the MELD dataset [25], which comprises multi-party conversations extracted from the Friends TV show.
Although MELD provides valuable dialogic emotion labels, it is based on acted and scripted television
content, which may not generalize well to spontaneous emotional behaviors. Similarly, the IEMOCAP
dataset [26] features dyadic interactions between professional actors performing scripted and
semi-scripted scenarios, offering rich annotations across modalities, but again lacks true spontaneity. Similar
corpora for Chinese include EmotionTalk [27] and M3ED [28], introducing large-scale, multimodal
emotion data from Chinese TV dramas and controlled dialogues. To address the lack of spontaneous
emotion data, the K-EmoCon dataset [29] captured natural interactions during real-time debates and
provided multi-perspective annotations, including physiological signals, but is limited in scale and does
not cover monologue settings. While these datasets advance the field significantly, they still reflect
contextual and cultural biases, often rely on acted emotions, and typically do not isolate modalities
during annotation, which limits their utility for fine-grained unimodal vs. multimodal analysis.</p>
      <p>These limitations, namely, the lack of spontaneous, non-acted emotional expressions, limited diversity
of monologue data, and insufficient attention to isolated modality annotations, motivate the use of
new datasets designed to better reflect natural emotional communication. The UniC [30] dataset is
a multimodal emotion dataset comprising 965 video clips sourced from YouTube, selected to capture
natural, spontaneous emotional expressions rather than acted performances. The videos primarily
include monologues such as book and movie reviews, where a single visible speaker expresses emotions
clearly in both speech and facial expressions. The dataset was constructed through a multi-step filtering
process using keyword searches, sentiment-based subtitle filtering, and manual validation. Each clip,
approximately 10 seconds long, was annotated independently across four modalities: text, audio,
silent video, and all modalities combined. Emotion annotations use both categorical and dimensional
frameworks. Initially based on 26 categorical emotion labels from Shaver et al. [31], these were reduced
to seven emotion clusters (joy, contentment, surprise, confusion, neutral, disappointment, and disgust)
via clustering analysis, alongside valence and arousal scores. Figure 1 shows a sample from the UniC
dataset.</p>
      <p>For our experiments, we focused on the text modality as a stepping stone to multimodal emotion
expression generation in follow-up research. Noteworthy to mention is that for this text modality, the
inter-annotator agreement (IAA) was highest, reaching a Fleiss’ kappa of 0.47 after annotator training
and emotion clustering. Among the different labeled emotions, confusion and surprise
were less reliably detected from text alone, highlighting the added value of multimodal signals. We
evaluated the text modality of the UniC dataset using several baseline models, for which we used 100%
of the dataset for testing. Due to the limited size of the dataset, we employed 5-fold cross-validation for
training and evaluating our custom model.</p>
      <p>As shown in Table 1, our model does not achieve the highest performance across any metric. Among
the evaluated models, michellejieli and j-hartmann are fine-tuned emotion classifiers based on the
DistilRoBERTa-base [34] architecture. The bart-large-mnli model, a zero-shot classifier built on the
BART-large [35] transformer, is used without fine-tuning. The gpt-4o-mini model, on the other hand, is
an LLM that predicts emotions through prompt-based reasoning. Notably, michellejieli achieves the
highest accuracy (0.4492) and precision (0.4205), while gpt-4o-mini performs best in recall (0.4496) and
F1 score (0.3579). Our approach, which combines BAAI-bge-m3 embeddings [36] with a tuned Random
Forest classifier [37], yields moderate but consistent results across all metrics, likely because it was trained
only on the UniC dataset (772 training samples). The classifier’s hyperparameters are shown in Table 2. It is
important to highlight that these relatively low performance scores are primarily due to the nature of
the dataset, which consists of natural, non-acted emotional expressions.</p>
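        <p>The embedding-plus-classifier baseline above can be sketched as follows. This is a minimal illustration, not the authors' code: random vectors stand in for the BAAI-bge-m3 sentence embeddings (at reduced dimensionality), the label set mirrors the seven UniC emotion clusters, and the Random Forest hyperparameters are placeholders for the tuned values in Table 2.</p>

```python
# Minimal sketch of the baseline: sentence embeddings + Random Forest,
# evaluated with 5-fold cross-validation as in the paper. Random vectors
# stand in for BGE-M3 embeddings; hyperparameters are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(772, 128))   # stand-in for 772 UniC text embeddings
y = rng.integers(0, 7, size=772)  # 7 emotion clusters (joy, ..., disgust)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```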
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Emotional text generation and its evaluation have been less explored through analogical methods,
despite their proven utility in measuring structured semantic relations. Chen et al. [38] systematically
analyzed vector-based analogies, confirming their reliability in capturing such relations, and Zhu and
De Melo [39] extended analogical reasoning to contextualized sentence embeddings, showing that
some models preserve analogical structures at the sentence level. To our knowledge, no prior work
has applied analogy-based evaluation specifically to the assessment of emotional expressiveness in
generated text.</p>
      <p>Building on these insights, our methodology employs analogy-based evaluation to quantify the
emotional expressiveness of LLMs. To rigorously isolate the model’s generative capabilities, we begin
by neutralizing the emotional content of each ground-truth (GS) text in our dataset using an LLM.
Following neutralization, the model is prompted to regenerate the emotional version of each text. The
neutralization step is crucial: by comparing the regenerated emotional outputs with the original GS
emotions, we ensure that any observed afective content arises from the model’s learned patterns rather
than residual cues in the input. Finally, we compute embedding-based similarity and analogy metrics
between the GS and regenerated texts, enabling quantification of both semantic fidelity and emotional
alignment.</p>
      <sec id="sec-4-1">
        <title>4.1. Embedding Evaluation Metric</title>
        <p>Before focusing on the embedding evaluation metric, we should mention that all embeddings were
calculated using the BGE-M3 [36] language model, and the 2D space was generated using the t-SNE
[40] method applied to the BGE-M3 embedding space.</p>
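        <p>The 2D projection used in the figures can be reproduced in outline as follows; random vectors stand in for the BGE-M3 embeddings, and the perplexity value is an assumption (it only needs to be smaller than the number of points).</p>

```python
# Project high-dimensional sentence embeddings to 2D with t-SNE for
# visualization. Random vectors stand in for BGE-M3 embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 64))  # 60 stand-in embeddings

xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(xy.shape)  # one (x, y) point per embedding
```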
        <p>In our embedding evaluation metric, we draw inspiration from the well-known linguistic analogy:
“king - man + woman ≈ queen”. This example illustrates how word embeddings can capture semantic
relationships through vector arithmetic [41]. By representing words as vectors in a high-dimensional
space, operations such as subtraction and addition can reveal underlying relationships, such as gender
or emotional tone. This property enables the assessment of emotional quality in generated text by
analyzing geometric relationships between word vectors, offering a quantitative measure of emotional
expressiveness in language models. Figure 2 visually demonstrates this concept, showing how vector
operations can encode semantic relationships in the embedding space.</p>
        <p>In Figure 2, the length and direction of the vectors E(king) − E(queen) and E(man) − E(woman)
appear to be the same. However, this does not reflect reality. In a realistic scenario, we would expect
the vector E(king) − E(man) + E(woman) to be close to E(queen). Using BGE-M3, we calculated
the embeddings for queen, king, man, and woman. As shown in Figure 3, the expression E(king) −
E(man) + E(woman) is not exactly equal to E(queen), but it is close.</p>
        <p>4.1.1. Cosine Similarity vs. Manhattan Distance</p>
        <p>A common method for measuring similarity between two vectors is the cosine similarity metric. However,
in analogy tasks, this method has a major limitation: the results can vary based on the operation order.
Consider the analogy: king is to queen as man is to woman. The similarity and distance scores for
various formulations are summarized in Table 3.</p>
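        <p>The analogy can be made concrete with hand-made two-dimensional vectors whose axes encode royalty and gender; this toy setup is our illustration, not the paper's data, and real BGE-M3 embeddings satisfy the relation only approximately.</p>

```python
# Toy "king - man + woman ≈ queen" analogy with hand-made 2D vectors
# (axes: royalty, maleness). With real embeddings the match is only
# approximate, as the paper's Figure 3 shows.
import numpy as np

E = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}
analogy = E["king"] - E["man"] + E["woman"]
# Nearest vocabulary item to the analogy vector by Manhattan (L1) distance:
nearest = min(E, key=lambda w: np.abs(E[w] - analogy).sum())
print(nearest)  # → queen
```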
        <p>As shown in Table 3, different operation orders produce varying cosine similarity scores, revealing
inconsistency in the evaluation of the cosine-based analogy. In contrast, the Manhattan distance
produces stable results across all permutations, indicating its robustness for analogy reasoning tasks.
Due to its consistent behavior, we use the Manhattan distance for the analogy evaluation in
our experiments.</p>
        <p>4.1.2. Real Emotional Example</p>
        <p>To better understand the role of emotional analogy in our framework, we illustrate a representative
example from our experiments. The goal is to analyze how vector arithmetic in the embedding space
can capture shifts in emotional expression between sentences. Figure 4 visualizes this example. The
corresponding text for each variable in the figure is as follows:
• joy = “joy”
• neutral = “neutral”
• neutral_sent = “It’s my first day as a student”
• joy_sent = “I’m so happy, it’s my first day as a student!”</p>
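        <p>The order-dependence of cosine similarity, and the corresponding stability of the Manhattan distance, can be verified directly: moving a term across the comparison leaves the L1 distance unchanged, since both arrangements reduce to the norm of the same difference vector, while the two cosine formulations generally differ. The vectors below are random stand-ins.</p>

```python
# Rearranging the analogy terms changes cosine similarity but not the
# Manhattan distance: ||(k - m + w) - q||_1 == ||k - (q + m - w)||_1,
# since both reduce to the L1 norm of the same difference vector.
import numpy as np

rng = np.random.default_rng(1)
king, queen, man, woman = rng.normal(size=(4, 8))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l1(a, b):
    return float(np.abs(a - b).sum())

d1 = l1(king - man + woman, queen)   # one operation order
d2 = l1(king, queen + man - woman)   # rearranged order
c1 = cos(king - man + woman, queen)
c2 = cos(king, queen + man - woman)
print(abs(d1 - d2), abs(c1 - c2))
```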
        <p>In Figure 4, we observe that the distance between the neutral and joy emotion embeddings is
relatively large. This discrepancy poses a challenge for emotional analogy, as the semantic distance
between the two sentence embeddings (neutral_sent and joy_sent) is significantly smaller than the
distance between their corresponding emotion labels. To mitigate this, we construct an analogy vector
using the following equation:
analogy_vector = neutral − neutral_sent + joy_sent   (1)</p>
        <p>This vector is then compared with the joy embedding. As shown in Figure 4, the analogy vector
lies closer to joy than to neutral, indicating that the analogy operation effectively captures the intended
emotional shift.</p>
        <p>Recognizing emotions in real user utterances is particularly challenging due to their subtle and
nuanced nature. As shown in Table 1, the best model achieves an F1-score of only 35.79%,
significantly lower than the 60.25% observed on acted datasets like MELD [25]. To further investigate this
phenomenon, we visualized the semantic structure of emotion representations using the BGE-M3
embedding model. Figure 5 shows a 2D projection of both the emotion label embeddings and the
average embeddings of real user utterances associated with each emotion. In this plot, each circle
represents an emotion label (e.g., joy, disgust, neutral), and each square denotes the average embedding
of utterances tagged with that emotion. Two sets of relationships are highlighted:
• Red lines connect the embedding of the label neutral to other emotion labels.
• Green lines connect the average embedding of utterances labeled as neutral to the average
embeddings of utterances for other emotions.</p>
        <p>The figure reveals that while the emotion labels are well-separated in the embedding space, indicating
clear semantic distinctions, the average embeddings of real user expressions are clustered more closely
together, especially around the neutral region. This supports the idea that emotional language in real
interactions is often more subtle, making automatic emotion detection more challenging in natural
contexts.</p>
        <p>To better understand how emotional meaning is encoded in sentence embeddings, we explore the
relationship between labeled and unlabeled emotional expressions. Specifically, we aim to approximate
the embedding of an emotionally tagged utterance using its neutral version and the emotional shift
encoded in a semantically aligned sentence. Here, labeled emotion refers to utterances that include
direct emotion labels from the gold-standard data in the UniC dataset (e.g., “I’m so happy, it’s my first day
as a student! (joy emotion)”), while unlabeled emotion refers to emotionally expressive content without
such tags but still conveying affect (e.g., “I’m so happy, it’s my first day as a student!”). Neutral versions
are afectively flat and omit emotional cues.</p>
        <p>Our approach applies an analogy-style vector transformation of the form: neutral − neutral_sent
+ joy_sent, where neutral_sent and joy_sent are the neutral and emotionally expressive versions of
the same utterance. This transformation enriches the affective content of the neutral-tagged embedding
by injecting the emotional variation from the unlabeled expression, while preserving the shared semantic
structure. The goal is to reduce the distance between the synthesized embedding and its explicitly
emotional counterpart, effectively revealing how emotional meaning can be reconstructed through
compositional operations. Figure 6 visualizes this transformation. The green arrow illustrates the
analogy vector described above, and the dashed lines indicate the proximity between the predicted and
actual emotion embeddings. The text associated with each vector in the figure is as follows:
• joy = “I’m so happy, it’s my first day as a student! (joy emotion)”</p>
        <p>• neutral = “It’s my first day as a student (neutral emotion)”
• neutral_sent = “It’s my first day as a student”
• joy_sent = “I’m so happy, it’s my first day as a student!”</p>
        <p>4.2. Emotion Embedding Extraction Using Prompted Text Templates</p>
        <p>As discussed in Section 4.1.1, we use the Manhattan distance as our similarity metric due to its sensitivity
to subtle semantic variations in the embedding space. This metric is essential for evaluating how
emotional content can be manipulated while preserving the original meaning. Our goal is to identify
the most effective prompt template for extracting emotion embeddings from textual descriptions.
These embeddings, denoted e_o, e_n, and e_t, represent the original, neutral, and target emotional
states, respectively. By inserting emotion-related phrases into structured prompt templates, we derive
these embeddings for use in analogy-based transformations. The transformation involves two steps:
neutralization and emotionalization. Let s_o, s_n, and s_t be the sentence embeddings for the original,
neutral, and target emotional versions of the same sentence, and let MD(x, y) denote the Manhattan
distance between embeddings x and y. The neutralization step tests whether removing the original
emotion embedding and inserting the neutral one moves the sentence embedding closer to s_n:</p>
        <p>MD(s_o, s_n) ≥ MD(s_o − e_o + e_n, s_n)   (2)</p>
        <p>The emotionalization step checks whether inserting the target emotion into the neutral embedding
moves it closer to s_t:</p>
        <p>MD(s_n, s_t) ≥ MD(s_n − e_n + e_t, s_t)   (3)</p>
        <p>These conditions validate whether modifying sentence embeddings via emotional vectors steers them
toward the intended emotional states. A transformation is deemed successful when both inequalities
are satisfied.</p>
        <p>System Prompt 1: Text Neutralization
Your task is to neutralize the text by removing emotional expressions.
The text is a transcription of a video.
The text may contain emotional expressions.
The text should be neutral and not contain any emotional expressions.
The text should be in the same language, format, style, tone, and context as the input text.
Please try to change the text as little as possible.
Please neutralize the following text: {text}
The original emotion of the text is: {emotion}
Please make sure to remove all emotional expressions from the text.</p>
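        <p>The two analogy conditions can be checked mechanically. In this sketch, s_o, s_n, and s_t denote sentence embeddings of the original, neutralized, and target-emotion versions, and e_o, e_n, e_t the emotion-prompt embeddings; this notation is our reading of the quantities in Equations 2 and 3, with random vectors standing in for BGE-M3 output.</p>

```python
# Analogy-success check: both inequalities (Eq. 2 and Eq. 3) must hold.
# Random vectors stand in for real BGE-M3 sentence/emotion embeddings.
import numpy as np

def md(a, b):
    """Manhattan (L1) distance between two embeddings."""
    return float(np.abs(a - b).sum())

def neutralization_ok(s_o, s_n, e_o, e_n):
    # Removing the original emotion and adding the neutral one should
    # move the original sentence embedding closer to the neutral one.
    return md(s_o - e_o + e_n, s_n) <= md(s_o, s_n)

def emotionalization_ok(s_n, s_t, e_n, e_t):
    # Injecting the target emotion should move the neutral sentence
    # embedding closer to the target emotional sentence.
    return md(s_n - e_n + e_t, s_t) <= md(s_n, s_t)

rng = np.random.default_rng(0)
s_o, s_n, s_t, e_o, e_n, e_t = rng.normal(size=(6, 16))
success = neutralization_ok(s_o, s_n, e_o, e_n) and emotionalization_ok(s_n, s_t, e_n, e_t)
print(success)
```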
        <p>System Prompt 2: Emotional Text Generation
Your task is to make the text more emotional by adding emotional expressions.
The text is a transcription of a video.
The text should be in the same language, format, style, tone, and context as the input text.
Please try to change the text as little as possible.
Don’t mention the emotion in the text directly.
Please add emotional expressions to the following text: {text}
The current emotion of the text is: neutral.
The target emotion of the text should be: {emotion}.</p>
        <p>To identify the most effective prompt template for extracting emotion embeddings, we evaluated five
candidate prompt formulations across several LLMs. These templates vary in how they contextualize
emotion labels with respect to the text, ranging from labeled structures (e.g., “joy emotion: {text}”) to
minimal expressions (e.g., just “joy”).</p>
        <p>Our evaluation follows a two-step analogy-based framework. In the neutralization step, we generated
neutral versions of emotional sentences using each LLM with a fixed system instruction based on
System Prompt 1. To extract the emotion embeddings used in Equation 2, we tested the five
emotion prompt templates by plugging them into an embedding encoder. In the emotionalization step,
we used System Prompt 2 to generate emotionalized sentences from neutral ones and evaluated how
well each emotion prompt template performed using Equation 3 with the target emotion embedding.
The following are the details about the emotion embedding prompts:
• Prompt 1: {emotion} emotion: {text}</p>
        <p>As shown in Table 4, we identify the best-performing emotion prompt template for each step of the
evaluation. Using the entire text-only UniC dataset for evaluation, we conduct experiments on two
tasks: neutralization and emotionalization. For neutralization, Prompt 3 achieves the highest analogy
satisfaction rates across most models. For emotionalization, Prompt 1 performs best, indicating its
effectiveness in reintroducing emotional content through embedding manipulation. These findings
suggest that different prompt styles may be optimal for extracting emotion embeddings depending on
the specific transformation goal.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis and Results</title>
      <p>
        Having decided on the Manhattan Distance to compare the embedding vectors (Section 4.1.1) and on
using distinct prompt templates for extracting emotion embeddings depending on the transformation
stage (Section 4.2), we set up an experiment in which our goal was to evaluate the impact of emotion
generation by comparing the original emotional data with the emotionally re-generated text. Specifically,
we used Prompt 3 for the neutralization stage and Prompt 1 for the emotionalization stage, as each
achieved the highest analogy satisfaction rates for their respective tasks across most models. To enable
a broad comparison, we evaluated a range of LLMs, including open-source models such as Gemma
[42], LLaMA-3 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and Mistral-NeMo [43], as well as commercial models like GPT-4.1 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
GPT-4o-Mini [
        <xref ref-type="bibr" rid="ref10">10</xref>
] from OpenAI. This mix allowed us to assess the effectiveness of emotion embedding
manipulation across both accessible, community-driven models and state-of-the-art proprietary systems.
All evaluations are on the UniC dataset’s text modality. The process consisted of the following two
main steps:
      </p>
      <sec id="sec-5-1">
        <title>5.1. Neutralization</title>
        <p>We used an LLM to neutralize the emotional content of the original text samples. This step aimed to
remove any labeled or unlabeled emotional signals, resulting in emotionally flat, semantically preserved
text. In this experiment, we used System Prompt 1. The following formulas were used in the tables to
evaluate the performance of different models. In these equations, A denotes the analogy vector.</p>
        <p>A = s_o − e_o + e_n
R1_c = cos(s_n, s_o), R2_c = cos(s_n, A)
R1_m = ‖s_n − s_o‖1, R2_m = ‖s_n − A‖1</p>
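        <p>For a single sample, the scores reduce to a few lines of code. The pairing of arguments follows our reading of Equation 2 (R1 compares the generated neutral sentence with the original, R2 compares it with the analogy vector), and random vectors again stand in for BGE-M3 embeddings.</p>

```python
# Per-sample neutralization scores: R1 compares the generated neutral
# sentence with the original sentence, R2 compares it with the analogy
# vector A = s_o - e_o + e_n. Vectors are random stand-ins.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l1(a, b):
    return float(np.abs(a - b).sum())

rng = np.random.default_rng(0)
s_o, s_n, e_o, e_n = rng.normal(size=(4, 16))

A = s_o - e_o + e_n                      # analogy vector
r1_c, r2_c = cos(s_n, s_o), cos(s_n, A)  # cosine scores
r1_m, r2_m = l1(s_n, s_o), l1(s_n, A)    # Manhattan scores
print(r1_c - r2_c, r1_m - r2_m)          # the deltas reported per model
```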
        <p>As shown in Table 5, we evaluate each LLM’s ability to perform emotional neutralization based on
how well the transformed sentence embedding aligns with the original emotional context vector. The
evaluation uses both cosine similarity and Manhattan distance to capture diferent aspects of embedding
relationships. In both cosine similarity and Manhattan distance metrics, GPT-4.1 demonstrates the most
controlled and semantically faithful emotion neutralization among all evaluated models. While
llama-3.3-70b-instruct achieves the highest post-neutralization cosine similarity (R2_c = 0.9746) and lowest
Manhattan distance (R2_m = 5.33), GPT-4.1 yields the smallest changes in both cosine (Δ_c = 0.0410)
and Manhattan metrics (Δ_m = −3.297), indicating minimal semantic distortion during transformation.
This suggests that GPT-4.1 preserves original sentence meaning more effectively while removing
emotional content. Overall, while high-capacity open-source models like mistral-nemo-12b-instruct
are increasingly competitive, commercial models such as GPT-4.1 still lead in performance when
performing nuanced tasks like emotion neutralization.</p>
        <p>To evaluate the consistency between different similarity metrics, we computed the Pearson correlation
[44] between the Manhattan-distance and cosine-similarity values for both the R2 and R1 scores. As shown in
Figure 7, there is a very strong negative correlation between the two measures for both R2 (r = −0.99365)
and R1 (r = −0.99501). These results indicate that as the Manhattan distance increases, the cosine
similarity decreases almost linearly, suggesting that both metrics capture highly similar trends in
evaluating the transcripts, albeit in opposite directions due to their different mathematical formulations.</p>
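        <p>The metric-agreement check can be reproduced with numpy's corrcoef, which returns the Pearson correlation matrix of its inputs; the score lists below are toy values, not the paper's results.</p>
        <preformat>
```python
import numpy as np

def metric_correlation(manhattan_scores, cosine_scores):
    # Pearson correlation between paired Manhattan-distance and
    # cosine-similarity scores computed over the same transcripts
    return float(np.corrcoef(manhattan_scores, cosine_scores)[0, 1])

# Perfectly anti-correlated toy scores give r = -1.0:
r = metric_correlation([1.0, 2.0, 3.0], [0.9, 0.6, 0.3])
```
        </preformat>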
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Emotion Injection</title>
        <p>In the emotion injection phase, we used an LLM to reintroduce a target emotion (e.g., joy) into the
neutralized text. To guide this process, we prompted the model using system-level instructions and
emotion-specific cues. The goal was to generate emotionally expressive text that closely resembles
the original emotional content while preserving the core semantics of the neutralized version. In this
experiment, we used System Prompt 2. To perform the re-injection, we compute the analogy vector v
based on the following relationship:</p>
        <p>v = e_neut − e_neu + e_emo
R1c = cos(e_neut, e_inj), R1m = ‖e_neut − e_inj‖₁
R2c = cos(e_inj, v), R2m = ‖e_inj − v‖₁
R3c = cos(e_inj, e_orig), R3m = ‖e_inj − e_orig‖₁</p>
        <p>Here e_neut, e_inj, and e_orig denote the embeddings of the neutralized, emotion-injected, and original
utterances, and e_emo and e_neu the embeddings of the target emotion label and the neutral label.</p>
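        <p>Under the same reading as in the neutralization step, the three injection relations can be sketched as follows; the names (injection_scores, e_inj for the embedding of the emotion-injected utterance, and so on) are illustrative.</p>
        <preformat>
```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def manhattan(a, b):
    # L1 (Manhattan) distance between two embedding vectors
    return float(np.abs(a - b).sum())

def injection_scores(e_neut, e_inj, e_orig, e_emo, e_neu):
    """R1/R2/R3 relations for the emotion-injection step (illustrative names)."""
    v = e_neut - e_neu + e_emo  # analogy vector for re-injection
    return {
        "R1c": cosine(e_neut, e_inj), "R1m": manhattan(e_neut, e_inj),
        "R2c": cosine(e_inj, v),      "R2m": manhattan(e_inj, v),
        "R3c": cosine(e_inj, e_orig), "R3m": manhattan(e_inj, e_orig),
    }
```
        </preformat>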
        <p>Table 7 shows the performance of various LLMs in emotion injection. GPT-4.1 achieves the best
results overall, with the lowest distances and highest cosine similarities (e.g., R3m = 9.99, R3c = 0.9130),
indicating strong emotional alignment and reinjection ability. In contrast, gemma-3-1b-it performs the
weakest, especially in re-injection quality (R3c = 0.6990). While commercial models like GPT-4.1 and
GPT-4o-mini outperform others due to superior training and architecture, larger open-source models
such as LLaMA-3.3-70B and Mistral-Nemo-12B show competitive performance, suggesting that open
models can still be effective in emotion-aware tasks.</p>
        <p>Table 8 reports emotion-wise performance of GPT-4.1 on the emotion injection task. The model
performs consistently across all emotions, with strong alignment scores (e.g., R3c &gt; 0.88) and small
Manhattan distances. Notably, the neutral class achieves the best results (R3c = 0.9280, R1m = 2.30),
which is expected since the model is converting a neutralized utterance back to a neutral form, making
the reinjection task considerably easier in this case.</p>
        <p>To further validate the metric alignment, we conducted a correlation analysis between Manhattan
distance and cosine similarity across the three relations (R1, R2, R3). All pairs exhibit strong negative
correlations below −0.9819, confirming the inverse relationship between the two metrics (see Figure 8).
These results confirm that increased directional similarity corresponds closely with reduced embedding
distance, validating the use of both metrics to quantify emotional fidelity in the reinjection process.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Works</title>
      <p>In this study, we explored the capability of LLMs to manipulate and generate emotionally expressive
text through a two-step process: emotional neutralization followed by targeted emotion injection. Using
embedding-based similarity metrics such as Manhattan distance and cosine similarity, we quantitatively
evaluated the extent to which LLMs can remove and reintroduce specific emotions while preserving the
semantic core of the original text. Our findings indicate that GPT-4.1, a commercial model, consistently
outperforms other models in maintaining semantic fidelity and accurately reconstructing emotional
nuances. Among open-source models, LLaMA-3.3-70B-Instruct demonstrates the best performance in
our experiments, making it a strong candidate for accessible, open research in emotion-aware language
generation. These results underscore the effectiveness of large-scale LLMs for emotion control and
expression in text and provide a foundation for broader affective computing applications. Although our
current focus is on the text modality, the proposed framework is explicitly designed to extend to speech
and visual channels by leveraging shared embedding spaces. In particular, recent work by Jha et al. [45],
which builds upon the Platonic Representation Hypothesis introduced by Huh et al. [46], demonstrates
that as neural networks scale, internal representations across modalities converge toward a shared
statistical model of reality. This convergence enables cross-modal affective analysis without requiring
paired training data, providing a strong theoretical and practical foundation for our future work.</p>
      <p>In addition to aligning emotional content across text, speech, and visual modalities within unified
embedding spaces, our future efforts will also involve improved prompt engineering and the development
of more expressive embedding models to enhance emotional transformation capabilities. As a concrete
application, we aim to develop a multimodal empathetic conversational agent for second language
(L2) learning. By engaging students in emotionally supportive interactions, such agents can foster
psychologically safe and motivating learning environments while assisting teachers in managing
affective dynamics in the classroom.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research received funding from the Flemish Government under the Flanders Artificial Intelligence
Research program (FAIR) (174K02325).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-4 and Grammarly for grammar and
spelling checking.</p>
    </sec>
    <sec id="sec-9">
      <title>References (continued)</title>
      <p>[20] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
[21] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
[22] J. Li, M. Galley, C. Brockett, J. Gao, B. Dolan, A diversity-promoting objective function for neural conversation models, arXiv preprint arXiv:1510.03055 (2015).
[23] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[24] R. Janssens, J. De Bock, S. Labat, E. Verhelst, V. Hoste, T. Belpaeme, Why robots are bad at detecting their mistakes: Limitations of miscommunication detection in human-robot dialogue, in: IEEE RO-MAN 2025 conference, 2025.
[25] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, Meld: A multimodal multi-party dataset for emotion recognition in conversations, arXiv preprint arXiv:1810.02508 (2018).
[26] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, Iemocap: Interactive emotional dyadic motion capture database, Language Resources and Evaluation 42 (2008) 335–359.
[27] H. Sun, X. Wang, J. Zhao, S. Zhao, J. Zhou, H. Wang, J. He, A. Kong, X. Yang, Y. Wang, et al., Emotiontalk: An interactive chinese multimodal emotion dataset with rich annotations, arXiv preprint arXiv:2505.23018 (2025).
[28] J. Zhao, T. Zhang, J. Hu, Y. Liu, Q. Jin, X. Wang, H. Li, M3ed: Multi-modal multi-scene multi-label emotional dialogue database, arXiv preprint arXiv:2205.10237 (2022).
[29] C. Y. Park, N. Cha, S. Kang, A. Kim, A. H. Khandoker, L. Hadjileontiadis, A. Oh, Y. Jeong, U. Lee, K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations, Scientific Data 7 (2020) 293.
[30] Q. Du, S. Labat, T. Demeester, V. Hoste, UniC: A dataset for emotion analysis of videos with multimodal and unimodal labels, Language Resources and Evaluation (2025) 1–36.
[31] P. Shaver, J. Schwartz, D. Kirson, C. O’Connor, Emotion knowledge: further exploration of a prototype approach, Journal of Personality and Social Psychology 52 (1987) 1061.
[32] J. Hartmann, Fine-tuned DistilRoBERTa-base for Emotion Classification, https://huggingface.co/michellejieli/emotion_text_classifier/, 2022. Accessed: 2025-05-07.
[33] J. Hartmann, Emotion english distilroberta-base, https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/, 2022.
[34] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[35] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
[36] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. arXiv:2402.03216.
[37] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[38] D. Chen, J. C. Peterson, T. L. Griffiths, Evaluating vector-space models of analogy, arXiv preprint arXiv:1705.04416 (2017).
[39] X. Zhu, G. De Melo, Sentence analogies: Linguistic regularities in sentence embeddings, in: Proceedings of the 28th international conference on computational linguistics, 2020, pp. 3389–3400.
[40] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
[41] C. Allen, T. Hospedales, Analogies explained: Towards understanding word embeddings, in: International Conference on Machine Learning, PMLR, 2019, pp. 223–231.
[42] G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al., Gemma 3 technical report, arXiv preprint arXiv:2503.19786 (2025).
[43] M. AI, Mistral nemo, https://mistral.ai/news/mistral-nemo/, 2024. Accessed: September 23, 2024.
[44] K. Pearson, VII. Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London 58 (1895) 240–242.
[45] R. Jha, C. Zhang, V. Shmatikov, J. X. Morris, Harnessing the universal geometry of embeddings, arXiv preprint arXiv:2505.12540 (2025).
[46] M. Huh, B. Cheung, T. Wang, P. Isola, The platonic representation hypothesis, arXiv preprint arXiv:2405.07987 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ortega-Ochoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arguedas</surname>
          </string-name>
          , T. Daradoumis,
          <article-title>Empathic pedagogical conversational agents: a systematic literature review</article-title>
          ,
          <source>British Journal of Educational Technology</source>
          <volume>55</volume>
          (
          <year>2024</year>
          )
          <fpage>886</fpage>
          -
          <lpage>909</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanjeewa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Apputhurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wickramasinghe</surname>
          </string-name>
          , D. Meyer,
          <article-title>Empathic conversational agent platform designs and their evaluation in the context of mental health: Systematic review</article-title>
          ,
          <source>JMIR Mental Health</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <article-title>e58974</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Algesheimer</surname>
          </string-name>
          ,
          <article-title>Empathic voice assistants: Enhancing consumer responses in voice commerce</article-title>
          ,
          <source>Journal of Business Research</source>
          <volume>175</volume>
          (
          <year>2024</year>
          )
          <fpage>114566</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Huppert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perry</surname>
          </string-name>
          , et al.,
          <article-title>Considering the role of human empathy in ai-driven therapy</article-title>
          ,
          <source>JMIR Mental Health</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <article-title>e56529</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Controllable emotion generation with emotion vectors</article-title>
          ,
          <source>arXiv preprint arXiv:2502.04075</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.-n.</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yoshino</surname>
          </string-name>
          ,
          <article-title>Ai with emotions: Exploring emotional expressions in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2504.14706</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Sibyl:
          <article-title>Empowering empathetic dialogue generation in large language models via sensible and visionary commonsense inference</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Computational Linguistics</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] OpenAI,
          <source>Introducing gpt-4.1</source>
          , https://platform.openai.com/docs/models/gpt-4.1,
          <year>2025</year>
          . Accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Introducing gpt-4o-mini, https://platform.openai.com/docs/models/gpt-4o-mini,
          <year>2025</year>
          . Accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Facebook-AI</surname>
          </string-name>
          , facebook/bart-large-mnli, https://huggingface.co/facebook/bart-large-mnli,
          <year>2020</year>
          . Accessed: 2025-06-12.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Movshovitz-Attias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cowen</surname>
          </string-name>
          , G. Nemade,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <article-title>Goemotions: A dataset of fine-grained emotions</article-title>
          ,
          <source>arXiv preprint arXiv:2005.00547</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Khan</surname>
          </string-name>
          , sentiment-model-sample-27go-emotion, https://huggingface.co/jkhan447/sentiment-model-sample-27go-emotion,
          <year>2022</year>
          . Accessed: February 11,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , L. Nie,
          <article-title>Enhancing emotional generation capability of large language models via emotional chain-of-thought</article-title>
          ,
          <source>arXiv preprint arXiv:2401.06836</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Goleman</surname>
          </string-name>
          ,
          <article-title>Emotional intelligence: Why it can matter more than IQ</article-title>
          , Bantam,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <source>Introducing gpt-3.5</source>
          , https://platform.openai.com/docs/models/gpt-3.5-turbo,
          <year>2025</year>
          . Accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Boureau</surname>
          </string-name>
          ,
          <article-title>Towards empathetic open-domain conversation models: A new benchmark and dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1811.00207</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Demasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sabour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Towards emotional support dialog systems</article-title>
          ,
          <source>arXiv preprint arXiv:2106.01144</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>