<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analogies for Evaluating Emotion in LLM-Generated Utterances</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sadegh Jafari</string-name>
          <email>sadegh.jafari@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Els Lefever</string-name>
          <email>els.lefever@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Véronique Hoste</string-name>
          <email>veronique.hoste@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LT3, Language and Translation Technology Team, Ghent University</institution>
          ,
          <addr-line>Groot-Brittanniëlaan 45, 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Emotion plays a vital role in human communication, shaping not only language but also vocal tone, facial expression, and body posture. In the context of emotionally expressive text generation, the lack of reliable evaluation metrics remains a key challenge. This paper introduces a two-step evaluation framework using embedding analogy-based metrics to assess the emotional expressiveness of large language models (LLMs). In the first step, we evaluate the model's ability to neutralize emotional content from a given text while preserving its semantic meaning. In the second step, we test the model's capacity to reinject the intended emotion back into the neutralized text. Our experiments demonstrate that GPT-4.1 outperforms other models in both semantic retention and emotional reconstruction, while llama-3.3-70b-instruct performs best among open-source models. This work lays the foundation for future research on cross-modal affective computing, aiming to build emotionally intelligent agents capable of nuanced and empathetic communication across text, speech, and video.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotionally expressive text generation</kwd>
        <kwd>Evaluation metrics</kwd>
        <kwd>Large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Understanding and responding to human emotions is critical for AI systems operating in professional
settings, particularly in education, where teachers and students engage in complex emotional
interactions. In second language (L2) learning environments, emotionally supportive conversational agents
can help teachers foster a safe and motivating atmosphere, alleviating workload and enhancing the
student learning experience. Such systems require robust emotional understanding and generation
capabilities, which are still underdeveloped due to fundamental challenges in emotion evaluation.</p>
      <p>
        To function effectively in such roles, these systems must be capable of detecting and generating
emotional content in real-life, unscripted scenarios. This ability is especially important in high-stakes
domains such as healthcare, education, and crisis management. In such contexts, the ability to recognize
and respond to genuine human emotions, rather than acted or exaggerated affect, is crucial for building
trust, ensuring user well-being, and improving decision-making [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recent efforts to build empathically
aware AI systems rely heavily on the generation and interpretation of affective content [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. However,
evaluating the emotional quality of text generated by LLMs remains a fundamental challenge. Current
evaluation methods for emotionally expressive text are either expensive, when relying on human
annotations, or inadequate in quality and generalization when using existing automatic metrics [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ].
This limits their usefulness for scalable and robust assessment of emotion generation models.
      </p>
      <p>
        In this paper, we address the gap in effective and efficient evaluation of emotional text generation.
We propose an embedding-based evaluation pipeline that measures emotional alignment in
LLM-generated text without requiring human labels. Our method builds on analogical reasoning in emotion
embedding spaces, incorporating steps of emotion neutralization and re-injection to isolate and assess
the emotional expressiveness of different LLMs. We apply our evaluation framework to a range of
state-of-the-art LLMs and find that GPT-4.1 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] consistently produces the most emotionally aligned outputs.
Among open-source models, LLaMA-3.3-70B-Instruct [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] performs best. Our results demonstrate
that embedding-based emotion evaluation is a practical and scalable alternative to existing methods,
providing a reliable benchmark for future emotion generation tasks.
      </p>
      <p>2025 Workshop on ’AI for understanding human behavior in professional settings’ (BEHAIV), ECAI-2025, Bologna, Italy. CEUR Workshop Proceedings (ceur-ws.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent research has explored emotional text generation using LLMs, with a growing interest in
evaluating their ability to generate affectively aligned content. In this section, we review state-of-the-art
models and evaluation strategies for emotional control in LLMs. Dong et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced continuous
emotion vectors to steer LLM outputs toward target afective states. For evaluation, they generated two
synthetic datasets using GPT-4o-mini [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and assessed performance using perplexity, topic adherence
(via prompt engineering), emotion probability score (using the zero-shot classifier
facebook/bart-large-mnli [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), and an emotion absolute score derived from prompt-based heuristics. However, the
prompt-based scores were not evaluated or validated, as they simply relied on the LLM’s own response
to a scoring prompt. Ishikawa and Yoshino [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] explored emotional expression in LLMs using the
circumplex model of affect. They fine-tuned a model on the GoEmotions dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], but the resulting
classifier, sentimentmodel-sample-27go-emotion [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], achieved 58.9% accuracy, which was deemed
insufficient for further use in evaluation. To circumvent the limitations of discrete emotion classification,
they instead projected the generated outputs into the arousal–valence space. This alternative approach
was implemented to simplify the evaluation task, though it did not aim primarily at improving reliability.
      </p>
      <p>
        To improve emotional appropriateness in generation, Li et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed emotional
chain-of-thought prompting, grounded in Goleman’s emotional intelligence framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. They argued that
current emotion recognizers are inadequate for evaluation and introduced the Emotional Generation
Score (EGS), a prompt-based metric evaluated via GPT-3.5 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], supplemented by a small-scale human
study with three annotators. Wang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] incorporated commonsense reasoning to enhance
empathetic dialogue generation in LLMs. Using the EmpatheticDialogues [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and Emotional Support
Conversation datasets [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], they employed traditional metrics, BLEU [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], ROUGE-L [20], METEOR [21],
Distinct-n [22], and CIDEr [23], along with cosine similarity and human evaluation. Human evaluation
is valuable but costly and lacks repeatability. A disadvantage of existing automatic metrics is that they
often fall short, as lexical overlap between gold-standard and generated emotional expressions remains
high regardless of the actual emotional effectiveness. Janssens et al. [24] show that even advanced
models struggle to detect miscommunications from facial expressions in natural human-robot dialogue,
performing no better than chance. Their findings reveal that users often do not express confusion
in visibly detectable ways. This highlights the limitations of current affect recognition tools, which are
predominantly trained or fine-tuned on corpora of acted, non-naturalistic emotions, and reinforces the
need for more robust, context-aware emotion evaluation strategies.
      </p>
      <p>While these studies propose creative methods for controlling and evaluating emotional content, their
reliance on unstable, non-repeatable, or costly approaches leaves the quality assessment of generated
emotions an open challenge. Popular metrics like BLEU and ROUGE-L are often inadequate, as lexical
overlap between gold-standard and generated emotion expressions remains high regardless of emotional
success, rendering these metrics non-discriminative. Prompt-based LLM evaluation (e.g., using
GPT-4 to judge GPT-3) also suffers from bias and circularity, especially when assessing commercial or
closed-source systems. Lastly, human evaluation, while insightful, is costly and non-repeatable.</p>
      <p>Our study addresses these gaps by highlighting the urgent need for robust, repeatable, and
model-agnostic emotion evaluation strategies that can generalize across diverse generation setups. Unlike
prior works, we initiate a neutralization–reinjection process: first stripping emotions from the original
dataset, then prompting models to regenerate emotional variants. This setup enables us to evaluate
models based on their capacity to reintroduce appropriate emotions while preserving semantic content.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>In recent years, a growing number of multimodal emotion recognition datasets have been introduced to
support research in affective computing and emotionally intelligent systems. Notable among these is
the MELD dataset [25], which comprises multi-party conversations extracted from the Friends TV show.
Although MELD provides valuable dialogic emotion labels, it is based on acted and scripted television
content, which may not generalize well to spontaneous emotional behaviors. Similarly, the IEMOCAP
dataset [26] features dyadic interactions between professional actors performing scripted and
semi-scripted scenarios, offering rich annotations across modalities, but again lacks true spontaneity. Similar
corpora for Chinese include EmotionTalk [27] and M3ED [28], introducing large-scale, multimodal
emotion data from Chinese TV dramas and controlled dialogues. To address the lack of spontaneous
emotion data, the K-EmoCon dataset [29] captured natural interactions during real-time debates and
provided multi-perspective annotations, including physiological signals, but is limited in scale and does
not cover monologue settings. While these datasets advance the field significantly, they still reflect
contextual and cultural biases, often rely on acted emotions, and typically do not isolate modalities
during annotation, which limits their utility for fine-grained unimodal vs. multimodal analysis.</p>
      <p>These limitations, namely, the lack of spontaneous, non-acted emotional expressions, limited diversity
of monologue data, and insufficient attention to isolated modality annotations, motivate the use of
new datasets designed to better reflect natural emotional communication. The UniC [30] dataset is
a multimodal emotion dataset comprising 965 video clips sourced from YouTube, selected to capture
natural, spontaneous emotional expressions rather than acted performances. The videos primarily
include monologues such as book and movie reviews, where a single visible speaker expresses emotions
clearly in both speech and facial expressions. The dataset was constructed through a multi-step filtering
process using keyword searches, sentiment-based subtitle filtering, and manual validation. Each clip,
approximately 10 seconds long, was annotated independently across four modalities: text, audio,
silent video, and all modalities combined. Emotion annotations use both categorical and dimensional
frameworks. Initially based on 26 categorical emotion labels from Shaver et al. [31], these were reduced
to seven emotion clusters (joy, contentment, surprise, confusion, neutral, disappointment, and disgust)
via clustering analysis, alongside valence and arousal scores. Figure 1 shows a sample from the UniC
dataset.</p>
      <p>For our experiments, we focused on the text modality as a stepping stone to multimodal emotion
expression generation in follow-up research. Noteworthy to mention is that for this text modality, the
inter-annotator agreement (IAA) was highest, reaching a Fleiss’ kappa of 0.47 after annotator training
and emotion clustering. Among the different labeled emotions, confusion and surprise
were less reliably detected from text alone, highlighting the added value of multimodal signals. We
evaluated the text modality of the UniC dataset using several baseline models, for which we used 100%
of the dataset for testing. Due to the limited size of the dataset, we employed 5-fold cross-validation for
training and evaluating our custom model.</p>
      <p>As shown in Table 1, our model does not achieve the highest performance across any metric. Among
the evaluated models, michellejieli and j-hartmann are fine-tuned emotion classifiers based on the
DistilRoBERTa-base [34] architecture. The bart-large-mnli model, a zero-shot classifier built on the
BART-large [35] transformer, is used without fine-tuning. The gpt-4o-mini model, on the other hand, is
an LLM that predicts emotions through prompt-based reasoning. Notably, michellejieli achieves the
highest accuracy (0.4492) and precision (0.4205), while gpt-4o-mini performs best in recall (0.4496) and
F1 score (0.3579). Our approach, which combines BAAI-bge-m3 embeddings [36] with a tuned Random
Forest classifier [37], yields moderate but consistent results across all metrics, likely because it was trained
only on the UniC dataset (772 training samples). The classifier’s hyperparameters are shown in Table 2. It is
important to highlight that these relatively low performance scores are primarily due to the nature of
the dataset, which consists of natural, non-acted emotional expressions.</p>
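        <p>The embedding-plus-classifier baseline above can be sketched as follows. This is a minimal illustration, not the authors' code: random vectors stand in for the BAAI-bge-m3 sentence embeddings (at reduced dimensionality), the label set mirrors the seven UniC emotion clusters, and the Random Forest hyperparameters are placeholders for the tuned values in Table 2.</p>

```python
# Minimal sketch of the baseline: sentence embeddings + Random Forest,
# evaluated with 5-fold cross-validation as in the paper. Random vectors
# stand in for BGE-M3 embeddings; hyperparameters are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(772, 128))   # stand-in for 772 UniC text embeddings
y = rng.integers(0, 7, size=772)  # 7 emotion clusters (joy, ..., disgust)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```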
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Emotional text generation and its evaluation have been less explored through analogical methods,
despite their proven utility in measuring structured semantic relations. Chen et al. [38] systematically
analyzed vector-based analogies, confirming their reliability in capturing such relations, and Zhu and
De Melo [39] extended analogical reasoning to contextualized sentence embeddings, showing that
some models preserve analogical structures at the sentence level. To our knowledge, no prior work
has applied analogy-based evaluation specifically to the assessment of emotional expressiveness in
generated text.</p>
      <p>Building on these insights, our methodology employs analogy-based evaluation to quantify the
emotional expressiveness of LLMs. To rigorously isolate the model’s generative capabilities, we begin
by neutralizing the emotional content of each ground-truth (GS) text in our dataset using an LLM.
Following neutralization, the model is prompted to regenerate the emotional version of each text. The
neutralization step is crucial: by comparing the regenerated emotional outputs with the original GS
emotions, we ensure that any observed afective content arises from the model’s learned patterns rather
than residual cues in the input. Finally, we compute embedding-based similarity and analogy metrics
between the GS and regenerated texts, enabling quantification of both semantic fidelity and emotional
alignment.</p>
      <sec id="sec-4-1">
        <title>4.1. Embedding Evaluation Metric</title>
        <p>Before focusing on the embedding evaluation metric, we should mention that all embeddings were
calculated using the BGE-M3 [36] language model, and the 2D space was generated using the t-SNE
[40] method applied to the BGE-M3 embedding space.</p>
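        <p>The 2D projection used in the figures can be reproduced in outline as follows; random vectors stand in for the BGE-M3 embeddings, and the perplexity value is an assumption (it only needs to be smaller than the number of points).</p>

```python
# Project high-dimensional sentence embeddings to 2D with t-SNE for
# visualization. Random vectors stand in for BGE-M3 embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 64))  # 60 stand-in embeddings

xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(xy.shape)  # one (x, y) point per embedding
```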
        <p>In our embedding evaluation metric, we draw inspiration from the well-known linguistic analogy:
“king - man + woman ≈ queen”. This example illustrates how word embeddings can capture semantic
relationships through vector arithmetic [41]. By representing words as vectors in a high-dimensional
space, operations such as subtraction and addition can reveal underlying relationships, such as gender
or emotional tone. This property enables the assessment of emotional quality in generated text by
analyzing geometric relationships between word vectors, offering a quantitative measure of emotional
expressiveness in language models. Figure 2 visually demonstrates this concept, showing how vector
operations can encode semantic relationships in the embedding space.</p>
        <p>In Figure 2, the length and direction of the vectors E(king) − E(queen) and E(man) − E(woman)
appear to be the same. However, this does not reflect reality. In a realistic scenario, we would expect
the vector E(king) − E(man) + E(woman) to be close to E(queen). Using BGE-M3, we calculated
the embeddings for queen, king, man, and woman. As shown in Figure 3, the expression E(king) −
E(man) + E(woman) is not exactly equal to E(queen), but it is close.</p>
        <p>4.1.1. Cosine Similarity vs. Manhattan Distance</p>
        <p>A common method for measuring similarity between two vectors is the cosine similarity metric. However,
in analogy tasks, this method has a major limitation: the results can vary based on the operation order.
Consider the analogy: king is to queen as man is to woman. The similarity and distance scores for
various formulations are summarized in Table 3.</p>
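        <p>The analogy can be made concrete with hand-made two-dimensional vectors whose axes encode royalty and gender; this toy setup is our illustration, not the paper's data, and real BGE-M3 embeddings satisfy the relation only approximately.</p>

```python
# Toy "king - man + woman ≈ queen" analogy with hand-made 2D vectors
# (axes: royalty, maleness). With real embeddings the match is only
# approximate, as the paper's Figure 3 shows.
import numpy as np

E = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}
analogy = E["king"] - E["man"] + E["woman"]
# Nearest vocabulary item to the analogy vector by Manhattan (L1) distance:
nearest = min(E, key=lambda w: np.abs(E[w] - analogy).sum())
print(nearest)  # → queen
```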
        <p>As shown in Table 3, different operation orders produce varying cosine similarity scores, revealing
inconsistency in the evaluation of the cosine-based analogy. In contrast, the Manhattan distance
produces stable results across all permutations, indicating its robustness for analogy reasoning tasks.
Due to its consistent behavior, we use the Manhattan distance for the analogy evaluation in
our experiments.</p>
        <p>4.1.2. Real Emotional Example</p>
        <p>To better understand the role of emotional analogy in our framework, we illustrate a representative
example from our experiments. The goal is to analyze how vector arithmetic in the embedding space
can capture shifts in emotional expression between sentences. Figure 4 visualizes this example. The
corresponding text for each variable in the figure is as follows:
• joy = “joy”
• neutral = “neutral”
• neutral_sent = “It’s my first day as a student”
• joy_sent = “I’m so happy, it’s my first day as a student!”</p>
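        <p>The order-dependence of cosine similarity, and the corresponding stability of the Manhattan distance, can be verified directly: moving a term across the comparison leaves the L1 distance unchanged, since both arrangements reduce to the norm of the same difference vector, while the two cosine formulations generally differ. The vectors below are random stand-ins.</p>

```python
# Rearranging the analogy terms changes cosine similarity but not the
# Manhattan distance: ||(k - m + w) - q||_1 == ||k - (q + m - w)||_1,
# since both reduce to the L1 norm of the same difference vector.
import numpy as np

rng = np.random.default_rng(1)
king, queen, man, woman = rng.normal(size=(4, 8))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l1(a, b):
    return float(np.abs(a - b).sum())

d1 = l1(king - man + woman, queen)   # one operation order
d2 = l1(king, queen + man - woman)   # rearranged order
c1 = cos(king - man + woman, queen)
c2 = cos(king, queen + man - woman)
print(abs(d1 - d2), abs(c1 - c2))
```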
        <p>In Figure 4, we observe that the distance between the neutral and joy emotion embeddings is
relatively large. This discrepancy poses a challenge for emotional analogy, as the semantic distance
between the two sentence embeddings (neutral_sent and joy_sent) is significantly smaller than the
distance between their corresponding emotion labels. To mitigate this, we construct an analogy vector
using the following equation:
analogy_vector = neutral − neutral_sent + joy_sent   (1)</p>
        <p>This vector is then compared with the joy embedding. As shown in Figure 4, the analogy vector
lies closer to joy than to neutral, indicating that the analogy operation effectively captures the intended
emotional shift.</p>
        <p>Recognizing emotions in real user utterances is particularly challenging due to their subtle and
nuanced nature. As shown in Table 1, the best model achieves an F1-score of only 35.79%,
significantly lower than the 60.25% observed on acted datasets like MELD [25]. To further investigate this
phenomenon, we visualized the semantic structure of emotion representations using the BGE-M3
embedding model. Figure 5 shows a 2D projection of both the emotion label embeddings and the
average embeddings of real user utterances associated with each emotion. In this plot, each circle
represents an emotion label (e.g., joy, disgust, neutral), and each square denotes the average embedding
of utterances tagged with that emotion. Two sets of relationships are highlighted:
• Red lines connect the embedding of the label neutral to other emotion labels.
• Green lines connect the average embedding of utterances labeled as neutral to the average
embeddings of utterances for other emotions.</p>
        <p>The figure reveals that while the emotion labels are well-separated in the embedding space, indicating
clear semantic distinctions, the average embeddings of real user expressions are clustered more closely
together, especially around the neutral region. This supports the idea that emotional language in real
interactions is often more subtle, making automatic emotion detection more challenging in natural
contexts.</p>
        <p>To better understand how emotional meaning is encoded in sentence embeddings, we explore the
relationship between labeled and unlabeled emotional expressions. Specifically, we aim to approximate
the embedding of an emotionally tagged utterance using its neutral version and the emotional shift
encoded in a semantically aligned sentence. Here, labeled emotion refers to utterances that include
direct emotion labels from the gold-standard data in the UniC dataset (e.g., “I’m so happy, it’s my first day
as a student! (joy emotion)”), while unlabeled emotion refers to emotionally expressive content without
such tags but still conveying affect (e.g., “I’m so happy, it’s my first day as a student!”). Neutral versions
are afectively flat and omit emotional cues.</p>
        <p>Our approach applies an analogy-style vector transformation of the form: neutral − neutral_sent
+ joy_sent, where neutral_sent and joy_sent are the neutral and emotionally expressive versions of
the same utterance. This transformation enriches the affective content of the neutral-tagged embedding
by injecting the emotional variation from the unlabeled expression, while preserving the shared semantic
structure. The goal is to reduce the distance between the synthesized embedding and its explicitly
emotional counterpart, effectively revealing how emotional meaning can be reconstructed through
compositional operations. Figure 6 visualizes this transformation. The green arrow illustrates the
analogy vector described above, and the dashed lines indicate the proximity between the predicted and
actual emotion embeddings. The text associated with each vector in the figure is as follows:
• joy = “I’m so happy, it’s my first day as a student! (joy emotion)”</p>
        <p>• neutral = “It’s my first day as a student (neutral emotion)”
• neutral_sent = “It’s my first day as a student”
• joy_sent = “I’m so happy, it’s my first day as a student!”</p>
        <p>4.2. Emotion Embedding Extraction Using Prompted Text Templates</p>
        <p>As discussed in Section 4.1.1, we use the Manhattan distance as our similarity metric due to its sensitivity
to subtle semantic variations in the embedding space. This metric is essential for evaluating how
emotional content can be manipulated while preserving the original meaning. Our goal is to identify
the most effective prompt template for extracting emotion embeddings from textual descriptions.
These embeddings, denoted e_o, e_n, and e_t, represent the original, neutral, and target emotional
states, respectively. By inserting emotion-related phrases into structured prompt templates, we derive
these embeddings for use in analogy-based transformations. The transformation involves two steps:
neutralization and emotionalization. Let s_o, s_n, and s_t be the sentence embeddings for the original,
neutral, and target emotional versions of the same sentence, and let MD(x, y) denote the Manhattan
distance between embeddings x and y. The neutralization step tests whether removing the original
emotion embedding and inserting the neutral one moves the sentence embedding closer to s_n:</p>
        <p>MD(s_o, s_n) ≥ MD(s_o − e_o + e_n, s_n)   (2)</p>
        <p>The emotionalization step checks whether inserting the target emotion into the neutral embedding
moves it closer to s_t:</p>
        <p>MD(s_n, s_t) ≥ MD(s_n − e_n + e_t, s_t)   (3)</p>
        <p>These conditions validate whether modifying sentence embeddings via emotional vectors steers them
toward the intended emotional states. A transformation is deemed successful when both inequalities
are satisfied.</p>
        <p>System Prompt 1: Text Neutralization
Your task is to neutralize the text by removing emotional expressions.
The text is a transcription of a video.
The text may contain emotional expressions.
The text should be neutral and not contain any emotional expressions.
The text should be in the same language, format, style, tone, and context as the input text.
Please try to change the text as little as possible.
Please neutralize the following text: {text}
The original emotion of the text is: {emotion}
Please make sure to remove all emotional expressions from the text.</p>
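        <p>The two analogy conditions can be checked mechanically. In this sketch, s_o, s_n, and s_t denote sentence embeddings of the original, neutralized, and target-emotion versions, and e_o, e_n, e_t the emotion-prompt embeddings; this notation is our reading of the quantities in Equations 2 and 3, with random vectors standing in for BGE-M3 output.</p>

```python
# Analogy-success check: both inequalities (Eq. 2 and Eq. 3) must hold.
# Random vectors stand in for real BGE-M3 sentence/emotion embeddings.
import numpy as np

def md(a, b):
    """Manhattan (L1) distance between two embeddings."""
    return float(np.abs(a - b).sum())

def neutralization_ok(s_o, s_n, e_o, e_n):
    # Removing the original emotion and adding the neutral one should
    # move the original sentence embedding closer to the neutral one.
    return md(s_o - e_o + e_n, s_n) <= md(s_o, s_n)

def emotionalization_ok(s_n, s_t, e_n, e_t):
    # Injecting the target emotion should move the neutral sentence
    # embedding closer to the target emotional sentence.
    return md(s_n - e_n + e_t, s_t) <= md(s_n, s_t)

rng = np.random.default_rng(0)
s_o, s_n, s_t, e_o, e_n, e_t = rng.normal(size=(6, 16))
success = neutralization_ok(s_o, s_n, e_o, e_n) and emotionalization_ok(s_n, s_t, e_n, e_t)
print(success)
```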
        <p>System Prompt 2: Emotional Text Generation
Your task is to make the text more emotional by adding emotional expressions.
The text is a transcription of a video.
The text should be in the same language, format, style, tone, and context as the input text.
Please try to change the text as little as possible.
Don’t mention the emotion in the text directly.
Please add emotional expressions to the following text: {text}
The current emotion of the text is: neutral.
The target emotion of the text should be: {emotion}.</p>
        <p>To identify the most effective prompt template for extracting emotion embeddings, we evaluated five
candidate prompt formulations across several LLMs. These templates vary in how they contextualize
emotion labels with respect to the text, ranging from labeled structures (e.g., “joy emotion: {text}”) to
minimal expressions (e.g., just “joy”).</p>
        <p>Our evaluation follows a two-step analogy-based framework. In the neutralization step, we generated
neutral versions of emotional sentences using each LLM with a fixed system instruction based on
System Prompt 1. To extract the emotion embeddings used in Equation 2, we tested the five
emotion prompt templates by plugging them into an embedding encoder. In the emotionalization step,
we used System Prompt 2 to generate emotionalized sentences from neutral ones and evaluated how
well each emotion prompt template performed using Equation 3 with the target emotion embedding.
The following are the details about the emotion embedding prompts:
• Prompt 1: {emotion} emotion: {text}</p>
        <p>As shown in Table 4, we identify the best-performing emotion prompt template for each step of the
evaluation. Using the entire text-only UniC dataset for evaluation, we conduct experiments on two
tasks: neutralization and emotionalization. For neutralization, Prompt 3 achieves the highest analogy
satisfaction rates across most models. For emotionalization, Prompt 1 performs best, indicating its
effectiveness in reintroducing emotional content through embedding manipulation. These findings
suggest that different prompt styles may be optimal for extracting emotion embeddings depending on
the specific transformation goal.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis and Results</title>
      <p>
        Having decided on the Manhattan Distance to compare the embedding vectors (Section 4.1.1) and on
using distinct prompt templates for extracting emotion embeddings depending on the transformation
stage (Section 4.2), we set up an experiment in which our goal was to evaluate the impact of emotion
generation by comparing the original emotional data with the emotionally re-generated text. Specifically,
we used Prompt 3 for the neutralization stage and Prompt 1 for the emotionalization stage, as each
achieved the highest analogy satisfaction rates for their respective tasks across most models. To enable
a broad comparison, we evaluated a range of LLMs, including open-source models such as Gemma
[42], LLaMA-3 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and Mistral-NeMo [43], as well as commercial models like GPT-4.1 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
GPT-4o-Mini [
        <xref ref-type="bibr" rid="ref10">10</xref>
] from OpenAI. This mix allowed us to assess the effectiveness of emotion embedding
manipulation across both accessible, community-driven models and state-of-the-art proprietary systems.
All evaluations are on the UniC dataset’s text modality. The process consisted of the following two
main steps:
      </p>
      <sec id="sec-5-1">
        <title>5.1. Neutralization</title>
        <p>We used an LLM to neutralize the emotional content of the original text samples. This step aimed to
remove any labeled or unlabeled emotional signals, resulting in emotionally flat, semantically preserved
text. In this experiment, we used System Prompt 1. The following formulas were used in the tables to
evaluate the performance of different models. In these equations, A denotes the analogy vector.</p>
        <p>A = s_o − e_o + e_n
R1_c = cos(s_n, s_o), R2_c = cos(s_n, A)
R1_m = ‖s_n − s_o‖1, R2_m = ‖s_n − A‖1</p>
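        <p>For a single sample, the scores reduce to a few lines of code. The pairing of arguments follows our reading of Equation 2 (R1 compares the generated neutral sentence with the original, R2 compares it with the analogy vector), and random vectors again stand in for BGE-M3 embeddings.</p>

```python
# Per-sample neutralization scores: R1 compares the generated neutral
# sentence with the original sentence, R2 compares it with the analogy
# vector A = s_o - e_o + e_n. Vectors are random stand-ins.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l1(a, b):
    return float(np.abs(a - b).sum())

rng = np.random.default_rng(0)
s_o, s_n, e_o, e_n = rng.normal(size=(4, 16))

A = s_o - e_o + e_n                      # analogy vector
r1_c, r2_c = cos(s_n, s_o), cos(s_n, A)  # cosine scores
r1_m, r2_m = l1(s_n, s_o), l1(s_n, A)    # Manhattan scores
print(r1_c - r2_c, r1_m - r2_m)          # the deltas reported per model
```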
        <p>As shown in Table 5, we evaluate each LLM’s ability to perform emotional neutralization based on
how well the transformed sentence embedding aligns with the original emotional context vector. The
evaluation uses both cosine similarity and Manhattan distance to capture diferent aspects of embedding
relationships. In both cosine similarity and Manhattan distance metrics, GPT-4.1 demonstrates the most
controlled and semantically faithful emotion neutralization among all evaluated models. While
llama-3.3-70b-instruct achieves the highest post-neutralization cosine similarity (R2_c = 0.9746) and lowest
Manhattan distance (R2_m = 5.33), GPT-4.1 yields the smallest changes in both cosine (Δ_c = 0.0410)
and Manhattan metrics (Δ_m = −3.297), indicating minimal semantic distortion during transformation.
This suggests that GPT-4.1 preserves original sentence meaning more effectively while removing
emotional content. Overall, while high-capacity open-source models like mistral-nemo-12b-instruct
are increasingly competitive, commercial models such as GPT-4.1 still lead in performance when
performing nuanced tasks like emotion neutralization.</p>
        <p>To evaluate the consistency between different similarity metrics, we computed the Pearson correlation
[44] between the Manhattan-distance and cosine-similarity values for both the R2 and R1 scores. As shown in
Figure 7, there is a very strong negative correlation between the two measures for both R2 (r = −0.99365)
and R1 (r = −0.99501). These results indicate that as the Manhattan distance increases, the cosine
similarity decreases almost linearly, suggesting that both metrics capture highly similar trends in
evaluating the transcripts, albeit in opposite directions due to their different mathematical formulations.</p>
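        <p>The metric-agreement check can be reproduced with numpy's corrcoef, which returns the Pearson correlation matrix of its inputs; the score lists below are toy values, not the paper's results.</p>
        <preformat>
```python
import numpy as np

def metric_correlation(manhattan_scores, cosine_scores):
    # Pearson correlation between paired Manhattan-distance and
    # cosine-similarity scores computed over the same transcripts
    return float(np.corrcoef(manhattan_scores, cosine_scores)[0, 1])

# Perfectly anti-correlated toy scores give r = -1.0:
r = metric_correlation([1.0, 2.0, 3.0], [0.9, 0.6, 0.3])
```
        </preformat>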
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Emotion Injection</title>
        <p>In the emotion injection phase, we used an LLM to reintroduce a target emotion (e.g., joy) into the
neutralized text. To guide this process, we prompted the model using system-level instructions and
emotion-specific cues. The goal was to generate emotionally expressive text that closely resembles
the original emotional content while preserving the core semantics of the neutralized version. In this
experiment, we used System Prompt 2. To perform the re-injection, we compute the analogy vector v
based on the following relationship:</p>
        <p>v = e_neut − e_neu + e_emo
R1c = cos(e_neut, e_inj), R1m = ‖e_neut − e_inj‖₁
R2c = cos(e_inj, v), R2m = ‖e_inj − v‖₁
R3c = cos(e_inj, e_orig), R3m = ‖e_inj − e_orig‖₁</p>
        <p>Here e_neut, e_inj, and e_orig denote the embeddings of the neutralized, emotion-injected, and original
utterances, and e_emo and e_neu the embeddings of the target emotion label and the neutral label.</p>
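        <p>Under the same reading as in the neutralization step, the three injection relations can be sketched as follows; the names (injection_scores, e_inj for the embedding of the emotion-injected utterance, and so on) are illustrative.</p>
        <preformat>
```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def manhattan(a, b):
    # L1 (Manhattan) distance between two embedding vectors
    return float(np.abs(a - b).sum())

def injection_scores(e_neut, e_inj, e_orig, e_emo, e_neu):
    """R1/R2/R3 relations for the emotion-injection step (illustrative names)."""
    v = e_neut - e_neu + e_emo  # analogy vector for re-injection
    return {
        "R1c": cosine(e_neut, e_inj), "R1m": manhattan(e_neut, e_inj),
        "R2c": cosine(e_inj, v),      "R2m": manhattan(e_inj, v),
        "R3c": cosine(e_inj, e_orig), "R3m": manhattan(e_inj, e_orig),
    }
```
        </preformat>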
        <p>Table 7 shows the performance of various LLMs in emotion injection. GPT-4.1 achieves the best
results overall, with the lowest distances and highest cosine similarities (e.g., R3m = 9.99, R3c = 0.9130),
indicating strong emotional alignment and reinjection ability. In contrast, gemma-3-1b-it performs the
weakest, especially in re-injection quality (R3c = 0.6990). While commercial models like GPT-4.1 and
GPT-4o-mini outperform others due to superior training and architecture, larger open-source models
such as LLaMA-3.3-70B and Mistral-Nemo-12B show competitive performance, suggesting that open
models can still be effective in emotion-aware tasks.</p>
        <p>Table 8 reports emotion-wise performance of GPT-4.1 on the emotion injection task. The model
performs consistently across all emotions, with strong alignment scores (e.g., R3c &gt; 0.88) and small
Manhattan distances. Notably, the neutral class achieves the best results (R3c = 0.9280, R1m = 2.30),
which is expected since the model is converting a neutralized utterance back to a neutral form, making
the reinjection task considerably easier in this case.</p>
        <p>To further validate the metric alignment, we conducted a correlation analysis between Manhattan
distance and cosine similarity across the three relations (R1, R2, R3). All pairs exhibit strong negative
correlations below −0.9819, confirming the inverse relationship between the two metrics (see Figure 8).
These results confirm that increased directional similarity corresponds closely with reduced embedding
distance, validating the use of both metrics to quantify emotional fidelity in the reinjection process.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Works</title>
      <p>In this study, we explored the capability of LLMs to manipulate and generate emotionally expressive
text through a two-step process: emotional neutralization followed by targeted emotion injection. Using
embedding-based similarity metrics such as Manhattan distance and cosine similarity, we quantitatively
evaluated the extent to which LLMs can remove and reintroduce specific emotions while preserving the
semantic core of the original text. Our findings indicate that GPT-4.1, a commercial model, consistently
outperforms other models in maintaining semantic fidelity and accurately reconstructing emotional
nuances. Among open-source models, LLaMA-3.3-70B-Instruct demonstrates the best performance in
our experiments, making it a strong candidate for accessible, open research in emotion-aware language
generation. These results underscore the effectiveness of large-scale LLMs for emotion control and
expression in text and provide a foundation for broader affective computing applications. Although our
current focus is on the text modality, the proposed framework is explicitly designed to extend to speech
and visual channels by leveraging shared embedding spaces. In particular, recent work by Jha et al. [45],
which builds upon the Platonic Representation Hypothesis introduced by Huh et al. [46], demonstrates
that as neural networks scale, internal representations across modalities converge toward a shared
statistical model of reality. This convergence enables cross-modal affective analysis without requiring
paired training data, providing a strong theoretical and practical foundation for our future work.</p>
      <p>In addition to aligning emotional content across text, speech, and visual modalities within unified
embedding spaces, our future efforts will also involve improved prompt engineering and the development
of more expressive embedding models to enhance emotional transformation capabilities. As a concrete
application, we aim to develop a multimodal empathetic conversational agent for second language
(L2) learning. By engaging students in emotionally supportive interactions, such agents can foster
psychologically safe and motivating learning environments while assisting teachers in managing
affective dynamics in the classroom.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research received funding from the Flemish Government under the Flanders Artificial Intelligence
Research program (FAIR) (174K02325).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-4 and Grammarly for grammar and
spelling checking.</p>
    </sec>
    <sec id="sec-9">
      <title>References (continued)</title>
      <p>[20] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
[21] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
[22] J. Li, M. Galley, C. Brockett, J. Gao, B. Dolan, A diversity-promoting objective function for neural conversation models, arXiv preprint arXiv:1510.03055 (2015).
[23] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[24] R. Janssens, J. De Bock, S. Labat, E. Verhelst, V. Hoste, T. Belpaeme, Why robots are bad at detecting their mistakes: Limitations of miscommunication detection in human-robot dialogue, in: IEEE RO-MAN 2025 conference, 2025.
[25] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, Meld: A multimodal multi-party dataset for emotion recognition in conversations, arXiv preprint arXiv:1810.02508 (2018).
[26] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, Iemocap: Interactive emotional dyadic motion capture database, Language Resources and Evaluation 42 (2008) 335–359.
[27] H. Sun, X. Wang, J. Zhao, S. Zhao, J. Zhou, H. Wang, J. He, A. Kong, X. Yang, Y. Wang, et al., Emotiontalk: An interactive chinese multimodal emotion dataset with rich annotations, arXiv preprint arXiv:2505.23018 (2025).
[28] J. Zhao, T. Zhang, J. Hu, Y. Liu, Q. Jin, X. Wang, H. Li, M3ed: Multi-modal multi-scene multi-label emotional dialogue database, arXiv preprint arXiv:2205.10237 (2022).
[29] C. Y. Park, N. Cha, S. Kang, A. Kim, A. H. Khandoker, L. Hadjileontiadis, A. Oh, Y. Jeong, U. Lee, K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations, Scientific Data 7 (2020) 293.
[30] Q. Du, S. Labat, T. Demeester, V. Hoste, UniC: A dataset for emotion analysis of videos with multimodal and unimodal labels, Language Resources and Evaluation (2025) 1–36.
[31] P. Shaver, J. Schwartz, D. Kirson, C. O’Connor, Emotion knowledge: further exploration of a prototype approach, Journal of Personality and Social Psychology 52 (1987) 1061.
[32] J. Hartmann, Fine-tuned DistilRoBERTa-base for Emotion Classification, https://huggingface.co/michellejieli/emotion_text_classifier/, 2022. Accessed: 2025-05-07.
[33] J. Hartmann, Emotion english distilroberta-base, https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/, 2022.
[34] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[35] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
[36] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. arXiv:2402.03216.
[37] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[38] D. Chen, J. C. Peterson, T. L. Griffiths, Evaluating vector-space models of analogy, arXiv preprint arXiv:1705.04416 (2017).
[39] X. Zhu, G. De Melo, Sentence analogies: Linguistic regularities in sentence embeddings, in: Proceedings of the 28th international conference on computational linguistics, 2020, pp. 3389–3400.
[40] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
[41] C. Allen, T. Hospedales, Analogies explained: Towards understanding word embeddings, in: International Conference on Machine Learning, PMLR, 2019, pp. 223–231.
[42] G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al., Gemma 3 technical report, arXiv preprint arXiv:2503.19786 (2025).
[43] M. AI, Mistral nemo, https://mistral.ai/news/mistral-nemo/, 2024. Accessed: September 23, 2024.
[44] K. Pearson, VII. Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London 58 (1895) 240–242.
[45] R. Jha, C. Zhang, V. Shmatikov, J. X. Morris, Harnessing the universal geometry of embeddings, arXiv preprint arXiv:2505.12540 (2025).
[46] M. Huh, B. Cheung, T. Wang, P. Isola, The platonic representation hypothesis, arXiv preprint arXiv:2405.07987 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ortega-Ochoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arguedas</surname>
          </string-name>
          , T. Daradoumis,
          <article-title>Empathic pedagogical conversational agents: a systematic literature review</article-title>
          ,
          <source>British Journal of Educational Technology</source>
          <volume>55</volume>
          (
          <year>2024</year>
          )
          <fpage>886</fpage>
          -
          <lpage>909</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanjeewa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Apputhurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wickramasinghe</surname>
          </string-name>
          , D. Meyer,
          <article-title>Empathic conversational agent platform designs and their evaluation in the context of mental health: Systematic review</article-title>
          ,
          <source>JMIR Mental Health</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <article-title>e58974</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Algesheimer</surname>
          </string-name>
          ,
          <article-title>Empathic voice assistants: Enhancing consumer responses in voice commerce</article-title>
          ,
          <source>Journal of Business Research</source>
          <volume>175</volume>
          (
          <year>2024</year>
          )
          <fpage>114566</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Huppert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perry</surname>
          </string-name>
          , et al.,
          <article-title>Considering the role of human empathy in ai-driven therapy</article-title>
          ,
          <source>JMIR Mental Health</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <article-title>e56529</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Controllable emotion generation with emotion vectors</article-title>
          ,
          <source>arXiv preprint arXiv:2502.04075</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.-n.</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yoshino</surname>
          </string-name>
          ,
          <article-title>Ai with emotions: Exploring emotional expressions in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2504.14706</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Sibyl:
          <article-title>Empowering empathetic dialogue generation in large language models via sensible and visionary commonsense inference</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Computational Linguistics</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] OpenAI,
          <source>Introducing gpt-4.1</source>
          , https://platform.openai.com/docs/models/gpt-4.1,
          <year>2025</year>
          . Accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Introducing gpt-4o-mini, https://platform.openai.com/docs/models/gpt-4o-mini,
          <year>2025</year>
          . Accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Facebook-AI</surname>
          </string-name>
          , facebook/bart-large-mnli, https://huggingface.co/facebook/bart-large-mnli,
          <year>2020</year>
          . Accessed: 2025-06-12.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Movshovitz-Attias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cowen</surname>
          </string-name>
          , G. Nemade,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <article-title>Goemotions: A dataset of fine-grained emotions</article-title>
          ,
          <source>arXiv preprint arXiv:2005.00547</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Khan</surname>
          </string-name>
          , sentiment-model-sample-27go-emotion, https://huggingface.co/jkhan447/sentiment-model-sample-27go-emotion,
          <year>2022</year>
          . Accessed: February 11,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , L. Nie,
          <article-title>Enhancing emotional generation capability of large language models via emotional chain-of-thought</article-title>
          ,
          <source>arXiv preprint arXiv:2401.06836</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Goleman</surname>
          </string-name>
          ,
          <article-title>Emotional intelligence: Why it can matter more than IQ</article-title>
          , Bantam,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <source>Introducing gpt-3.5</source>
          , https://platform.openai.com/docs/models/gpt-3.5-turbo,
          <year>2025</year>
          . Accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Boureau</surname>
          </string-name>
          ,
          <article-title>Towards empathetic open-domain conversation models: A new benchmark and dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1811.00207</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Demasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sabour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Towards emotional support dialog systems</article-title>
          ,
          <source>arXiv preprint arXiv:2106.01144</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>