<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Bidirectional Emotional Influence in Human-LLM Interaction: Empirical Analysis and Methodological Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Gozzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fallucchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering Sciences, Guglielmo Marconi University</institution>
          ,
          <addr-line>00193 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Institute for Educational Media, Georg Eckert Institute</institution>
          ,
          <addr-line>Freisestraße 1, 38118 Braunschweig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leithà - Unipol Group</institution>
          ,
          <addr-line>40128 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent advances in natural language processing have highlighted the potential of Large Language Models (LLMs) to adapt to diverse communicative contexts, yet their sensitivity to emotional framing remains underexplored. Prior work has examined stylistic adaptation and sentiment control, but limited attention has been paid to how emotional tone in prompts influences both model behavior and human interpretation. We investigate the role of emotional tone in shaping interactions between humans and LLMs, with a focus on model performance and user perception. We propose a dual-experiment setup: (1) Experiment Alpha evaluates how emotional prompt framing (joy, apathy, anger, fear) impacts LLM performance across SuperGLUE tasks; (2) Experiment Omega introduces a validated experimental framework to study how emotion-conditioned LLM responses affect human comprehension and perception, within an educational setting involving Italian-speaking participants. The Alpha results show that prompts framed with joy and apathy lead to better task performance, with gains of up to 4.48 percentage points. In Omega, fine-tuned models generated a 19% increase in joy-aligned responses, demonstrating the feasibility of affect-conditioned generation. These findings suggest promising applications for emotion-aware LLMs in education, virtual assistants, and affective computing.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Affective Computing</kwd>
        <kwd>Human-Computer Interaction</kwd>
        <kwd>Fine-Tuning</kwd>
        <kwd>Emotion-conditioned Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Experiment Omega addresses emotionally expressive LLM outputs in educational settings. Although Omega has not yet been deployed to end users, the infrastructure and the corresponding fine-tuned Velvet-14B model [6] variants were developed and evaluated for emotion-conditioned generation.</p>
      <p>Our central hypothesis challenges the assumption that emotional neutrality is optimal for task performance or user engagement. Instead, we posit that emotionally charged inputs may better align with the model's training distribution and that expressive outputs could enhance user trust, attention, and retention, particularly in pedagogical or assistive applications.</p>
      <p>The main contributions of this paper are:</p>
      <list list-type="order">
        <list-item>
          <p>A controlled dual-experiment design that quantifies the influence of emotional tone in both prompts and responses.</p>
        </list-item>
        <list-item>
          <p>Empirical evidence of how LLM performance varies across emotional conditions in the input.</p>
        </list-item>
        <list-item>
          <p>A validated experimental framework and a set of fine-tuned model variants to support future research on emotion-conditioned human-LLM interaction.</p>
        </list-item>
      </list>
      <p>The paper is organized as follows. Section 2 reviews previous work on emotion-aware language models and affective computing. Section 3 details our dual-experiment methodology, comprising the Alpha experiment on prompt-induced emotional effects in LLMs and the Omega framework for studying emotion-conditioned model outputs in educational settings. Section 4 presents the results of both experiments, followed by a discussion of their implications in Section 5. We conclude by outlining future directions for emotion-sensitive human–LLM interaction research.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background</title>
      <p>Recent work in NLP and affective computing has explored how LLMs respond to emotionally charged prompts. Studies indicate that affective signals in prompts can influence both the emotional tone and the performance of LLMs on tasks [1]. However, the mechanisms underlying these effects remain debated: do LLMs genuinely process emotional content, or do they merely simulate it through pattern matching?</p>
      <p>LLMs have shown competence in tasks involving affect recognition and empathy simulation, but limitations persist in emotional consistency, intensity calibration, and sensitivity to subtle cues [2]. Psychometric assessments suggest that models like GPT-4 can match or exceed human baselines in specific affect recognition benchmarks [3], though this performance likely reflects lexical-semantic association rather than experiential comprehension.</p>
      <p>As LLMs are increasingly deployed in emotionally sensitive domains (education, therapy, virtual assistance), understanding their affective capabilities is critical. Effective HCI depends not only on semantic accuracy but also on the model's ability to handle emotional context in a way that promotes trust and cognitive alignment [4].</p>
      <p>An important line of research has investigated the capacity of dialogue models to recognize and respond to users' emotions in a contextually appropriate manner. Rashkin et al. introduced the EmpatheticDialogues dataset, a collection of 25,000 emotionally grounded conversations designed to foster empathetic behavior in AI systems [7]. Their findings demonstrate that models fine-tuned on this resource are rated as more empathetic by human evaluators compared to those trained on generic conversational corpora. This underscores the limitations of large-scale pretraining alone in achieving affect-sensitive generation, and the value of explicit emotion supervision. While EmpatheticDialogues targets open-domain, affectively grounded dialogue, our work complements it by focusing on bidirectional affective influence in cognitively demanding contexts—modeling not only empathetic output but also how emotion-laden prompts modulate reasoning and how emotional responses impact user cognition and perception.</p>
      <p>Despite growing interest, few studies have quantified how different emotional tones in prompts affect model performance across standard NLP benchmarks. Similarly, the downstream effects of emotionally biased LLM responses on user cognition and perception, especially in open-ended, educational tasks, remain largely unexplored. Moreover, most prior work treats emotional content as stylistic variation rather than as a variable with measurable cognitive or perceptual impact. Our study addresses these gaps through two contributions: an empirical evaluation of how affect-laden prompts (joy, apathy, anger, fear) modulate LLM performance on SuperGLUE tasks, and a validated experimental framework for jointly assessing the perceptual and cognitive impact of emotion-conditioned LLM responses in user-facing tasks.</p>
      <p>These contributions are grounded in the understanding that, while LLMs do not possess experiential or affective grounding, their behavior can still reflect and amplify affective patterns learned from the data. LLMs operate through statistical association rather than emotional understanding: building on distributional semantics [8], they learn affective language patterns by processing massive text corpora and encoding them into high-dimensional vector spaces. Although emotionally connoted groups can be identified through methods such as PCA, UMAP, or probing techniques [9], these do not imply affective grounding. Unlike humans, who integrate symbolic reasoning with embodied emotional experience, LLMs infer meaning through probabilistic pattern recognition. As such, emotional fluency in model output reflects learned correlations, not genuine affect. This gap has implications for design, interpretation, and ethical use in emotionally charged contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>This study quantifies the bidirectional impact of
emotions on human–LLM interactions through two
experiments. Alpha examines how emotional framing in user
prompts affects LLM performance on reasoning tasks,
while Omega investigates how emotionally biased LLM
responses influence human decision-making.</p>
      <p>
        The Alpha experiment was conducted in English because the SuperGLUE
datasets are publicly available with predefined English ground truths, and we
designed and executed Alpha on those. The choice of language is not central
here; the key point is to analyze the effect that emotions have on performance.
Prompting in English is also generally good practice, as it helps avoid biases
against lower-resource languages [
        <xref ref-type="bibr" rid="ref1 ref26 ref4 ref8">10, 11</xref>
        ].
      </p>
      <p>Experiment Omega was designed in Italian to align
with the linguistic context of the educational setting
under investigation. This choice ensures ecological validity,
as it reflects the actual language used by students and
instructors in the targeted learning environment, thereby
enabling a more accurate assessment of comprehension
and affective perception in real-world conditions.</p>
      <sec id="sec-2-1">
        <title>3.1. Alpha: Analyzing the Impact of</title>
      </sec>
      <sec id="sec-2-2">
        <title>Emotions on Machine Performance</title>
        <p>This experiment investigates how emotional framing in
user prompts affects the performance of LLMs on
advanced language understanding tasks. By systematically
modulating the emotional tone of inputs across a subset
of SuperGLUE tasks, we aim to quantify the extent to
which LLM behavior is sensitive to affective cues. The
following subsections describe the experimental design,
implementation, data preparation, and evaluation
protocol.</p>
        <sec id="sec-2-2-1">
          <title>3.1.1. Experimental Design</title>
          <p>Experiment Alpha uses four emotional conditions to
frame user prompts, based on three of Ekman’s six basic
emotions [12] (joy, anger, and fear) plus a neutral
condition representing apathy, which serves as the baseline.
We introduce “apathy” not as a basic emotion, but as a
control condition meant to simulate emotionally neutral
or emotionally flat interaction. In this context, apathy
does not refer to the clinical absence of emotion, but to
a dispassionate tone that serves as a baseline. This
emotion set was designed to balance interpretability with
experimental feasibility, and should be considered a
pragmatic approximation rather than a strict adherence to
Ekman’s taxonomy. Joy, anger, and fear were selected
due to their universality and distinct valence and
activation profiles: joy as a positively valenced affect, anger
as a defense-oriented negative emotion, and fear as an
avoidance-oriented negative emotion. Their inclusion
allows testing both the valence and motivational
dimensions of affect in model reasoning under semantically
equivalent instructions.</p>
          <p>The experiment is grounded in SuperGLUE, a
benchmark designed to assess higher-order language
understanding capabilities such as inference, reasoning, and
contextual comprehension, dimensions that are
hypothesized to be particularly sensitive to emotional
modulation. A subset of eight tasks was selected based on
coverage and structural diversity: BoolQ (Boolean
Question Answering) [13], CB (CommitmentBank) [14], COPA
(Choice of Plausible Alternatives) [15], MultiRC
(MultiSentence Reading Comprehension) [16], ReCoRD
(Reading Comprehension with Commonsense Reasoning) [17],
WiC (Words in Context) [18], WSC (Winograd Schema
Challenge) [19], and RTE (Recognizing Textual
Entailment) [20]. These tasks span competencies including
entailment, causality, multi-sentence comprehension, and
word sense disambiguation. The mentioned eight
SuperGLUE tasks were chosen due to their reliance on nuanced
reasoning, contextual inference, and linguistic
ambiguity—dimensions where emotional framing can modulate
interpretive biases. Entailment tasks such as RTE and
CB require readers (or models) to assess whether a
hypothesis logically follows from a premise. Prior work
has shown that emotional salience can shape these
judgments by modulating perceived relevance or certainty
of the statements involved [21]. COPA tasks depend on
evaluating the most plausible cause or effect in a given
scenario. Emotions are known to modulate causal
reasoning, altering perceived plausibility by priming certain
associations or cognitive shortcuts [22].</p>
          <p>Alternative benchmarks, such as MMLU (Massive
Multitask Language Understanding) [23] and HELM (Holistic
Evaluation of Language Models) [24], were considered
but ultimately excluded. MMLU, while comprehensive,
focuses primarily on multiple-choice knowledge
questions; HELM emphasizes fairness and safety metrics.
Neither aligns well with our focus on fine-grained linguistic
interactions shaped by emotion. SuperGLUE, by contrast,
offers task types and input structures better suited to
capturing affect-sensitive model behavior.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>3.1.2. Implementation and Runtime Environment</title>
          <p>For each data set record, four variants of emotional
prompts were generated: apathy (intended as the
baseline), joy, anger, and fear. All records were processed in
all emotional conditions, ensuring exhaustive coverage repository [25]. This resource is provided to ensure
transand balanced comparison. parency and facilitate reproducibility of our experimental</p>
          <p>Model inference was performed locally using Ollama, framework.
with results stored in a MongoDB database. The pipeline For CB, RTE, and AX-g, the precision of the entailment
was implemented as a Python CLI application, aiming to classification was calculated by matching the predicted
lasupport full automation, reproducibility, and structured bels (“entailment” / “not_entailment”) using regex. COPA
result querying. The evaluation involved five instruction- assessed causal reasoning, with outputs evaluated via
tuned, open-weight LLMs from four major model families regex-based selection of “option 1” or “option 2,” using
ac(LLaMA, Qwen, Gemma, Mistral), all quantized to 4-bit curacy as the metric. For WiC, WSC, and BoolQ, boolean
precision to support inference on consumer-grade hard- outputs (“true” / “false”) were evaluated using standard
ware. To ensure reproducibility and control for random- accuracy, following minimal post-processing.
ness, temperature was fixed at zero during all inference In the ReCoRD task, which requires cloze-style
comruns. Full model specifications are reported in Table 1. pletion, models were prompted to reproduce the original
ground-truth sentence by correctly replacing a
placeTable 1 holder with the appropriate entity. A few-shot setup
Used Large Language Models with Quantization Details. was adopted to enhance consistency across predictions.</p>
          <p>BLEU scores [26] were used as an automatic metric to
Model Version Quantization quantify the similarity between generated and reference
sentences, capturing token-level variations introduced
by emotional modulation.</p>
          <p>Mistral
LLama 3.1
Qwen 2.5
Gemma 2</p>
          <p>LLama 3.2</p>
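          <p>To make the pipeline concrete, the sketch below shows how a single record could be framed under the four emotional conditions and sent to a locally served model through Ollama's REST API, with the raw output stored in MongoDB. This is a minimal illustration assuming default local endpoints; the framing phrases, model tag, and collection names are placeholders rather than the exact prompts and settings used in Alpha, which are published in the project repository [25].</p>
          <preformat>
import requests
from pymongo import MongoClient

# Illustrative emotional framings; the actual Alpha prompts are published in the repository [25].
FRAMINGS = {
    "apathy": "Answer the question.",
    "joy": "I am really excited about this! Please answer the question.",
    "anger": "This is frustrating. Just answer the question.",
    "fear": "I am worried about getting this wrong. Please answer the question.",
}

def run_record(model, task, record_id, question):
    collection = MongoClient("mongodb://localhost:27017")["alpha"]["results"]
    for emotion, framing in FRAMINGS.items():
        # Ollama's local REST endpoint; temperature fixed at zero for reproducibility.
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": framing + "\n\n" + question,
                "stream": False,
                "options": {"temperature": 0},
            },
            timeout=300,
        )
        response.raise_for_status()
        collection.insert_one({
            "task": task,
            "record_id": record_id,
            "model": model,
            "emotion": emotion,
            "output": response.json()["response"],
        })

# Example call (model tag is illustrative):
# run_record("llama3.1:8b-instruct-q4_0", "BoolQ", "boolq-0001", "Is the sky blue? Answer true or false.")
          </preformat>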
        </sec>
        <sec id="sec-2-2-3">
          <title>3.1.3. Data Preparation</title>
          <p>7B Instruct
8B Instruct
7B Instruct
9B Instruct
3B Instruct
Q4
Q4
Q4
Q4
Q4</p>
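          <p>As an illustration of the standardization step described above, the sketch below caps a JSONL task file at 500 records via random sampling with Pandas; the file path and random seed are assumptions, not values prescribed by the paper.</p>
          <preformat>
import pandas as pd

MAX_RECORDS = 500

def load_and_sample(path, seed=42):
    # SuperGLUE splits are distributed as JSON Lines: one record per line.
    df = pd.read_json(path, lines=True)
    # Cap each task at 500 records; smaller tasks (AX-g, CB, COPA) are kept in full.
    if len(df) > MAX_RECORDS:
        df = df.sample(n=MAX_RECORDS, random_state=seed)
    return df.reset_index(drop=True)

# Example: boolq = load_and_sample("BoolQ/val.jsonl")
          </preformat>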
        </sec>
        <sec id="sec-2-2-4">
          <title>3.1.4. Prompt Design and Evaluation Protocol</title>
          <p>For each dataset record, four variants of the emotional prompt were generated: apathy (intended as the baseline), joy, anger, and fear. All records were processed in all emotional conditions, ensuring exhaustive coverage and balanced comparison. Each task was associated with four prompts differing only in emotional framing, not in structure or semantics. Apathy served as the neutral baseline; emotional phrases were inserted to influence affective tone while keeping task wording consistent. Model outputs were evaluated using SuperGLUE's task-specific metrics, comparing performance across emotional prompt variants within and across tasks. The full set of prompts used in the Alpha experiment is publicly available in a dedicated GitHub repository [25]. This resource is provided to ensure transparency and facilitate reproducibility of our experimental framework.</p>
          <p>For CB, RTE, and AX-g, the precision of the entailment classification was calculated by matching the predicted labels ("entailment" / "not_entailment") using regex. COPA assessed causal reasoning, with outputs evaluated via regex-based selection of "option 1" or "option 2", using accuracy as the metric. For WiC, WSC, and BoolQ, boolean outputs ("true" / "false") were evaluated using standard accuracy, following minimal post-processing. In the ReCoRD task, which requires cloze-style completion, models were prompted to reproduce the original ground-truth sentence by correctly replacing a placeholder with the appropriate entity. A few-shot setup was adopted to enhance consistency across predictions. BLEU scores [26] were used as an automatic metric to quantify the similarity between generated and reference sentences, capturing token-level variations introduced by emotional modulation.</p>
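          <p>The sketch below illustrates the kind of regex-based label extraction and scoring described above; the exact patterns and post-processing used in Alpha may differ, so this should be read as an assumed, simplified version of the protocol.</p>
          <preformat>
import re
from statistics import mean

def extract_entailment(output):
    # CB / RTE / AX-g: look for an explicit entailment verdict in the model output.
    match = re.search(r"not[_ ]entailment|entailment", output.lower())
    return match.group(0).replace(" ", "_") if match else None

def extract_copa_choice(output):
    # COPA: select "option 1" or "option 2".
    match = re.search(r"option\s*[12]", output.lower())
    return re.sub(r"\s+", " ", match.group(0)) if match else None

def extract_boolean(output):
    # WiC / WSC / BoolQ: boolean answers after minimal post-processing.
    match = re.search(r"\btrue\b|\bfalse\b", output.lower())
    return match.group(0) if match else None

def accuracy(predictions, gold):
    return mean(1.0 if p == g else 0.0 for p, g in zip(predictions, gold))

# For ReCoRD, sentence-level BLEU can be computed with a library such as sacreBLEU:
# import sacrebleu; bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
          </preformat>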
      <sec id="sec-2-3">
        <title>3.2. Omega: Studying the Impact of</title>
      </sec>
      <sec id="sec-2-4">
        <title>Emotions on Human Interaction</title>
        <p>Experiment Omega investigates the effect of emotional
bias in AI-generated responses on user learning outcomes
and interaction perception. A web-based prototype was
developed, integrating four variants of the Velvet-14B
language model: three fine-tuned for joy, anger, and fear,
and one baseline variant representing apathy. The system
also includes a Retrieval-Augmented Generation (RAG)
component to deliver contextually relevant responses.</p>
        <sec id="sec-2-3-1">
          <title>3.2.1. Experimental Setup and Motivation</title>
          <p>The experiment was designed for a university context, targeting students attending a lecture on Artificial Intelligence. After the lecture, participants would be divided into four groups, each assigned to interact with a different emotionally biased variant of the model. During a subsequent comprehension test, students could consult their assigned model. Following the test, they would complete a Likert-scale [27] questionnaire assessing their experience and perception of the interaction.</p>
          <p>The primary goal was to determine whether emotionally biased language outputs influence both cognitive performance (measured by comprehension scores) and subjective user experience. Two types of data were collected: (1) quantitative performance on the test, and (2) qualitative feedback from the post-test questionnaire.</p>
          <p>Anonymized interaction logs from the conversational interface further support the analysis, offering insight into how different emotional tones affect engagement, performance, and perceived model utility.</p>
          <p>We adopted Velvet-14B as the base model for Experiment Omega due to its specialization in the Italian language. Developed with a focus on Italian linguistic and cultural contexts, Velvet-14B ensures better alignment with the comprehension and interaction patterns of native speakers, thereby enhancing the validity of emotion-conditioned generation in the targeted educational scenario.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>3.2.2. Training Data Preparation</title>
          <p>The emotional variants of Velvet-14B were fine-tuned using the MELD dataset [28], which includes dialogues annotated with emotion labels. Three distinct variants were created for joy, anger, and fear, as in Experiment Alpha (see Section 3.1.1). The "apathy" variant corresponds to the baseline, non-fine-tuned Velvet-14B model.</p>
          <p>While MELD is originally in English, we adopted a multi-step translation pipeline to ensure that the resulting dialogues preserved the intended emotional nuance. First, we fine-tuned Gemma 2 9B to generate emotionally aligned dialogues in English. These dialogues were then translated into Italian using the same Gemma 2 9B model and post-edited manually to ensure idiomatic correctness and emotional fidelity. We acknowledge the absence of a standardized Italian emotional dialogue dataset and recognize that this translation pipeline introduces an additional layer of abstraction. However, it allowed us to generate a linguistically and emotionally coherent training corpus suited to the Italian-speaking participants targeted by Experiment Omega.</p>
          <p>Due to MELD's limited size, data augmentation was applied using the Gemma 2 9B model, which generated additional dialogues preserving emotional nuance. This process yielded 1,200 dialogues (300 per emotion), each consisting of 10 conversational turns, all translated into Italian. Although minor issues with literal translation were observed, the resulting 12,000 utterances formed a robust training dataset. Gemma 2 9B was selected for its superior performance in emotional prompt handling and its instruction-tuned, open-weight nature [29], making it suitable for consistent and affect-rich synthetic data generation.</p>
          <p>To validate the emotional bias injection, 100 general-purpose prompts were used to compare outputs from the base model and the emotional variants. Responses were manually annotated for emotional alignment, confirming the effectiveness of the fine-tuning procedure.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>3.2.3. Fine-Tuning Procedure and Emotional Bias Injection</title>
          <p>Fine-tuning targeted dialogue generation, with the objective of aligning the model's output tone with the intended emotion (joy, anger, fear). No classification objective was used. The training target was the next utterance in a 10-turn dialogue, conditioned on prior context and intended emotion. Fine-tuning was conducted using LoRA (Low-Rank Adaptation) [30], which enables efficient training of large models on consumer-grade hardware. LoRA introduces learnable low-rank matrices for each weight matrix in the base model; only these matrices are updated during training, and they are applied as a linear transformation during inference to condition outputs. The Hugging Face PEFT library [31] was used to implement LoRA, targeting the query and value projection modules of Velvet-14B.</p>
          <p>The fine-tuning pipeline begins with data tokenization, followed by loading Velvet-14B with the LoRA adapter. Training resumes from the latest checkpoint, or starts from scratch if none is found. Models and tokenizers are periodically saved. Across all variants, training showed stable convergence, with all models reaching optimal performance within 0.5 epochs—well before the 2-epoch limit. Best-performing checkpoints were consistently obtained between steps 20 and 30.</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>3.2.4. Web Application and Interaction Framework</title>
          <p>A custom web application was developed to facilitate user interaction with the fine-tuned models. The system comprises a Streamlit-based frontend, a FastAPI backend, and a Milvus vector database supporting RAG. The frontend, built with Streamlit, simplifies interface development by translating Python into React components. The backend handles real-time messaging and contextual prompt construction, creating a seamless ChatGPT-like experience.</p>
          <p>To support retrieval, text is embedded using the intfloat/multilingual-e5-base model [32], optimized for multilingual retrieval tasks. The model distinguishes queries and documents using prefixed prompts ("query:", "passage:"), improving asymmetric retrieval performance. Its balance between performance and efficiency makes it suitable for production environments without specialized hardware.</p>
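          <p>The sketch below illustrates the asymmetric query/passage prefixing used by the E5 family for retrieval; the example texts are placeholders and not taken from the lecture material.</p>
          <preformat>
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 expects asymmetric prefixes: "query: " for user questions, "passage: " for documents.
query = "query: Che cos'è la retropropagazione?"
passages = [
    "passage: La retropropagazione calcola i gradienti dei pesi di una rete neurale.",
    "passage: Il percettrone è un classificatore lineare a soglia.",
]

# normalize_embeddings=True makes the dot product equal to the cosine similarity.
query_emb = encoder.encode(query, normalize_embeddings=True)
passage_embs = encoder.encode(passages, normalize_embeddings=True)
scores = util.cos_sim(query_emb, passage_embs)  # shape: (1, number of passages)
best_passage = passages[int(scores.argmax())]
          </preformat>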
          <p>The RAG component retrieves short academic passages relevant to the user query (e.g., definitions, concepts, and examples from the lecture material), which are then prepended to the prompt. The goal is not to alter the emotional framing, but to anchor the response in topical knowledge. This contextual grounding ensures that emotional variation does not come at the expense of content relevance or factuality—especially important in educational settings.</p>
          <p>RAG operates in two stages: cosine similarity retrieval and score normalization. Due to the contrastive-learning temperature (τ = 0.1), cosine scores are highly concentrated in the [0.7, 1] range. A test using 50 unrelated queries confirmed this narrow distribution (Figure 1), which justifies the application of standard score normalization.</p>
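          <p>As a small illustration of the normalization stage, the retrieved cosine scores can be standardized before thresholding, as sketched below; the cut-off value is an assumption.</p>
          <preformat>
import numpy as np

def standardize(cosine_scores):
    # Standard-score (z-score) normalization of the narrowly distributed cosine similarities.
    scores = np.asarray(cosine_scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Example: keep only the chunks whose standardized score exceeds an assumed cut-off.
# selected = [chunk for chunk, z in zip(chunks, standardize(scores)) if z > 0.5]
          </preformat>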
          <p>The system workflow starts when the user submits a query via the frontend; the query is processed by the backend together with the contextual history. Relevant chunks are retrieved from the Milvus database and appended to the prompt before it is passed to the appropriate emotional model. The response is generated and returned through the backend to the user interface.</p>
        <sec id="sec-2-5-1">
          <title>3.2.5. Social Experiment</title>
          <p>A social experiment was fully designed to evaluate the
impact of emotional bias in an educational setting.</p>
          <p>Participating students would be randomly assigned
to one of four model variants: apathy (baseline), joy,
anger, or fear. Following a lecture, students would take a
multiple-choice comprehension test (single and multiple
answers), with model assistance allowed during the test.</p>
          <p>Performance would be assessed via accuracy metrics
per group. In parallel, a post-test Likert-scale
questionnaire would collect subjective feedback on interaction
quality, clarity of responses, and perceived helpfulness.</p>
          <p>The study was designed to offer both objective and
subjective insights into the effects of emotionally biased
LLMs in educational environments. If implemented, it
would have provided valuable data to complement the
Alpha experiment, contributing to a broader understanding
of emotion in human-AI interaction.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <p>This section reports the findings from the Alpha and
Omega experiments, which examine the bidirectional
role of emotions in human–LLM interaction:
user-to-model (Alpha) and model-to-user (Omega).
Empirical results show that emotionally biased prompts,
despite constant semantic content, impact model
performance. Prompts conveying joy yield the highest average
accuracy across tasks and models (58.08%), while those
expressing fear perform worst (53.60%), with a 4.5pp
performance gap. This confirms that emotional tone, even
in subtle prompt variations, can measurably affect output
quality.</p>
      <p>Effect sizes were evaluated using Cohen's d, given the small sample sizes. Pairwise comparisons across emotions (e.g., joy vs. fear: d = 0.1771) revealed small yet meaningful differences, with joy consistently outperforming fear and anger. All comparisons employed the pooled standard deviation for normalization. Full results are visualized in Figure 2.</p>
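      <p>For reference, the effect sizes reported here follow the standard pooled-standard-deviation form of Cohen's d, as sketched below.</p>
      <preformat>
import numpy as np

def cohens_d(a, b):
    # Effect size between two emotion conditions using the pooled standard deviation.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Example: d = cohens_d(accuracy_per_task["joy"], accuracy_per_task["fear"])
      </preformat>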
      <p>To better illustrate these trends, we report detailed
task-level performance across models and emotional
conditions in Tables 2–9. Each table shows accuracy (or
BLEU score for ReCoRD) across five LLMs for a given task,
grouped by emotional prompt variant. The final
cross-task summary (Table 9) aggregates mean performance,
confirming that prompts expressing joy consistently lead
to higher scores across models and tasks, while fear yields
the lowest. While LLMs exhibit general robustness to
emotional modulation, these results highlight that even
minor emotional perturbations can shift performance
outcomes in systematic ways.</p>
      <sec id="sec-3-1">
        <title>4.2. Omega: Emotional Influence from</title>
      </sec>
      <sec id="sec-3-2">
        <title>Model to User</title>
        <p>To assess reverse emotional impact, we fine-tuned
Velvet-14B via LoRA on joy-, anger-, and fear-labeled corpora.</p>
        <p>Each variant was tested on 100 GPT-4o-generated abstract prompts. Responses were manually annotated for the presence of the target emotional tone using a binary function that maps each response to {0, 1}, yielding emotional bias scores. The annotation process was executed following specific tagging rules:</p>
        <list list-type="bullet">
          <list-item>
            <p>Joy: 1 if, and only if, the response exhibits a warm, reassuring tone conveying joy or a generally positive mood, else 0.</p>
          </list-item>
          <list-item>
            <p>Anger: 1 if, and only if, the response has a heated, blunt tone expressing anger, directness, or aggressiveness, else 0.</p>
          </list-item>
          <list-item>
            <p>Fear: 1 if, and only if, the response displays a gloomy or sad tone expressing fear, worry, insecurity, or sadness, else 0.</p>
          </list-item>
        </list>
        <p>Results indicate successful emotional conditioning: the joy-biased model showed a +19% emotional expression rate, anger +8%, and fear +6% (Figure 3). Notably, emotional bias affected not only tone but also content, especially in philosophical responses—despite no overlap with training data. This implies that emotion-conditioned fine-tuning influences the model's latent representations in a generalizable way.</p>
        <fig id="fig3">
          <label>Figure 3</label>
          <caption>
            <p>Emotion Bias Detection Results: Baseline vs Fine-Tuned model.</p>
          </caption>
        </fig>
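        <p>A minimal sketch of how these binary annotations translate into the reported emotional-expression rates (e.g., the +19% figure for joy) is shown below; variable names are illustrative.</p>
        <preformat>
def expression_rate(tags):
    # Fraction of the 100 annotated responses tagged 1 for the target emotion.
    return sum(tags) / len(tags)

# Bias introduced by fine-tuning: rate of the emotional variant minus the baseline rate.
# joy_bias = expression_rate(joy_model_tags) - expression_rate(baseline_tags)   # about +0.19
        </preformat>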
        <p>Although the full Omega experiment was not deployed to end users, the underlying framework is fully designed and ready for implementation. Deployment was constrained by practical limitations: supporting real-time LLM interaction for a full classroom cohort required non-trivial infrastructure, including API routing, authentication, and persistent session management, and the associated operational costs exceeded our available budget. Nevertheless, we validated the framework's core component—emotion-conditioned generation—by quantifying the degree of emotional bias introduced during fine-tuning, thus laying the groundwork for future user-facing trials.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion and Conclusions</title>
      <p>This work presents a dual experimental framework to investigate the bidirectional role of emotion in human–LLM interactions. In Experiment Alpha, we showed that emotional tone in prompts—without altering semantic content—impacts model performance. Prompts expressing joy and apathy outperformed those conveying anger or fear, suggesting that LLMs are sensitive to affective framing. This may stem from emotional mirroring effects in pretrained embeddings or from improved clarity in emotionally positive formulations. The observed alignment with Ekman's model, particularly the behavioral opposition of joy and fear, supports the hypothesis that LLMs encode structured affective representations.</p>
      <p>Experiment Omega further supports this claim from the reverse direction. While user-centered evaluation was deferred, the fine-tuned Velvet-14B variants (via LoRA) exhibited measurable emotional bias (+19% joy), despite training on synthetic dialogues and lacking explicit emotion labels. This demonstrates the feasibility of lightweight, emotion-targeted fine-tuning for steering LLM responses. We acknowledge the use of translated synthetic dialogues in lieu of a native Italian emotional corpus as a limitation; future work will explore emotion annotation on native Italian corpora to reduce potential translation artifacts.</p>
      <p>These findings carry three key implications. First, emotion in language modulates LLM behavior and is not merely decorative. Second, emotional conditioning can be engineered efficiently through prompt design or fine-tuning. Third, affect-aware models have potential in user-facing applications where tone impacts trust, clarity, or engagement.</p>
      <p>Limitations include the restricted emotion set, the lack of dimensional affect modeling, the handcrafted prompt design, and the absence of direct human evaluation in Omega. Future work will address these by adopting valence–arousal models and conducting user studies to assess perception, comprehension, and long-term effects. Moreover, as noted above, the reliance on translated synthetic dialogues rather than a native Italian emotional corpus remains a limitation that natively annotated Italian data should address.</p>
      <p>One noteworthy limitation of this dual-experiment framework lies in its linguistic asymmetry: Experiment Alpha is conducted entirely in English, leveraging the SuperGLUE benchmark, while Experiment Omega is designed for Italian-speaking users in an educational setting. Although this choice is contextually motivated—Alpha prioritizes benchmark compatibility and Omega emphasizes ecological validity in the Italian academic environment—it introduces a gap in linguistic continuity that hinders direct comparison and limits claims of generalizability. Emotional framing and perception can be language-dependent due to differences in affective semantics, pragmatics, and cultural connotations. This language asymmetry currently limits direct comparisons between Alpha and Omega. While each experiment was designed to maximize contextual validity—English for standardized benchmarks, Italian for real-world educational use—we recognize the challenge this poses for unified interpretation. A key goal for future work is to harmonize both experiments in a shared linguistic setting, allowing more robust cross-experiment generalization.</p>
      <p>Overall, this study lays the groundwork for integrating emotion as a first-class variable in language-based AI systems. Responsible use of emotion-aware techniques could enable more effective, human-aligned, and context-sensitive interactions across a range of applications.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) for paraphrasing, rewording, and formatting assistance. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <year>2020</year>
          .eval4nlp-
          <fpage>1</fpage>
          .
          <fpage>10</fpage>
          . [30]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiebe</surname>
          </string-name>
          , T. Wilson,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          , Annotating expres- S. Wang,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Lora: Low-rank adap-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>sions of opinions and emotions in language</article-title>
          , Lan- tation
          <source>of large language models</source>
          ,
          <year>2021</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>guage Resources and Evaluation</source>
          <volume>39</volume>
          (
          <year>2005</year>
          )
          <fpage>165</fpage>
          -
          <lpage>210</lpage>
          . //arxiv.org/abs/2106.09685. arXiv:
          <volume>2106</volume>
          .
          <fpage>09685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>doi:10.1007/s10579-005-7880-9</source>
          . [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          , S.-
          <string-name>
            <given-names>Z. J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Utilizing large language models with causal Parameter-eficient fine-tuning methods for pre-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>pathic dialogue generation</article-title>
          ,
          <source>in: 2025 IEEE 15th assessment</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Annual Computing and Communication Workshop</source>
          <volume>12148</volume>
          . arXiv:
          <volume>2312</volume>
          .
          <fpage>12148</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>and Conference (CCWC)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>00103</fpage>
          -
          <lpage>00109</lpage>
          . [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>doi:10.1109/CCWC62904</source>
          .
          <year>2025</year>
          .10903745.
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Multilingual e5 text embeddings: A technical [</article-title>
          23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          , report,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.05672.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          , Measuring arXiv:
          <volume>2402</volume>
          .
          <fpage>05672</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2021. URL: https://arxiv.org/abs/
          <year>2009</year>
          .03300.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          arXiv:
          <year>2009</year>
          .
          <volume>03300</volume>
          . [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>tic evaluation of language models</article-title>
          ,
          <year>2023</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          //arxiv.org/abs/2211.09110. arXiv:
          <volume>2211</volume>
          .
          <fpage>09110</fpage>
          . [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gozzi</surname>
          </string-name>
          , Bidirectional emotional influ-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>repository, https://github.com/gozus19p/</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>Emotional-Influence-in-Human-</article-title>
          <string-name>
            <surname>LLM</surname>
          </string-name>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Accessed</surname>
          </string-name>
          :
          <fpage>2025</fpage>
          -07-23. [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu, Bleu:
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>phia</surname>
          </string-name>
          , Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          https://aclanthology.org/P02-1040/. doi:
          <volume>10</volume>
          .3115/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          1073083.
          <fpage>1073135</fpage>
          . [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Likert</surname>
          </string-name>
          ,
          <article-title>A technique for the measurement of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>attitudes</surname>
          </string-name>
          ,
          <source>Archives of Psychology</source>
          <volume>140</volume>
          (
          <year>1932</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          . [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Poria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hazarika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Naik</surname>
          </string-name>
          , E. Cam-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>tions</surname>
          </string-name>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1810</year>
          .02508.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          arXiv:
          <year>1810</year>
          .
          <volume>02508</volume>
          . [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Maio</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>task vs</article-title>
          .
          <source>multitask prompts, Electronics</source>
          <volume>13</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>URL: https://www.mdpi.com/2079-9292/13/23/4712.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>doi:10</source>
          .3390/electronics13234712.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>