<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Bidirectional Emotional Influence in Human-LLM Interaction: Empirical Analysis and Methodological Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Gozzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fallucchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering Sciences, Guglielmo Marconi University</institution>
          ,
          <addr-line>00193 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Institute for Educational Media, Georg Eckert Institute</institution>
          ,
          <addr-line>Freisestraße 1, 38118 Braunschweig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leithà - Unipol Group</institution>
          ,
          <addr-line>40128 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent advances in natural language processing have highlighted the potential of Large Language Models (LLMs) to adapt to diverse communicative contexts, yet their sensitivity to emotional framing remains underexplored. Prior work has examined stylistic adaptation and sentiment control, but limited attention has been paid to how emotional tone in prompts influences both model behavior and human interpretation. We investigate the role of emotional tone in shaping interactions between humans and LLMs, with a focus on model performance and user perception. We propose a dual-experiment setup: (1) Experiment Alpha evaluates how emotional prompt framing (joy, apathy, anger, fear) impacts LLM performance across SuperGLUE tasks; (2) Experiment Omega introduces a validated experimental framework to study how emotion-conditioned LLM responses affect human comprehension and perception, within an educational setting involving Italian-speaking participants. The Alpha results show that prompts framed with joy and apathy lead to better task performance, with gains of up to 4.48 percentage points. In Omega, fine-tuned models generated a 19% increase in joy-aligned responses, demonstrating the feasibility of affect-conditioned generation. These findings suggest promising applications for emotion-aware LLMs in education, virtual assistants, and affective computing.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Affective Computing</kwd>
        <kwd>Human-Computer Interaction</kwd>
        <kwd>Fine-Tuning</kwd>
        <kwd>Emotion-conditioned Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Experiment Omega addresses emotionally expressive LLM outputs in educational settings. Although Omega has not yet been deployed to end users, the infrastructure and the corresponding fine-tuned Velvet-14B model [6] variants were developed and evaluated for emotion-conditioned generation.</p>
      <p>Our central hypothesis challenges the assumption that emotional neutrality is optimal for task performance or user engagement. Instead, we posit that emotionally charged inputs may better align with the model's training distribution and that expressive outputs could enhance user trust, attention, and retention, particularly in pedagogical or assistive applications.</p>
      <p>The main contributions of this paper are:</p>
      <list list-type="order">
        <list-item>
          <p>A controlled dual-experiment design that quantifies the influence of emotional tone in both prompts and responses.</p>
        </list-item>
        <list-item>
          <p>Empirical evidence of how LLM performance varies across emotional conditions in the input.</p>
        </list-item>
        <list-item>
          <p>A validated experimental framework and a set of fine-tuned model variants to support future research on emotion-conditioned human-LLM interaction.</p>
        </list-item>
      </list>
      <p>The paper is organized as follows. Section 2 reviews previous work on emotion-aware language models and affective computing. Section 3 details our dual-experiment methodology, comprising the Alpha experiment on prompt-induced emotional effects in LLMs and the Omega framework for studying emotion-conditioned model outputs in educational settings. Section 4 presents the results of both experiments, followed by a discussion of their implications in Section 5. We conclude by outlining future directions for emotion-sensitive human–LLM interaction research.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background</title>
      <p>Recent work in NLP and affective computing has explored how LLMs respond to emotionally charged prompts. Studies indicate that affective signals in prompts can influence both the emotional tone and the performance of LLMs on tasks [1]. However, the mechanisms underlying these effects remain debated: do LLMs genuinely process emotional content, or do they merely simulate it through pattern matching?</p>
      <p>LLMs have shown competence in tasks involving affect recognition and empathy simulation, but limitations persist in emotional consistency, intensity calibration, and sensitivity to subtle cues [2]. Psychometric assessments suggest that models like GPT-4 can match or exceed human baselines in specific affect recognition benchmarks [3], though this performance likely reflects lexical-semantic association rather than experiential comprehension.</p>
      <p>As LLMs are increasingly deployed in emotionally sensitive domains (education, therapy, virtual assistance), understanding their affective capabilities is critical. Effective HCI depends not only on semantic accuracy but also on the model's ability to handle emotional context in a way that promotes trust and cognitive alignment [4].</p>
      <p>An important line of research has investigated the capacity of dialogue models to recognize and respond to users' emotions in a contextually appropriate manner. Rashkin et al. introduced the EmpatheticDialogues dataset, a collection of 25,000 emotionally grounded conversations designed to foster empathetic behavior in AI systems [7]. Their findings demonstrate that models fine-tuned on this resource are rated as more empathetic by human evaluators compared to those trained on generic conversational corpora. This underscores the limitations of large-scale pretraining alone in achieving affect-sensitive generation, and the value of explicit emotion supervision. While EmpatheticDialogues targets open-domain, affectively grounded dialogue, our work complements it by focusing on bidirectional affective influence in cognitively demanding contexts—modeling not only empathetic output but also how emotion-laden prompts modulate reasoning and how emotional responses impact user cognition and perception.</p>
      <p>Despite growing interest, few studies have quantified how different emotional tones in prompts affect model performance across standard NLP benchmarks. Similarly, the downstream effects of emotionally biased LLM responses on user cognition and perception, especially in open-ended, educational tasks, remain largely unexplored. Moreover, most prior work treats emotional content as stylistic variation rather than as a variable with measurable cognitive or perceptual impact. Our study addresses these gaps through two contributions: an empirical evaluation of how affect-laden prompts (joy, apathy, anger, fear) modulate LLM performance on SuperGLUE tasks, and a validated experimental framework for jointly assessing the perceptual and cognitive impact of emotion-conditioned LLM responses in user-facing tasks.</p>
      <p>These contributions are grounded in the understanding that, while LLMs do not possess experiential or affective grounding, their behavior can still reflect and amplify affective patterns learned from the data. LLMs operate through statistical association rather than emotional understanding: building on distributional semantics [8], they learn affective language patterns by processing massive text corpora and encoding them into high-dimensional vector spaces. Although emotionally connoted groups can be identified through methods such as PCA, UMAP, or probing techniques [9], these do not imply affective grounding. Unlike humans, who integrate symbolic reasoning with embodied emotional experience, LLMs infer meaning through probabilistic pattern recognition. As such, emotional fluency in model output reflects learned correlations, not genuine affect. This gap has implications for design, interpretation, and ethical use in emotionally charged contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>This study quantifies the bidirectional impact of
emotions on human–LLM interactions through two
experiments. Alpha examines how emotional framing in user
prompts affects LLM performance on reasoning tasks,
while Omega investigates how emotionally biased LLM
responses influence human decision-making.</p>
      <p>
        The Alpha experiment was conducted in English because the SuperGLUE
datasets are publicly available with predefined English ground truths, and we
designed and executed Alpha on those. The choice of language is not central
here; the key point is to analyze the effect that emotions have on performance.
Prompting in English is also generally good practice, as it helps avoid biases
against lower-resource languages [
        <xref ref-type="bibr" rid="ref1 ref26 ref4 ref8">10, 11</xref>
        ].
      </p>
      <p>Experiment Omega was designed in Italian to align
with the linguistic context of the educational setting
under investigation. This choice ensures ecological validity,
as it reflects the actual language used by students and
instructors in the targeted learning environment, thereby
enabling a more accurate assessment of comprehension
and affective perception in real-world conditions.</p>
      <sec id="sec-2-1">
        <title>3.1. Alpha: Analyzing the Impact of</title>
      </sec>
      <sec id="sec-2-2">
        <title>Emotions on Machine Performance</title>
        <p>This experiment investigates how emotional framing in
user prompts affects the performance of LLMs on
advanced language understanding tasks. By systematically
modulating the emotional tone of inputs across a subset
of SuperGLUE tasks, we aim to quantify the extent to
which LLM behavior is sensitive to affective cues. The
following subsections describe the experimental design,
implementation, data preparation, and evaluation
protocol.</p>
        <sec id="sec-2-2-1">
          <title>3.1.1. Experimental Design</title>
          <p>Experiment Alpha uses four emotional conditions to
frame user prompts, based on three of Ekman’s six basic
emotions [12] (joy, anger, and fear) plus a neutral
condition representing apathy, which serves as the baseline.
We introduce “apathy” not as a basic emotion, but as a
control condition meant to simulate emotionally neutral
or emotionally flat interaction. In this context, apathy
does not refer to the clinical absence of emotion, but to
a dispassionate tone that serves as a baseline. This
emotion set was designed to balance interpretability with
experimental feasibility, and should be considered a
pragmatic approximation rather than a strict adherence to
Ekman’s taxonomy. Joy, anger, and fear were selected
due to their universality and distinct valence and
activation profiles: joy as a positively valenced affect, anger
as a defense-oriented negative emotion, and fear as an
avoidance-oriented negative emotion. Their inclusion
allows testing both the valence and motivational
dimensions of affect in model reasoning under semantically
equivalent instructions.</p>
          <p>The experiment is grounded in SuperGLUE, a
benchmark designed to assess higher-order language
understanding capabilities such as inference, reasoning, and
contextual comprehension, dimensions that are
hypothesized to be particularly sensitive to emotional
modulation. A subset of eight tasks was selected based on
coverage and structural diversity: BoolQ (Boolean
Question Answering) [13], CB (CommitmentBank) [14], COPA
(Choice of Plausible Alternatives) [15], MultiRC
(MultiSentence Reading Comprehension) [16], ReCoRD
(Reading Comprehension with Commonsense Reasoning) [17],
WiC (Words in Context) [18], WSC (Winograd Schema
Challenge) [19], and RTE (Recognizing Textual
Entailment) [20]. These tasks span competencies including
entailment, causality, multi-sentence comprehension, and
word sense disambiguation. The mentioned eight
SuperGLUE tasks were chosen due to their reliance on nuanced
reasoning, contextual inference, and linguistic
ambiguity—dimensions where emotional framing can modulate
interpretive biases. Entailment tasks such as RTE and
CB require readers (or models) to assess whether a
hypothesis logically follows from a premise. Prior work
has shown that emotional salience can shape these
judgments by modulating perceived relevance or certainty
of the statements involved [21]. COPA tasks depend on
evaluating the most plausible cause or effect in a given
scenario. Emotions are known to modulate causal
reasoning, altering perceived plausibility by priming certain
associations or cognitive shortcuts [22].</p>
          <p>Alternative benchmarks, such as MMLU (Massive
Multitask Language Understanding) [23] and HELM (Holistic
Evaluation of Language Models) [24], were considered
but ultimately excluded. MMLU, while comprehensive,
focuses primarily on multiple-choice knowledge
questions; HELM emphasizes fairness and safety metrics.
Neither aligns well with our focus on fine-grained linguistic
interactions shaped by emotion. SuperGLUE, by contrast,
offers task types and input structures better suited to
capturing affect-sensitive model behavior.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>3.1.2. Implementation and Runtime Environment</title>
          <p>For each data set record, four variants of emotional
prompts were generated: apathy (intended as the
baseline), joy, anger, and fear. All records were processed in
all emotional conditions, ensuring exhaustive coverage repository [25]. This resource is provided to ensure
transand balanced comparison. parency and facilitate reproducibility of our experimental</p>
          <p>Model inference was performed locally using Ollama, framework.
with results stored in a MongoDB database. The pipeline For CB, RTE, and AX-g, the precision of the entailment
was implemented as a Python CLI application, aiming to classification was calculated by matching the predicted
lasupport full automation, reproducibility, and structured bels (“entailment” / “not_entailment”) using regex. COPA
result querying. The evaluation involved five instruction- assessed causal reasoning, with outputs evaluated via
tuned, open-weight LLMs from four major model families regex-based selection of “option 1” or “option 2,” using
ac(LLaMA, Qwen, Gemma, Mistral), all quantized to 4-bit curacy as the metric. For WiC, WSC, and BoolQ, boolean
precision to support inference on consumer-grade hard- outputs (“true” / “false”) were evaluated using standard
ware. To ensure reproducibility and control for random- accuracy, following minimal post-processing.
ness, temperature was fixed at zero during all inference In the ReCoRD task, which requires cloze-style
comruns. Full model specifications are reported in Table 1. pletion, models were prompted to reproduce the original
ground-truth sentence by correctly replacing a
placeTable 1 holder with the appropriate entity. A few-shot setup
Used Large Language Models with Quantization Details. was adopted to enhance consistency across predictions.</p>
          <p>BLEU scores [26] were used as an automatic metric to
Model Version Quantization quantify the similarity between generated and reference
sentences, capturing token-level variations introduced
by emotional modulation.</p>
          <p>Mistral
LLama 3.1
Qwen 2.5
Gemma 2</p>
          <p>LLama 3.2</p>
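          <p>To make the pipeline concrete, the sketch below shows how a single record could be framed under the four emotional conditions and sent to a locally served model through Ollama's REST API, with the raw output stored in MongoDB. This is a minimal illustration assuming default local endpoints; the framing phrases, model tag, and collection names are placeholders rather than the exact prompts and settings used in Alpha, which are published in the project repository [25].</p>
          <preformat>
import requests
from pymongo import MongoClient

# Illustrative emotional framings; the actual Alpha prompts are published in the repository [25].
FRAMINGS = {
    "apathy": "Answer the question.",
    "joy": "I am really excited about this! Please answer the question.",
    "anger": "This is frustrating. Just answer the question.",
    "fear": "I am worried about getting this wrong. Please answer the question.",
}

def run_record(model, task, record_id, question):
    collection = MongoClient("mongodb://localhost:27017")["alpha"]["results"]
    for emotion, framing in FRAMINGS.items():
        # Ollama's local REST endpoint; temperature fixed at zero for reproducibility.
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": framing + "\n\n" + question,
                "stream": False,
                "options": {"temperature": 0},
            },
            timeout=300,
        )
        response.raise_for_status()
        collection.insert_one({
            "task": task,
            "record_id": record_id,
            "model": model,
            "emotion": emotion,
            "output": response.json()["response"],
        })

# Example call (model tag is illustrative):
# run_record("llama3.1:8b-instruct-q4_0", "BoolQ", "boolq-0001", "Is the sky blue? Answer true or false.")
          </preformat>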
        </sec>
        <sec id="sec-2-2-3">
          <title>3.1.3. Data Preparation</title>
          <p>7B Instruct
8B Instruct
7B Instruct
9B Instruct
3B Instruct
Q4
Q4
Q4
Q4
Q4</p>
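          <p>As an illustration of the standardization step described above, the sketch below caps a JSONL task file at 500 records via random sampling with Pandas; the file path and random seed are assumptions, not values prescribed by the paper.</p>
          <preformat>
import pandas as pd

MAX_RECORDS = 500

def load_and_sample(path, seed=42):
    # SuperGLUE splits are distributed as JSON Lines: one record per line.
    df = pd.read_json(path, lines=True)
    # Cap each task at 500 records; smaller tasks (AX-g, CB, COPA) are kept in full.
    if len(df) > MAX_RECORDS:
        df = df.sample(n=MAX_RECORDS, random_state=seed)
    return df.reset_index(drop=True)

# Example: boolq = load_and_sample("BoolQ/val.jsonl")
          </preformat>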
        </sec>
        <sec id="sec-2-2-4">
          <title>3.1.4. Prompt Design and Evaluation Protocol</title>
          <p>For each dataset record, four variants of the emotional prompt were generated: apathy (intended as the baseline), joy, anger, and fear. All records were processed in all emotional conditions, ensuring exhaustive coverage and balanced comparison. Each task was associated with four prompts differing only in emotional framing, not in structure or semantics. Apathy served as the neutral baseline; emotional phrases were inserted to influence affective tone while keeping task wording consistent. Model outputs were evaluated using SuperGLUE's task-specific metrics, comparing performance across emotional prompt variants within and across tasks. The full set of prompts used in the Alpha experiment is publicly available in a dedicated GitHub repository [25]. This resource is provided to ensure transparency and facilitate reproducibility of our experimental framework.</p>
          <p>For CB, RTE, and AX-g, the precision of the entailment classification was calculated by matching the predicted labels ("entailment" / "not_entailment") using regex. COPA assessed causal reasoning, with outputs evaluated via regex-based selection of "option 1" or "option 2", using accuracy as the metric. For WiC, WSC, and BoolQ, boolean outputs ("true" / "false") were evaluated using standard accuracy, following minimal post-processing. In the ReCoRD task, which requires cloze-style completion, models were prompted to reproduce the original ground-truth sentence by correctly replacing a placeholder with the appropriate entity. A few-shot setup was adopted to enhance consistency across predictions. BLEU scores [26] were used as an automatic metric to quantify the similarity between generated and reference sentences, capturing token-level variations introduced by emotional modulation.</p>
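          <p>The sketch below illustrates the kind of regex-based label extraction and scoring described above; the exact patterns and post-processing used in Alpha may differ, so this should be read as an assumed, simplified version of the protocol.</p>
          <preformat>
import re
from statistics import mean

def extract_entailment(output):
    # CB / RTE / AX-g: look for an explicit entailment verdict in the model output.
    match = re.search(r"not[_ ]entailment|entailment", output.lower())
    return match.group(0).replace(" ", "_") if match else None

def extract_copa_choice(output):
    # COPA: select "option 1" or "option 2".
    match = re.search(r"option\s*[12]", output.lower())
    return re.sub(r"\s+", " ", match.group(0)) if match else None

def extract_boolean(output):
    # WiC / WSC / BoolQ: boolean answers after minimal post-processing.
    match = re.search(r"\btrue\b|\bfalse\b", output.lower())
    return match.group(0) if match else None

def accuracy(predictions, gold):
    return mean(1.0 if p == g else 0.0 for p, g in zip(predictions, gold))

# For ReCoRD, sentence-level BLEU can be computed with a library such as sacreBLEU:
# import sacrebleu; bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
          </preformat>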
      <sec id="sec-2-3">
        <title>3.2. Omega: Studying the Impact of</title>
      </sec>
      <sec id="sec-2-4">
        <title>Emotions on Human Interaction</title>
        <p>Experiment Omega investigates the effect of emotional
bias in AI-generated responses on user learning outcomes
and interaction perception. A web-based prototype was
developed, integrating four variants of the Velvet-14B
language model: three fine-tuned for joy, anger, and fear,
and one baseline variant representing apathy. The system
also includes a Retrieval-Augmented Generation (RAG)
component to deliver contextually relevant responses.</p>
        <sec id="sec-2-3-1">
          <title>3.2.1. Experimental Setup and Motivation</title>
          <p>The experiment was designed for a university context, targeting students attending a lecture on Artificial Intelligence. After the lecture, participants would be divided into four groups, each assigned to interact with a different emotionally biased variant of the model. During a subsequent comprehension test, students could consult their assigned model. Following the test, they would complete a Likert-scale [27] questionnaire assessing their experience and perception of the interaction.</p>
          <p>The primary goal was to determine whether emotionally biased language outputs influence both cognitive performance (measured by comprehension scores) and subjective user experience. Two types of data were collected: (1) quantitative performance on the test, and (2) qualitative feedback from the post-test questionnaire.</p>
          <p>Anonymized interaction logs from the conversational interface further support the analysis, offering insight into how different emotional tones affect engagement, performance, and perceived model utility.</p>
          <p>We adopted Velvet-14B as the base model for Experiment Omega due to its specialization in the Italian language. Developed with a focus on Italian linguistic and cultural contexts, Velvet-14B ensures better alignment with the comprehension and interaction patterns of native speakers, thereby enhancing the validity of emotion-conditioned generation in the targeted educational scenario.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>3.2.2. Training Data Preparation</title>
          <p>The emotional variants of Velvet-14B were fine-tuned using the MELD dataset [28], which includes dialogues annotated with emotion labels. Three distinct variants were created for joy, anger, and fear, as in Experiment Alpha (see Section 3.1.1). The "apathy" variant corresponds to the baseline, non-fine-tuned Velvet-14B model.</p>
          <p>While MELD is originally in English, we adopted a multi-step translation pipeline to ensure that the resulting dialogues preserved the intended emotional nuance. First, we fine-tuned Gemma 2 9B to generate emotionally aligned dialogues in English. These dialogues were then translated into Italian using the same Gemma 2 9B model and post-edited manually to ensure idiomatic correctness and emotional fidelity. We acknowledge the absence of a standardized Italian emotional dialogue dataset and recognize that this translation pipeline introduces an additional layer of abstraction. However, it allowed us to generate a linguistically and emotionally coherent training corpus suited to the Italian-speaking participants targeted by Experiment Omega.</p>
          <p>Due to MELD's limited size, data augmentation was applied using the Gemma 2 9B model, which generated additional dialogues preserving emotional nuance. This process yielded 1,200 dialogues (300 per emotion), each consisting of 10 conversational turns, all translated into Italian. Although minor issues with literal translation were observed, the resulting 12,000 utterances formed a robust training dataset. Gemma 2 9B was selected for its superior performance in emotional prompt handling and its instruction-tuned, open-weight nature [29], making it suitable for consistent and affect-rich synthetic data generation.</p>
          <p>To validate the emotional bias injection, 100 general-purpose prompts were used to compare outputs from the base model and the emotional variants. Responses were manually annotated for emotional alignment, confirming the effectiveness of the fine-tuning procedure.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>3.2.3. Fine-Tuning Procedure and Emotional Bias Injection</title>
          <p>Fine-tuning targeted dialogue generation, with the objective of aligning the model's output tone with the intended emotion (joy, anger, fear). No classification objective was used. The training target was the next utterance in a 10-turn dialogue, conditioned on prior context and intended emotion. Fine-tuning was conducted using LoRA (Low-Rank Adaptation) [30], which enables efficient training of large models on consumer-grade hardware. LoRA introduces learnable low-rank matrices for each weight matrix in the base model; only these matrices are updated during training, and they are applied as a linear transformation during inference to condition outputs. The Hugging Face PEFT library [31] was used to implement LoRA, targeting the query and value projection modules of Velvet-14B.</p>
          <p>The fine-tuning pipeline begins with data tokenization, followed by loading Velvet-14B with the LoRA adapter. Training resumes from the latest checkpoint, or starts from scratch if none is found. Models and tokenizers are periodically saved. Across all variants, training showed stable convergence, with all models reaching optimal performance within 0.5 epochs—well before the 2-epoch limit. Best-performing checkpoints were consistently obtained between steps 20 and 30.</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>3.2.4. Web Application and Interaction Framework</title>
          <p>A custom web application was developed to facilitate user interaction with the fine-tuned models. The system comprises a Streamlit-based frontend, a FastAPI backend, and a Milvus vector database supporting RAG. The frontend, built with Streamlit, simplifies interface development by translating Python into React components. The backend handles real-time messaging and contextual prompt construction, creating a seamless ChatGPT-like experience.</p>
          <p>To support retrieval, text is embedded using the intfloat/multilingual-e5-base model [32], optimized for multilingual retrieval tasks. The model distinguishes queries and documents using prefixed prompts ("query:", "passage:"), improving asymmetric retrieval performance. Its balance between performance and efficiency makes it suitable for production environments without specialized hardware.</p>
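          <p>The sketch below illustrates the asymmetric query/passage prefixing used by the E5 family for retrieval; the example texts are placeholders and not taken from the lecture material.</p>
          <preformat>
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 expects asymmetric prefixes: "query: " for user questions, "passage: " for documents.
query = "query: Che cos'è la retropropagazione?"
passages = [
    "passage: La retropropagazione calcola i gradienti dei pesi di una rete neurale.",
    "passage: Il percettrone è un classificatore lineare a soglia.",
]

# normalize_embeddings=True makes the dot product equal to the cosine similarity.
query_emb = encoder.encode(query, normalize_embeddings=True)
passage_embs = encoder.encode(passages, normalize_embeddings=True)
scores = util.cos_sim(query_emb, passage_embs)  # shape: (1, number of passages)
best_passage = passages[int(scores.argmax())]
          </preformat>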
          <p>The RAG component retrieves short academic passages relevant to the user query (e.g., definitions, concepts, and examples from the lecture material), which are then prepended to the prompt. The goal is not to alter the emotional framing, but to anchor the response in topical knowledge. This contextual grounding ensures that emotional variation does not come at the expense of content relevance or factuality—especially important in educational settings.</p>
          <p>RAG operates in two stages: cosine similarity retrieval and score normalization. Due to the contrastive-learning temperature (τ = 0.1), cosine scores are highly concentrated in the [0.7, 1] range. A test using 50 unrelated queries confirmed this narrow distribution (Figure 1), which justifies the application of standard score normalization.</p>
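          <p>As a small illustration of the normalization stage, the retrieved cosine scores can be standardized before thresholding, as sketched below; the cut-off value is an assumption.</p>
          <preformat>
import numpy as np

def standardize(cosine_scores):
    # Standard-score (z-score) normalization of the narrowly distributed cosine similarities.
    scores = np.asarray(cosine_scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Example: keep only the chunks whose standardized score exceeds an assumed cut-off.
# selected = [chunk for chunk, z in zip(chunks, standardize(scores)) if z > 0.5]
          </preformat>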
          <p>The system workflow starts when the user submits a query via the frontend; the query is processed by the backend together with the contextual history. Relevant chunks are retrieved from the Milvus database and appended to the prompt before it is passed to the appropriate emotional model. The response is generated and returned through the backend to the user interface.</p>
        <sec id="sec-2-5-1">
          <title>3.2.5. Social Experiment</title>
          <p>A social experiment was fully designed to evaluate the
impact of emotional bias in an educational setting.</p>
          <p>Participating students would be randomly assigned
to one of four model variants: apathy (baseline), joy,
anger, or fear. Following a lecture, students would take a
multiple-choice comprehension test (single and multiple
answers), with model assistance allowed during the test.</p>
          <p>Performance would be assessed via accuracy metrics
per group. In parallel, a post-test Likert-scale
questionnaire would collect subjective feedback on interaction
quality, clarity of responses, and perceived helpfulness.</p>
          <p>The study was designed to offer both objective and
subjective insights into the effects of emotionally biased
LLMs in educational environments. If implemented, it
would have provided valuable data to complement the
Alpha experiment, contributing to a broader understanding
of emotion in human-AI interaction.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <p>This section reports the findings from the Alpha and
Omega experiments, which examine the bidirectional
role of emotions in human–LLM interaction:
user-to-model (Alpha) and model-to-user (Omega).
Empirical results show that emotionally biased prompts,
despite constant semantic content, impact model
performance. Prompts conveying joy yield the highest average
accuracy across tasks and models (58.08%), while those
expressing fear perform worst (53.60%), with a 4.5pp
performance gap. This confirms that emotional tone, even
in subtle prompt variations, can measurably affect output
quality.</p>
      <p>Effect sizes were evaluated using Cohen's d, given the small sample sizes. Pairwise comparisons across emotions (e.g., joy vs. fear: d = 0.1771) revealed small yet meaningful differences, with joy consistently outperforming fear and anger. All comparisons employed the pooled standard deviation for normalization. Full results are visualized in Figure 2.</p>
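      <p>For reference, the effect sizes reported here follow the standard pooled-standard-deviation form of Cohen's d, as sketched below.</p>
      <preformat>
import numpy as np

def cohens_d(a, b):
    # Effect size between two emotion conditions using the pooled standard deviation.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Example: d = cohens_d(accuracy_per_task["joy"], accuracy_per_task["fear"])
      </preformat>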
      <p>To better illustrate these trends, we report detailed
task-level performance across models and emotional
conditions in Tables 2–9. Each table shows accuracy (or
BLEU score for ReCoRD) across five LLMs for a given task,
grouped by emotional prompt variant. The final
cross-task summary (Table 9) aggregates mean performance,
confirming that prompts expressing joy consistently lead
to higher scores across models and tasks, while fear yields
the lowest. While LLMs exhibit general robustness to
emotional modulation, these results highlight that even
minor emotional perturbations can shift performance
outcomes in systematic ways.</p>
      <sec id="sec-3-1">
        <title>4.2. Omega: Emotional Influence from</title>
      </sec>
      <sec id="sec-3-2">
        <title>Model to User</title>
        <p>To assess reverse emotional impact, we fine-tuned
Velvet-14B via LoRA on joy-, anger-, and fear-labeled corpora.</p>
        <p>Each variant was tested on 100 GPT-4o-generated abstract prompts. Responses were manually annotated for the presence of the target emotional tone using a binary function that maps each response to {0, 1}, yielding emotional bias scores. The annotation process was executed following specific tagging rules:</p>
        <list list-type="bullet">
          <list-item>
            <p>Joy: 1 if, and only if, the response exhibits a warm, reassuring tone conveying joy or a generally positive mood, else 0.</p>
          </list-item>
          <list-item>
            <p>Anger: 1 if, and only if, the response has a heated, blunt tone expressing anger, directness, or aggressiveness, else 0.</p>
          </list-item>
          <list-item>
            <p>Fear: 1 if, and only if, the response displays a gloomy or sad tone expressing fear, worry, insecurity, or sadness, else 0.</p>
          </list-item>
        </list>
        <p>Results indicate successful emotional conditioning: the joy-biased model showed a +19% emotional expression rate, anger +8%, and fear +6% (Figure 3). Notably, emotional bias affected not only tone but also content, especially in philosophical responses—despite no overlap with training data. This implies that emotion-conditioned fine-tuning influences the model's latent representations in a generalizable way.</p>
        <fig id="fig3">
          <label>Figure 3</label>
          <caption>
            <p>Emotion Bias Detection Results: Baseline vs Fine-Tuned model.</p>
          </caption>
        </fig>
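        <p>A minimal sketch of how these binary annotations translate into the reported emotional-expression rates (e.g., the +19% figure for joy) is shown below; variable names are illustrative.</p>
        <preformat>
def expression_rate(tags):
    # Fraction of the 100 annotated responses tagged 1 for the target emotion.
    return sum(tags) / len(tags)

# Bias introduced by fine-tuning: rate of the emotional variant minus the baseline rate.
# joy_bias = expression_rate(joy_model_tags) - expression_rate(baseline_tags)   # about +0.19
        </preformat>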
        <p>Although the full Omega experiment was not deployed to end users, the underlying framework is fully designed and ready for implementation. Deployment was constrained by practical limitations: supporting real-time LLM interaction for a full classroom cohort required non-trivial infrastructure, including API routing, authentication, and persistent session management, and the associated operational costs exceeded our available budget. Nevertheless, we validated the framework's core component—emotion-conditioned generation—by quantifying the degree of emotional bias introduced during fine-tuning, thus laying the groundwork for future user-facing trials.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion and Conclusions</title>
      <p>This work presents a dual experimental framework to investigate the bidirectional role of emotion in human–LLM interactions. In Experiment Alpha, we showed that emotional tone in prompts—without altering semantic content—impacts model performance. Prompts expressing joy and apathy outperformed those conveying anger or fear, suggesting that LLMs are sensitive to affective framing. This may stem from emotional mirroring effects in pretrained embeddings or from improved clarity in emotionally positive formulations. The observed alignment with Ekman's model, particularly the behavioral opposition of joy and fear, supports the hypothesis that LLMs encode structured affective representations.</p>
      <p>Experiment Omega further supports this claim from the reverse direction. While user-centered evaluation was deferred, the fine-tuned Velvet-14B variants (via LoRA) exhibited measurable emotional bias (+19% joy), despite training on synthetic dialogues and lacking explicit emotion labels. This demonstrates the feasibility of lightweight, emotion-targeted fine-tuning for steering LLM responses. We acknowledge the use of translated synthetic dialogues in lieu of a native Italian emotional corpus as a limitation; future work will explore emotion annotation on native Italian corpora to reduce potential translation artifacts.</p>
      <p>These findings carry three key implications. First, emotion in language modulates LLM behavior and is not merely decorative. Second, emotional conditioning can be engineered efficiently through prompt design or fine-tuning. Third, affect-aware models have potential in user-facing applications where tone impacts trust, clarity, or engagement.</p>
      <p>Limitations include the restricted emotion set, the lack of dimensional affect modeling, the handcrafted prompt design, and the absence of direct human evaluation in Omega. Future work will address these by adopting valence–arousal models and conducting user studies to assess perception, comprehension, and long-term effects. Moreover, as noted above, the reliance on translated synthetic dialogues rather than a native Italian emotional corpus remains a limitation that natively annotated Italian data should address.</p>
      <p>One noteworthy limitation of this dual-experiment framework lies in its linguistic asymmetry: Experiment Alpha is conducted entirely in English, leveraging the SuperGLUE benchmark, while Experiment Omega is designed for Italian-speaking users in an educational setting. Although this choice is contextually motivated—Alpha prioritizes benchmark compatibility and Omega emphasizes ecological validity in the Italian academic environment—it introduces a gap in linguistic continuity that hinders direct comparison and limits claims of generalizability. Emotional framing and perception can be language-dependent due to differences in affective semantics, pragmatics, and cultural connotations. This language asymmetry currently limits direct comparisons between Alpha and Omega. While each experiment was designed to maximize contextual validity—English for standardized benchmarks, Italian for real-world educational use—we recognize the challenge this poses for unified interpretation. A key goal for future work is to harmonize both experiments in a shared linguistic setting, allowing more robust cross-experiment generalization.</p>
      <p>Overall, this study lays the groundwork for integrating emotion as a first-class variable in language-based AI systems. Responsible use of emotion-aware techniques could enable more effective, human-aligned, and context-sensitive interactions across a range of applications.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) for paraphrasing, rewording, and formatting assistance. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <year>2020</year>
          .eval4nlp-
          <fpage>1</fpage>
          .
          <fpage>10</fpage>
          . [30]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiebe</surname>
          </string-name>
          , T. Wilson,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          , Annotating expres- S. Wang,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Lora: Low-rank adap-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>sions of opinions and emotions in language</article-title>
          , Lan- tation
          <source>of large language models</source>
          ,
          <year>2021</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>guage Resources and Evaluation</source>
          <volume>39</volume>
          (
          <year>2005</year>
          )
          <fpage>165</fpage>
          -
          <lpage>210</lpage>
          . //arxiv.org/abs/2106.09685. arXiv:
          <volume>2106</volume>
          .
          <fpage>09685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>doi:10.1007/s10579-005-7880-9</source>
          . [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          , S.-
          <string-name>
            <given-names>Z. J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Utilizing large language models with causal Parameter-eficient fine-tuning methods for pre-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>pathic dialogue generation</article-title>
          ,
          <source>in: 2025 IEEE 15th assessment</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Annual Computing and Communication Workshop</source>
          <volume>12148</volume>
          . arXiv:
          <volume>2312</volume>
          .
          <fpage>12148</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>and Conference (CCWC)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>00103</fpage>
          -
          <lpage>00109</lpage>
          . [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>doi:10.1109/CCWC62904</source>
          .
          <year>2025</year>
          .10903745.
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Multilingual e5 text embeddings: A technical [</article-title>
          23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          , report,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.05672.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          , Measuring arXiv:
          <volume>2402</volume>
          .
          <fpage>05672</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2021. URL: https://arxiv.org/abs/
          <year>2009</year>
          .03300.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          arXiv:
          <year>2009</year>
          .
          <volume>03300</volume>
          . [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>tic evaluation of language models</article-title>
          ,
          <year>2023</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          //arxiv.org/abs/2211.09110. arXiv:
          <volume>2211</volume>
          .
          <fpage>09110</fpage>
          . [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gozzi</surname>
          </string-name>
          , Bidirectional emotional influ-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>repository, https://github.com/gozus19p/</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>Emotional-Influence-in-Human-</article-title>
          <string-name>
            <surname>LLM</surname>
          </string-name>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Accessed</surname>
          </string-name>
          :
          <fpage>2025</fpage>
          -07-23. [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu, Bleu:
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>phia</surname>
          </string-name>
          , Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          https://aclanthology.org/P02-1040/. doi:
          <volume>10</volume>
          .3115/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          1073083.
          <fpage>1073135</fpage>
          . [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Likert</surname>
          </string-name>
          ,
          <article-title>A technique for the measurement of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>attitudes</surname>
          </string-name>
          ,
          <source>Archives of Psychology</source>
          <volume>140</volume>
          (
          <year>1932</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          . [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Poria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hazarika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Naik</surname>
          </string-name>
          , E. Cam-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>tions</surname>
          </string-name>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1810</year>
          .02508.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          arXiv:
          <year>1810</year>
          .
          <volume>02508</volume>
          . [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Maio</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>task vs</article-title>
          .
          <source>multitask prompts, Electronics</source>
          <volume>13</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>URL: https://www.mdpi.com/2079-9292/13/23/4712.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>doi:10</source>
          .3390/electronics13234712.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>