<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Emotion-Conditioned Response Generation for Role-Playing Agents Using Large Language Models: A Case Study with Facial Expression Labels from Visual Novel Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shinji Muraji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafal Rzepka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshihiko Itoh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hokkaido University</institution>
          ,
          <addr-line>Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido, 060-0814</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>144</fpage>
      <lpage>159</lpage>
      <abstract>
<p>In recent years, research on role-playing agents that leverage the powerful conversational capabilities of large language models (LLMs) has been actively conducted. While many studies have explored how LLMs can generate utterances that reflect a character’s individuality, utterance generation conditioned on the target character’s emotions has not been sufficiently examined. To address this gap, we used scenario data from a visual novel game in which facial expressions, one of the key components of emotional information, are explicitly annotated. We trained an LLM to generate responses conditioned on facial expression labels by incorporating these labels into the input during fine-tuning, and evaluated the resulting utterances in terms of their character-likeness. Our experiments revealed three key findings: (1) conditioning on facial expression labels improves the perceived character-likeness of the generated utterances; (2) providing the LLM with facial expression labels that do not correspond to the character's actual expressions can still enhance the perceived character-likeness of the generated utterances; and (3) there remains a gap between how humans and LLMs process emotional information in dialogue generation.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotional Intelligence</kwd>
        <kwd>Role-Playing Agent</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Emotion Processing</kwd>
        <kwd>Facial Expression Label</kwd>
        <kwd>Character-Likeness</kwd>
        <kwd>Visual Novel</kwd>
        <kwd>Dialogue Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advances in large language models (LLMs) have enabled dialogue systems to produce utterances
that convey individuality. To achieve consistent and engaging role-specific behavior, many studies have
developed role-playing agents that emulate particular characters, evaluating their ability to reproduce
the target character’s knowledge, personality, and linguistic style [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>However, the emotional expressiveness of such agents has largely remained a black box. Emotions
play a central role in defining individuality, yet previous works rarely treat emotional information as
an explicit conditioning factor in role-playing generation. To address this gap, we investigate whether
explicitly conditioning utterance generation on facial expressions, a key form of emotional information,
enhances the perceived character-likeness of generated responses. An overview of this concept is
shown in Figure 1.</p>
      <p>Collecting reliable emotion labels for dialogue data is challenging due to subjective variation among
annotators, making it difficult to obtain ground-truth emotional information for a target character. As a
result, it has not been fully investigated whether conditioning on emotional information can improve
the perceived character-likeness of generated utterances. In contrast, visual novel games provide
naturally aligned multimodal data in which each utterance is explicitly paired with the character’s
facial expression in the original work. We treat these facial expressions as ground-truth emotional
information and use them to condition LLM-based utterance generation, evaluating whether such
conditioning improves the role-playing consistency of responses.</p>
      <p>Furthermore, to examine the role of label reliability, we compare three conditions: (1) using
groundtruth expressions from the game, (2) using LLM-generated expressions predicted from the scene, and
(3) using randomly assigned expressions. Through this comparison, we explore how the source and
accuracy of emotional information affect the character-likeness of generated utterances.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Role-Playing and Emotion-Conditioned Agents</title>
        <p>
          Prior research on role-playing agents generally follows two main approaches: prompt-based and
fine-tuning strategies [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
          ]. Prompt-based methods aim to elicit character-like behavior through
instruction design or retrieval, whereas fine-tuning methods adapt model parameters for specific
characters. Emotion-related work includes EmoCharacter, a benchmark for assessing the emotional
fidelity of role-playing agents, which reports that fine-tuning on real dialogue data and in-context
learning can improve emotional fidelity [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In parallel, EmotionalRAG introduces emotion-aware
retrieval to enhance role-playing without relying on fine-tuning the base LLM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Emotion-Annotated Dialogue Datasets</title>
        <p>
          Emotion recognition in dialogue involves labeling utterances with emotion categories. Representative
datasets include MELD [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], where each utterance is annotated with Ekman’s six basic emotions plus
Neutral, and EmoryNLP [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which uses labels drawn from Willcox’s Feeling Wheel. These datasets
highlight the high subjectivity and annotation difficulty in emotion labeling. In contrast, our work uses
character facial expressions depicted in visual novel scenes as author-defined emotional signals for
conditioning response generation.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Modeling Role-Playing Agents</title>
      <p>We aim to develop an agent that incorporates the target character’s emotional information into its
training process. Following previous studies, we construct a role-playing agent using a large language
model (LLM) and employ both a role-playing-oriented prompt template and a fine-tuning strategy.
The role-playing-oriented prompt template is designed to leverage the LLM’s ability to understand
instructions and utilize general factual knowledge about the target character. Meanwhile, the fine-tuning
strategy adjusts the model’s internal parameters so that it can reproduce finer aspects of the character’s
linguistic style and factual consistency while effectively utilizing emotional information. Each strategy
is described in detail below.</p>
      <sec id="sec-3-1">
        <title>3.1. Prompt Template</title>
        <p>We designed a static prompt template for role-playing to elicit the LLM’s character-consistent responses
during both training and evaluation. Each prompt includes a brief Wikipedia-based description of
the target character and specifies either the situation alone or both the situation and facial expression
as conditioning information. A detailed explanation and the full prompt structure are provided in
Appendix A.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Fine-Tuning Strategy</title>
        <p>We fine-tune a large language model (LLM) to generate responses conditioned on the target character’s
emotional information. To isolate the effect of emotional conditioning, we prepare two core models:
one fine-tuned on the situation only (baseline agent) and another fine-tuned on both the situation and
the corresponding facial expression (facial expression-conditioned agent).</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Baseline agent</title>
          <p>The baseline agent is trained using (situation, utterance) pairs, where the situation consists of the
preceding ten lines of the novel-style scenario text describing dialogue turns, scene setting, and character
actions. The model is optimized to maximize the likelihood of reproducing the target character’s actual
utterances from these contexts.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Facial Expression-Conditioned Agent</title>
          <p>The facial expression-conditioned agent extends this setting by incorporating an additional conditioning
signal: the facial expression label corresponding to each utterance. It is trained on (situation, facial
expression, utterance) triples, learning to generate utterances that reflect both the narrative context and
the character’s emotional state.</p>
          <p>To further examine whether the use of ground-truth facial expressions is necessary, we compare three
training variants: (1) using the original ground-truth expressions, (2) using LLM-inferred expressions
from the situation text, and (3) using randomly assigned expressions. This comparison allows us to
assess whether plausible but non-authentic emotional information can still enhance the perceived
character-likeness of generated utterances.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Construction of a Facial Expression-Annotated Dataset</title>
      <p>To train the proposed role-playing agents and examine the effect of emotional information, we construct
triplets of (situation, facial expression, utterance) from a visual novel game.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Collection</title>
        <p>In collaboration with and under the consent of the game’s developer, we build a dataset for one target
character from a commercial visual novel. Although expanding to multiple characters is desirable, the
construction and human evaluation costs are high; therefore, we report results for a single character in
this paper. The developer granted permission to use the data and publish the results, but the title and
character name had to be withheld. Instead, we report sufficient aggregate statistics to characterize the
dataset.</p>
        <p>A visual novel presents text in a novel-like format, accompanied by images and voices attached to
each script line. A brief overview of this format is given in Appendix B. Each image depicts the current
facial expression of a character, which we label and use as conditioning input for the LLM.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Overall Data Structure</title>
        <p>An example of a visual novel game screen is shown in Figure 2. The major elements are highlighted with
circles, and their corresponding names are indicated in italics. Note that this example was created for
explanatory purposes and is not part of the actual game scenario. The game script consists of narrative
text that describes the protagonist’s inner thoughts and the scene, as well as character utterances with
explicit speakers. Although the narrative parts could be considered noise, rewriting them from the
protagonist’s perspective into another character’s viewpoint would be impractical. Therefore, in this
study, we use the original script text as contextual descriptions of the situations in which utterances
occur.</p>
        <p>Each line in the script corresponds to one or more sentences and is linked to the facial expressions
of the characters shown in that scene. We extract the target character’s facial expression for each line;
when the character is absent, the most recent expression is retained. An example is shown in Table 1,
illustrating how each script line is paired with the target character’s facial expression.</p>
        <p>From data structured as shown in Table 1, we use the target character’s utterances as anchors to
extract the preceding ten lines of the script as the situation, and the facial expression at the time of the
utterance as the facial expression. If fewer than ten preceding lines exist, we use all available preceding
lines as the context. When a scene transition occurs in the game, we treat it as a new scene and construct
separate data for each scene. An example of a resulting triplet generated from the final target utterance
in Table 1 is shown in Table 2.</p>
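        <p>The extraction procedure described above can be summarized by the following Python sketch; the field names and the script parser it assumes are illustrative, not the actual implementation:</p>
        <preformat>
def build_triplets(scene_lines, target="TARGET_CHARACTER", context_len=10):
    """Build (situation, facial expression, utterance) triplets for one scene.

    `scene_lines` is assumed to be a list of dicts with keys 'text',
    'speaker' (None for narration) and 'expression' (the target character's
    expression shown with that line, or None when the character is absent).
    """
    triplets = []
    last_expression = None
    for i, line in enumerate(scene_lines):
        # When the character is absent, the most recent expression is retained.
        if line.get("expression") is not None:
            last_expression = line["expression"]
        if line.get("speaker") == target:
            # Use up to the ten preceding lines of the scene as the situation.
            context = scene_lines[max(0, i - context_len):i]
            triplets.append({
                "situation": "\n".join(l["text"] for l in context),
                "expression": last_expression,
                "utterance": line["text"],
            })
    return triplets

# Calling build_triplets() once per scene keeps contexts from crossing
# scene transitions.
        </preformat>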
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Facial Expression Labels</title>
        <p>Each facial expression image of the target character was assigned a short, descriptive textual label
so that the LLM could process expressions as linguistic inputs rather than categorical codes. Three
annotators (including one of the authors) who had played the visual novel collaboratively created these
labels, describing each expression in concise natural-language terms. When a single image conveyed
multiple aspects, all agreed-upon descriptors were retained. A detailed description of the labeling
procedure and examples of the resulting labels are provided in Appendix C.</p>
        <p>Although the game includes more than 45 facial images of the target character, several depict nearly
identical expressions with different poses. To simplify the learning task and avoid excessive label
granularity, we consulted with the game developer and consolidated them into 14 representative facial
expressions. An overview of the final label definitions and their distribution is presented in Appendix D.</p>
        <p>The distribution of facial expression labels is notably imbalanced, reflecting the frequent occurrence
of serious or emotionally intense scenes in the game. Since our experiments evaluate responses within
the same narrative context, we did not apply any rebalancing techniques to the dataset.</p>
        <p>We also prepared augmented expression labels generated by an open-source LLM from the preceding
context in the script to examine whether inferred emotional information could enhance role consistency.
Details of this augmentation process and the prompt used are provided in Appendix E.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Dataset Statistics</title>
        <p>This section presents the statistics of the collected dataset. To prepare the data for training, we split it
into training, validation, and test sets. Because the target character’s utterances in the test set might
otherwise appear within the situational context of the training set, we perform the split on a per-scene
basis. Specifically, the 109 total scenes are divided in an 8:1:1 ratio for the training, validation, and test
sets, respectively. The resulting dataset statistics are shown in Table 3.</p>
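        <p>As an illustration, the per-scene split can be written as follows (a minimal sketch; the random seed and shuffling procedure are assumptions):</p>
        <preformat>
import random

def split_scenes(scene_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split scene identifiers into train/validation/test sets at an 8:1:1 ratio."""
    ids = list(scene_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Triplets inherit the split of their scene, so a test utterance never
# appears inside a training situation.
        </preformat>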
        <p>Since the data were collected in Japanese, sequence lengths are measured in characters, not words.
Although the dataset focuses on a single target character, it provides a relatively large-scale resource in
which each utterance is paired with detailed emotion-related information derived from facial expressions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>We fine-tune the LLM using the dataset constructed in Section 4 and evaluate the generated utterances
in terms of semantic similarity and perceived character-likeness. This section describes the fine-tuning
details, including hyperparameters, the models used for comparison, and the evaluation methods.</p>
      <sec id="sec-5-1">
        <title>5.1. Fine-Tuning Details</title>
        <p>We fine-tune an open-source large language model, Qwen/Qwen3-32B [9], using 4-bit quantization
and the QLoRA framework [10] under a causal language modeling objective. All computations are
performed in bfloat16 precision to reduce memory usage.</p>
        <p>The LoRA rank is set to 256, with a dropout rate of 0.1, applied to the standard attention and MLP
projection layers. Training uses the AdamW optimizer with a learning rate of 5 × 10⁻⁵ and a cosine
learning rate schedule. We train for five epochs with an efective batch size of 8, and select the model
achieving the lowest validation loss as the final checkpoint. Full training hyperparameters are listed in
Appendix F.</p>
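        <p>The following sketch shows how this configuration maps onto the Hugging Face transformers and peft APIs; the target-module list and LoRA alpha are assumptions not stated in the main text, and the full hyperparameters are listed in Appendix F:</p>
        <preformat>
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization with bfloat16 compute (QLoRA setting).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the standard attention and MLP projection layers.
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,            # assumption: not reported in the main text
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    bf16=True,
    logging_steps=10,
)
# A causal LM Trainer over the prompt-formatted (situation[, expression],
# utterance) examples is then run, and the checkpoint with the lowest
# validation loss is kept (details omitted).
        </preformat>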
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Comparison Models</title>
        <p>From each triplet (situation, facial expression, utterance) in the test set, each model generates utterances
based on either the situation alone or both the situation and the facial expression. The models compared
in this study are as follows:
• Baseline agents (situation only)
– No fine-tuning: The original pretrained LLM without fine-tuning, using only prompt
instructions to generate responses.
– Fine-tuned on situation only: The model fine-tuned to generate utterances conditioned
solely on the situation.
• Emotion-Conditioned Agents (situation + facial expression)
– Ground-truth expression: Trained on ground-truth facial expressions extracted from the
game.
– LLM-augmented expression: Trained on facial expressions automatically inferred by an
LLM from the situation.
– Random expression: Trained on randomly assigned facial expression labels.</p>
        <p>The “no fine-tuning” model generates responses solely based on the given prompt instructions,
allowing us to examine the effect of fine-tuning by comparing it with the situation-only model. For
each emotion-conditioned agent, the same type of facial expression labels used for training are also
used at inference time.</p>
        <p>To further verify whether the agent truly conditions its responses on the given facial expressions, we
also test the ground-truth expression model with mismatched expressions at inference time. Specifically,
we evaluate two additional settings in which the model trained with ground-truth facial expressions is
given LLM-augmented expressions or random expressions during inference. Thus, in total, we compare
seven models.</p>
        <p>• Emotion-Conditioned Agents (situation + facial expression)
– Ground-truth expression (inference with LLM-augmented expression): Trained with
ground-truth expressions and evaluated using LLM-augmented expressions at inference time.
– Ground-truth expression (inference with random expression): Trained with
ground-truth expressions and evaluated using randomly assigned expressions at inference time.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation Method</title>
        <p>We evaluate the utterances generated by each agent in terms of two aspects: (1) semantic similarity to
the corresponding gold utterance, and (2) perceived character-likeness of the generated utterance with
respect to the given situation. Each evaluation method is described in detail below.</p>
        <sec id="sec-5-3-1">
          <title>5.3.1. Semantic Similarity</title>
          <p>
            Following prior studies [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], we automatically evaluate the generated utterances by measuring the
cosine similarity between their sentence embeddings and those of the corresponding gold utterances. To
focus purely on the utterance content, quotation marks and surrounding tokens were removed before
embedding. For sentence embedding, we used plamo-embedding-1b [11], a model known for its strong
Japanese semantic representation capability.
          </p>
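          <p>Conceptually, each test case is scored as the cosine between the two sentence embeddings. The sketch below assumes a hypothetical embed() helper wrapping plamo-embedding-1b, since its exact loading code is omitted here:</p>
          <preformat>
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_score(generated, gold, embed):
    """`embed` is a hypothetical callable returning a sentence embedding
    (e.g., a wrapper around pfnet/plamo-embedding-1b). Quotation marks and
    surrounding tokens are stripped before embedding."""
    def clean(s):
        return s.strip().strip('「」"\'')
    return cosine_similarity(embed(clean(generated)), embed(clean(gold)))
          </preformat>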
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Character-Likeness Evaluation</title>
          <p>
            For a more rigorous evaluation of character-likeness, we conducted a human annotation study.
Annotators were presented with pairs of a situation and a generated utterance and asked to rate how
consistent the utterance was with the target character across four aspects: overall impression,
personality, linguistic style, and knowledge consistency. Each aspect was scored on a 7-point Likert scale.
These evaluation criteria were designed with reference to prior studies on role-playing agents [
            <xref ref-type="bibr" rid="ref1">1, 12</xref>
            ],
focusing on aspects potentially related to emotional expression. A 7-point scale was chosen to capture
finer distinctions than a 5-point scale.
          </p>
          <p>Because some gold utterances in the dataset are short and generic (e.g., “Yeah”), it is unrealistic for
all gold utterances to consistently receive the maximum score. Therefore, we also included the gold
utterances in the annotation process to establish a human upper bound. As a result, each annotation
session consisted of eight utterances per situation, including seven generated utterances and one gold
utterance, each rated across four criteria.</p>
          <p>Since this annotation is costly, we randomly sampled 100 situations from the 420 test cases for
evaluation. For each situation, annotators rated the utterances (presented in random order) based on
the four aspects above. Each utterance was rated by three different annotators, and the mean of their
scores was taken as the utterance score. The model-level score was computed as the average across
the 100 situations. When two models generated identical utterances, the utterance was evaluated only
once, and the same score was assigned to both.</p>
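          <p>The score aggregation amounts to the following computation (illustrative; the array layout is an assumption):</p>
          <preformat>
import numpy as np

def model_level_scores(ratings):
    """ratings: array of shape (situations=100, utterances=8, criteria=4, annotators=3).

    Returns an (utterances, criteria) matrix: the mean over the three
    annotators, averaged over the 100 sampled situations.
    """
    per_utterance = ratings.mean(axis=-1)   # average the three annotators
    return per_utterance.mean(axis=0)       # average over the situations
          </preformat>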
          <p>Annotators were required to be sufficiently familiar with the target character to understand their
emotional expressions; therefore, only those who had played through and completed the original visual
novel were eligible. A total of nine volunteers (including one author, who evaluated in a blinded setting)
participated in the annotation.</p>
          <p>To assess the reliability of the human judgments, we measured inter-annotator agreement using
Krippendorff’s α, obtaining a value of α = 0.44. Although the task is inherently subjective, this level
of agreement indicates a reasonable level of consistency among annotators.</p>
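          <p>For reference, Krippendorff’s α can be computed with the open-source krippendorff package as sketched below; treating the 7-point ratings as interval data is an assumption:</p>
          <preformat>
import numpy as np
import krippendorff

def inter_annotator_agreement(reliability_data):
    """reliability_data: one row per annotator, one column per rated item;
    np.nan marks items that a given annotator did not rate."""
    return krippendorff.alpha(
        reliability_data=np.asarray(reliability_data, dtype=float),
        level_of_measurement="interval",
    )
          </preformat>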
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>In this section, we present the evaluation results of semantic similarity and perceived character-likeness,
followed by additional analyses and discussion.</p>
      <sec id="sec-6-1">
        <title>6.1. Semantic Similarity Results</title>
        <p>The results of the automatic evaluation based on semantic similarity are shown in Table 5. The model
trained and inferred with ground-truth facial expressions achieved the highest similarity to the gold
utterances, consistent with previous studies. Interestingly, from the perspective of semantic similarity,
using random facial expressions during training yielded slightly higher scores than training without
any emotional conditioning.</p>
        <p>Furthermore, for the model trained with ground-truth expressions, inference with LLM-augmented
expressions resulted in higher similarity than inference with random expressions. This indicates that
when emotion-conditioned training is employed, conditioning the model on facial expressions closer
to the original data can enhance semantic similarity. However, both inference settings performed
worse than the model trained and inferred with random expressions, suggesting that when high-quality
facial expression information is unavailable at test time, training with actual expressions offers limited
advantage in terms of semantic similarity.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results of Character-Likeness Evaluation</title>
        <sec id="sec-6-2-1">
          <title>6.2.1. Effect of Fine-Tuning</title>
          <p>The difference between the no fine-tuning model and the situation-only model demonstrates that
our fine-tuning setup and data volume were effective. Because we did not apply dynamic retrieval
or knowledge augmentation, the relatively low scores of the no fine-tuning model likely reflect its
reliance on prompt instructions alone. The situation-only model achieved around four points on the
seven-point Likert scale, corresponding to “neutral” across all criteria, indicating that the adopted
fine-tuning strategy raised the perceived quality to an acceptable level.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.2.2. Effect of Training with Ground-Truth Expressions</title>
          <p>When comparing the situation-only model and the ground-truth expression model, little difference
was observed. This suggests that conditioning on ground-truth facial expressions did not necessarily
make the generated utterances more characteristic of the target character in this experimental setting.
However, because the latter model explicitly incorporates emotional information rather than treating it
as an implicit factor, it has the potential to enable finer control over the agent’s expressive behavior.
Further investigation of this controllability is left for future work.</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>6.2.3. Which Expression Data Should Be Used?</title>
          <p>Among the models trained and tested with different types of expression data, the one trained and
evaluated with LLM-augmented expressions achieved the highest scores across all criteria, a somewhat
unexpected outcome. In principle, the triplets of (situation, expression, utterance) from the original
game are internally consistent from a human perspective, and we hypothesized that a model trained
on such data would reproduce the character’s utterances most faithfully (as supported by its superior
semantic similarity). However, the LLM-augmented expressions were inferred solely from the situation
and were not guaranteed to align with the actual utterances. Given their low accuracy relative to the
true labels, these pseudo-expressions likely introduce noise. Nevertheless, the model trained with them
achieved the best human-rated character-likeness. To analyze this unexpected result, we conducted
an additional human evaluation to assess whether the generated utterances were consistent with the
conditioned facial expressions, as described in the next subsection.</p>
          <p>When comparing the ground-truth expression and random expression models, the former achieved
slightly higher scores. This indicates that the model did not treat facial expressions merely as random
noise but learned some meaningful associations between expressions and utterances. However, since
the difference from the situation-only model was marginal, this association was likely weak, potentially
due to label imbalance and limited data, which remain an open challenge.</p>
        </sec>
        <sec id="sec-6-2-4">
          <title>6.2.4. Effect of Changing Test-Time Expressions</title>
          <p>Next, we compare cases where the model trained on ground-truth expressions was tested with different
types of expressions. When ground-truth expressions were replaced by LLM-augmented or random
ones during inference, the scores dropped to the lowest levels among all fine-tuned models. This result
suggests that the model had learned dependencies between actual expressions and utterances, and
its generation quality degraded when provided with inconsistent or mismatched emotional cues at
inference time.</p>
        </sec>
        <sec id="sec-6-2-5">
          <title>6.2.5. Scores Assigned to Gold Utterances</title>
          <p>Finally, we discuss the scores assigned to the gold (original) utterances. The average score for gold
utterances was approximately 5.85, indicating that annotators could reliably distinguish authentic lines
from generated ones. However, some gold utterances received much lower scores, the lowest being 3.33,
showing that even original utterances are not always perceived as “in-character.” This highlights that
the gold data do not always align with human perception of character-likeness. Therefore, evaluation
metrics for role-playing agents should be carefully designed according to the intended purpose of the
system.</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Additional Analysis and Discussion</title>
        <p>As described in the previous section, the model trained with LLM-augmented facial expressions
outperformed the model trained with ground-truth expressions in terms of perceived character-likeness.
However, it remains unclear whether the expressions used for conditioning were actually consistent
with the generated utterances when viewed by humans. In other words, a triplet of (situation, expression,
utterance) might appear inconsistent or unnatural to human observers.</p>
        <p>To investigate this, we conducted an additional experiment in which three annotators familiar with
the target character (including one of the authors) evaluated whether the conditioned facial expressions
matched the generated utterances. Each annotator was first shown a (situation, utterance) pair and
then asked: “If the target character were to utter this line in the given situation, would the following
facial expression seem appropriate for them?”</p>
        <p>Annotators rated the appropriateness of each expression on a seven-point Likert scale. If they could
not imagine the target character making the utterance in that situation, they were instructed to select
4 (neutral). Unlike the previous evaluation, which rated character-likeness for (situation, utterance)
pairs, the present evaluation asks annotators to judge the appropriateness of a facial expression image
given the same (situation, utterance) pair. All evaluations were performed blindly, and annotators did
not know which model generated each utterance. The same 100 situations and utterances as before
were used, along with the original facial expression images from which each model’s expressions were
derived during generation.</p>
        <p>The results are shown in Table 7. The model trained with ground-truth expressions produced
utterances that were judged by humans as more consistent with the given expressions than those
generated by the LLM-augmented expression model. This indicates that the latter model learned to
associate utterances with expressions that humans did not perceive as coherent.</p>
        <p>Nevertheless, the character-likeness scores for the LLM-augmented model were higher overall. This
suggests that, for fine-tuning an LLM to generate utterances perceived as more in-character, perfect
visual or emotional consistency between expressions and utterances may not be necessary. Even
expression labels that appear inconsistent to humans can serve as useful conditioning signals, guiding
the model toward more character-like generation.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In this study, we developed a role-playing agent using a static prompt designed for role-playing and a
fine-tuning strategy. We examined whether conditioning utterance generation on facial expressions, an
essential component of emotional information, would lead to responses that appear more consistent
with the target character.</p>
      <p>Experimental results showed that when a pretrained LLM inferred facial expressions from context
and used those inferred labels for conditional learning, the generated utterances were perceived as
more character-like. In contrast, the model trained with ground-truth facial expressions did not achieve
higher character-likeness scores than the situation-only model, although it produced utterances that
were more consistent with the conditioned expressions than those of the LLM-augmented expression
model. These findings suggest a gap between the expression–utterance correspondence perceived by
humans and the correspondence that facilitates character-like generation in LLMs.</p>
      <p>To enable detailed analysis, we limited our dataset to a single target character, allowing us to collect
high-quality expression–utterance pairs and conduct fine-grained evaluations. As future work, we plan
to extend our experiments to multiple characters to investigate inter-character differences in emotional
modeling. Furthermore, because our current approach did not employ retrieval augmentation, the
model’s expressive diversity may have been constrained by its internal parameters. We therefore intend
to explore retrieval-based augmentation of character information to better leverage emotional cues
such as facial expressions and other affective signals during generation.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-5 in order to: grammar and spelling
check. After using this tool/service, the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
      <p>[9] Qwen Team, Qwen3 technical report, 2025. URL: https://arxiv.org/abs/2505.09388. arXiv:2505.09388.</p>
      <p>[10] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: efficient finetuning of quantized
LLMs, in: Proceedings of the 37th International Conference on Neural Information Processing
Systems, NIPS ’23, Curran Associates Inc., Red Hook, NY, USA, 2023.</p>
      <p>[11] Preferred Networks, Inc., plamo-embedding-1b, 2025. URL: https://huggingface.co/pfnet/plamo-embedding-1b.</p>
      <p>[12] Q. Tu, S. Fan, Z. Tian, T. Shen, S. Shang, X. Gao, R. Yan, CharacterEval: A Chinese benchmark for
role-playing conversational agent evaluation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings
of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 11836–11850.
URL: https://aclanthology.org/2024.acl-long.638/. doi:10.18653/v1/2024.acl-long.638.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Prompt Template</title>
      <p>
        Previous studies have commonly employed prompt-based strategies to construct role-playing agents,
for example by tailoring the instructions given to the LLM or by augmenting the input with
character-specific information retrieved from external sources [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These approaches aim to leverage the LLM’s
internal knowledge and improve role-playing behavior through prompt design and retrieval.
      </p>
      <p>While improvements in prompting or dynamic knowledge augmentation are not unrelated to
emotional information, optimizing them together with emotion-conditioned training would make the
problem setting overly complex. Therefore, we designed a static prompt template to elicit the LLM’s
capabilities consistently during both training and evaluation.</p>
      <p>
        As static character information, we included concise descriptions sourced from Wikipedia. Inspired
by Shao et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who automatically construct character profiles from Wikipedia for fine-tuning, we did
not auto-generate training targets; instead, we used the Wikipedia descriptions only as fixed context
within our static prompt template.
      </p>
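      <p>For illustration, the overall structure of the static template can be sketched as follows; the wording and field names are hypothetical placeholders, since the actual Japanese template and the character’s identity are withheld:</p>
      <preformat>
PROMPT_TEMPLATE = """You are role-playing the character described below.

Character description (from Wikipedia):
{character_description}

Situation (preceding scenario lines):
{situation}
{expression_block}
Reply with a single utterance, staying in character."""

EXPRESSION_BLOCK = "\nCurrent facial expression: {expression}\n"

def build_prompt(description, situation, expression=None):
    """Fill the static template; the facial expression line is included
    only for the expression-conditioned agents."""
    block = EXPRESSION_BLOCK.format(expression=expression) if expression else ""
    return PROMPT_TEMPLATE.format(
        character_description=description,
        situation=situation,
        expression_block=block,
    )
      </preformat>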
    </sec>
    <sec id="sec-10">
      <title>B. Details of Visual Novel Gameplay and Data Structure</title>
      <p>For readers unfamiliar with visual novel games, Figure 4 illustrates an example of scene transitions.
As described in Section 1, a visual novel is a type of game in which users progress through
novel-style text by clicking or pressing keys, while listening to the spoken dialogue and viewing character
images. Figure 4 corresponds to the example shown in Table 1. Although the audio is not used in our
experiments, many visual novels include voice acting synchronized with character utterances. Thus,
the data can be regarded as containing discretized facial expressions linked with the corresponding
utterances and voice audio.</p>
    </sec>
    <sec id="sec-11">
      <title>C. Facial Expression Labeling Procedure</title>
      <p>Specifically, three annotators (including one of the authors) who had experienced the target visual
novel within the past year collaboratively created the labels. The labeling procedure was as follows.
First, each annotator independently described each facial expression image in one word or short phrase.
Next, all annotators reviewed one another’s descriptions and voted for the expression they found most
appropriate. Finally, all expressions that received at least one vote were adopted as valid labels. This
procedure excluded cases where annotators could not provide an adequate description. When a single
facial expression was judged to convey multiple aspects, all recognized descriptors were retained in the
label, which can therefore consist of multiple terms (e.g., [Uneasy, confused]).</p>
    </sec>
    <sec id="sec-12">
      <title>D. Facial Expression Labels and Distribution</title>
      <p>This appendix provides the full list of the 14 facial expression labels, along with the number of samples
for the original, LLM-augmented, and randomly assigned datasets (Table 8). The text labels are translated
from Japanese.</p>
    </sec>
    <sec id="sec-13">
      <title>E. Prompt for Data Augmentation</title>
      <p>For the LLM-based data augmentation, we used an open-source model that could run on our available
GPU resources (Qwen/Qwen3-32B-AWQ [9]) to predict the target character’s facial expression from
the surrounding situation text. If the model failed to output one of the 14 predefined facial expression
labels, generation was repeated until a valid label was produced.</p>
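      <p>A minimal sketch of this augmentation loop is shown below; the generate() helper standing in for a call to Qwen/Qwen3-32B-AWQ and the retry cap are assumptions added for the sketch:</p>
      <preformat>
def predict_expression(situation, generate, valid_labels, max_tries=20):
    """Ask the LLM for the target character's expression given the situation.

    `generate` is a hypothetical callable that sends the augmentation prompt
    to Qwen/Qwen3-32B-AWQ and returns its text output; `valid_labels` are the
    14 predefined labels of Appendix D.
    """
    for _ in range(max_tries):
        label = generate(situation).strip()
        if label in valid_labels:
            return label
    # The paper simply repeats generation until a valid label is produced;
    # the cap above is only a safeguard added for this sketch.
    return None
      </preformat>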
      <p>We also evaluated how closely the LLM-augmented and randomly assigned facial expression labels
matched the ground-truth labels (Table 9). Because each category is weighted equally, macro-F1 scores
are reported, which tend to be lower in absolute value. Although improving expression-labeling accuracy
is not the main objective of this work, the results show that LLM-based augmentation performs better
than random labeling but still has room for improvement.</p>
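      <p>The macro-F1 computation corresponds to the following sketch using scikit-learn:</p>
      <preformat>
from sklearn.metrics import f1_score

def expression_macro_f1(gold_labels, predicted_labels):
    """Macro-averaged F1: each of the 14 expression categories is weighted
    equally, regardless of how often it occurs in the data."""
    return f1_score(gold_labels, predicted_labels, average="macro")
      </preformat>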
    </sec>
    <sec id="sec-14">
      <title>F. Training Hyperparameters</title>
      <p>This appendix lists the full hyperparameter configuration omitted from the main text. All fine-tuning
was conducted using the 4-bit quantized version of Qwen/Qwen3-32B [9] under a causal language
modeling objective via the QLoRA framework [10], on a workstation equipped with two NVIDIA
GeForce RTX 3090 GPUs (24 GB VRAM each), running Ubuntu 22.04 with CUDA 12.6 and PyTorch 2.8.0.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] X. Wang, Y. Xiao, J.-t. Huang, S. Yuan, R. Xu, H. Guo, Q. Tu, Y. Fei, Z. Leng, W. Wang, J. Chen, C. Li,
          Y. Xiao, InCharacter: Evaluating personality fidelity in role-playing agents through psychological
          interviews, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the
          Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
          Linguistics, Bangkok, Thailand, 2024, pp. 1840–1873. URL: https://aclanthology.org/2024.acl-long.102/.
          doi:10.18653/v1/2024.acl-long.102.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Y. Shao, L. Li, J. Dai, X. Qiu, Character-LLM: A trainable agent for role-playing, in: H. Bouamor,
          J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language
          Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13153–13187.
          URL: https://aclanthology.org/2023.emnlp-main.814/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] C. Li, Z. Leng, C. Yan, J. Shen, H. Wang, W. Mi, Y. Fei, X. Feng, S. Yan, H. Wang, L. Zhan, Y. Jia,
          P. Wu, H. Sun, ChatHaruhi: Reviving anime character in reality via large language model, 2023.
          URL: https://arxiv.org/abs/2308.09597. arXiv:2308.09597.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] N. Wang, Z. Peng, H. Que, J. Liu, W. Zhou, Y. Wu, H. Guo, R. Gan, Z. Ni, J. Yang, M. Zhang,
          Z. Zhang, W. Ouyang, K. Xu, W. Huang, J. Fu, J. Peng, RoleLLM: Benchmarking, eliciting, and
          enhancing role-playing abilities of large language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational
          Linguistics, Bangkok, Thailand, 2024, pp. 14743–14777. URL: https://aclanthology.org/2024.findings-acl.878/.
          doi:10.18653/v1/2024.findings-acl.878.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Q. Feng, Q. Xie, X. Wang, Q. Li, Y. Zhang, R. Feng, T. Zhang, S. Gao, EmoCharacter: Evaluating the
          emotional fidelity of role-playing agents in dialogues, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.),
          Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for
          Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for
          Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 6218–6240.
          URL: https://aclanthology.org/2025.naacl-long.316/. doi:10.18653/v1/2025.naacl-long.316.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] L. Huang, H. Lan, Z. Sun, C. Shi, T. Bai, EmotionalRAG: Enhancing Role-Playing Agents through
          Emotional Retrieval, in: 2024 IEEE International Conference on Knowledge Graph (ICKG), IEEE
          Computer Society, Los Alamitos, CA, USA, 2024, pp. 120–127.
          URL: https://doi.ieeecomputersociety.org/10.1109/ICKG63256.2024.00023. doi:10.1109/ICKG63256.2024.00023.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, MELD: A multimodal
          multi-party dataset for emotion recognition in conversations, in: A. Korhonen, D. Traum, L. Màrquez (Eds.),
          Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for
          Computational Linguistics, Florence, Italy, 2019, pp. 527–536. URL: https://aclanthology.org/P19-1050/.
          doi:10.18653/v1/P19-1050.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] S. M. Zahiri, J. D. Choi, Emotion detection on TV show transcripts with sequence-based
          convolutional neural networks, CoRR abs/1708.04299 (2017). URL: http://arxiv.org/abs/1708.04299.
          arXiv:1708.04299.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>