<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating LLM Alignment under Big Five Personality Prompting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hsien-Te Kao</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svitlana Volkova</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>94</fpage>
      <lpage>105</lpage>
      <abstract>
<p>Personality traits are increasingly incorporated into AI agents in applications involving human interaction. However, it remains uncertain whether LLMs truly manifest the intended traits when prompted. In this paper, we evaluate six LLMs: GPT-4o Mini, Llama 3.2, Mistral NeMo, Gemini 2.0 Flash Lite, Gemma 2, and Claude 3 Haiku by prompting high and low configurations of the Big Five personality traits and assessing both overall trait alignment and item-level inconsistencies by asking each prompted LLM to complete a personality survey. Our analysis reveals pronounced differences in how these LLMs express prompted personalities, with Llama 3.2, Mistral NeMo, and Claude 3 Haiku struggling to reflect specific trait items in low configurations, and GPT-4o Mini, Gemma 2, and Gemini 2.0 Flash Lite exhibiting strong personality alignment under high-configuration settings. These findings reveal a personality misalignment: LLMs do not necessarily express the intended traits as expected after personality prompting.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM Alignment</kwd>
        <kwd>Big Five</kwd>
        <kwd>Personality Prompting</kwd>
        <kwd>Personality Evaluation</kwd>
        <kwd>Personality Misalignment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Human teamwork has long tackled complex challenges, but it is rapidly evolving with AI agents.
Human and AI teaming is increasingly needed because modern multi-domain environments are so
complex and data rich that humans alone cannot process information or decide quickly enough, making
AI support essential [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. AI agents already show strong reasoning and planning skills, using task
decomposition, adaptive reflection, and external tools to handle multi-step challenges and uncertainties
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They complement human strengths by taking on heavy computation, enabling humans to focus on
creative and strategic work, fostering shared control and balanced collaboration [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Human and AI
teaming is shifting from rigid turn-based interaction to dynamic mixed-initiative collaboration where
AI adapts roles, coordinates tasks, and supports fluid teamwork in real time [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This evolution creates
opportunities in diverse workplaces, where trust, information sharing, and shared situation awareness
enable richer collective cognition and more resilient dynamics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Humans in these settings prefer AI
partners that feel human-like, showing adaptive skills, intuitive communication, and behaviors that
build trust by aligning with human norms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Designing such AI requires a framework for human
understandable and human-like qualities.
      </p>
      <p>
        Personality is a dynamic system of experiences, self-reflection, and behavioral patterns shaping how
individuals interpret themselves and coordinate with others [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The Big Five, comprising Extraversion,
Agreeableness, Conscientiousness, Neuroticism, and Openness, form the most widely accepted framework for
operationalizing these traits in research and assessments [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Cross-cultural validation of the Five Factor
Personality Inventory shows this structure is stable across cultures and languages, providing a basis for
comparisons and standardized evaluation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These traits strongly influence social interaction, with
Extraversion, Agreeableness, and Conscientiousness linked to richer networks, better relationships,
and stable interaction patterns [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In collaborative problem solving, both individual and group levels
of traits like Conscientiousness and Agreeableness drive better coordination, effective roles, and team
coherence [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In team selection and performance, traits like Conscientiousness, Extraversion, and
Emotional Stability support organized contributions, facilitative behavior, and sustained collaboration
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This solid personality research underpins the design of more human-like AI.
      </p>
      <p>
        AI is increasingly integrating personality to add a human touch, with many chatbot studies using Big
Five traits through deep learning to infer user tendencies and adjust tone and style for better engagement
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Recent work shows chatbots often encode explicit trait levels into personas trained on labeled
data, using prompt conditioning and reinforcement learning to maintain trait-consistent responses and
culturally aligned interactions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. AI agents that infer users’ traits and align their own personality, such
as adopting a serious and assertive style in high stakes interviews, can increase trust and compliance
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Agents with parameterized trait profiles can foster intimacy and commitment through higher
agreeableness and extraversion, strengthening engagement beyond task-oriented dialogue [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In
immersive environments, integrating traits into human digital twins driven by large language models
enables lifelike and contextually coherent behaviors [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Embedding these traits in digital twin
frameworks supports psychologically grounded simulations and adaptive interactions that enhance
personalization and alignment with user profiles [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This integration of personality traits allows AI
agents to interact more naturally, much like humans do.
      </p>
      <p>Big Five personality prompting serves as the backbone of human-like AI agents, providing clear,
configurable traits that shape their social interaction, team dynamics, and collaborative behavior.
However, whether AI agents genuinely reflect the Big Five traits as prompted remains uncertain, since
their expressions may not consistently align with the traits we intend to elicit. In this paper, we evaluate
the alignment of Big Five personality prompting across six popular LLMs: GPT-4o Mini, Llama 3.2,
Mistral NeMo, Gemini 2.0 Flash Lite, Gemma 2, and Claude 3 Haiku. We use the Big Five personality
survey to investigate both overall trait score and fine-grained survey item misalignment under high
and low trait configurations. Our key findings are: (1) these six LLMs vary in personality alignment
across traits and configuration levels; (2) Llama 3.2, Mistral NeMo, Gemini 2.0 Flash Lite, Gemma 2, and
Claude 3 Haiku show large misalignment for specific trait items across multiple personality traits in
low trait configurations; (3) GPT-4o Mini, Gemma 2, and Gemini 2.0 Flash Lite show strong personality
alignment in high trait configurations. These findings highlight an alignment gap, where the traits
LLMs display after personality prompting do not always align with the intended characteristics, falling
short of consistent expectations. While personality traits are well-defined constructs in psychology,
LLMs do not necessarily internalize, understand, or express them in ways that perfectly align with the
standard expectations. This reveals a critical blind spot at the core of personality prompting.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Previous studies have found that LLMs exhibit human-like, stable personality patterns when evaluated
using psychometrically grounded frameworks. They can exhibit consistent traits when initialized with
specific personality prompts, with self-reported Big Five scores aligning with targeted profiles and
producing linguistic markers identifiable by human raters, though classification accuracy diminishes
when AI authorship is disclosed [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Distinct profiles appear within the same LLM family, with
many LLMs naturally exhibiting higher agreeableness, openness, and conscientiousness, and with
customized psychological instruments yielding valid and fine-grained assessments [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Compared to
human datasets of over three million participants, LLMs consistently show elevated agreeableness and
conscientiousness, reduced neuroticism, and a high sensitivity to prompt structure and specificity [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
GPT-4 shows high extraversion, agreeableness, conscientiousness, and openness with strong internal
consistency, aligns with INTJ-like MBTI configurations, and scores lower than human norms on darker
traits such as psychopathy [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. ChatGPT also displays trait stability, particularly in conscientiousness,
with text-mining-informed assessments reducing hallucinations and uncovering profiles statistically
close to human distributions [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Early GPT-3 assessments likewise found broadly human-like trait
patterns and consistent value alignment, indicating that LLMs can be evaluated systematically for stable,
human-consistent personality expressions [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        Recent work shows that LLMs possess flexible personality expressions that can be modulated through
prompt design, persona instructions, and role-based conditioning, although outcomes vary by trait,
context, and architectural factors. Their traits fluctuate with shuffled question orders, scale with model size,
and vary across instantiated personas, suggesting high prompt dependence and contextual variability in
personality stability [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The Machine Personality Inventory and personality prompting frameworks
demonstrate that trait induction and quantification are feasible through structured interaction and
psychometric scaffolding [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Prompt-based conditioning can meaningfully shape LLM behavior, though
certain traits exhibit greater resilience across interactions, and multi-turn dialogues tend to modulate or
weaken initial persona effects [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Advanced models like GPT-4 can reliably emulate diverse Big Five
configurations, but maintaining role-consistent behavior over extended prompts becomes increasingly
difficult as persona complexity grows, indicating structural limits to prompt-driven control [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Under
targeted prompting, personality evaluations yield psychometrically valid outputs, and fine-tuned models
can robustly emulate specific human profiles, validating both the feasibility and ethical salience of
LLM persona design [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Even open-source LLMs exhibit identifiable MBTI and Big Five traits, and
while many resist trait enforcement via simple prompting, the combination of explicit trait cues with
domain-specific roles enhances consistency in personality expression [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
      <p>
        Researchers have further demonstrated that LLM personality is an emergent property governed
by model architecture, data composition, and interaction context, rather than a static or globally
uniform feature. Personality aligns with human-like MBTI patterns under controlled prompting, but
remains highly dependent on model-specific training and conditioning schemas [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Role-playing
agents show that personality alignment can be reliably assessed using tools such as InCharacter, with
LLMs replicating target personas while also revealing variability across character complexity and
scenario scale [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. Self-reported personalities from chatbots often diverge from conversational user
impressions, demonstrating that trait expression is context-sensitive and not uniformly reliable across
dialogue conditions [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Personality traits measured in LLM outputs can exhibit internal stability and
psychometric validity under structured prompting strategies, but overall alignment depends heavily
on model size and instruction tuning [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Priming LLMs with explicit personality descriptors or
diagnostic cues has been shown to guide models like GPT-2 and BERT toward expressing specific Big
Five dimensions, reinforcing the prompt-dependent malleability of emergent traits [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. Personality
experiments show that response-level editing can shift traits like neuroticism or agreeableness, yet
maintaining consistent, controllable personality across interactions remains a core challenge, illustrating
both the current boundaries and promise of LLM personality engineering [35].
      </p>
      <p>Earlier work has established that LLMs can exhibit stable and human-like personality traits under
psychometric evaluation. However, much of this work focuses on demonstrating the presence of
underlying personality traits in LLMs rather than rigorously testing how well prompted traits align
with intended configurations. Existing research largely assumes that when an LLM is prompted with a
specific trait level, the resulting expression will reflect that trait in a coherent and consistent manner.
However, the precision of this alignment in overall trait scores across different configuration levels,
particularly at the item level within validated psychological instruments, has not been thoroughly
evaluated. Our paper addresses this overlooked area by examining how well personality prompts
translate into the expected trait expressions across diverse LLMs in high and low trait configurations.
By using the Big Five personality survey to compare the intended trait prompt with the actual LLM
expressions, we reveal how and where these alignments break down. This enables a focused evaluation
of alignment precision at both the trait and item levels, thus filling a critical conceptual gap in LLM
personality research.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We examined personality prompting for six popular LLMs: GPT-4o Mini [36], Llama 3.2 [37], Mistral
NeMo [38], Gemini 2.0 Flash Lite [39], Gemma 2 [40], and Claude 3 Haiku [41]. Each LLM is asked
to take the 50-item IPIP Big Five personality survey [42] after personality prompting to evaluate
personality alignment based on 50 survey questions (10 questions per personality trait). We selected
these LLMs based on the latest versions available at the time of the experiment, generation cost,
generation speed, and their likelihood of being used for long, consecutive interactions. The IPIP Big Five
survey is open source and widely used in personality research [43]. It measures five broad dimensions of
personality: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness to Experience.
Extraversion reflects sociability, assertiveness, and enthusiasm. Agreeableness captures traits like
compassion, trust, and cooperativeness. Conscientiousness involves organization, dependability, and
goal-directed behavior. Neuroticism assesses emotional stability and the tendency to experience negative
emotions. Openness to Experience includes imagination, intellectual curiosity, and a preference for
novelty and variety. The IPIP Big Five survey has been validated across different cultures, demonstrating
strong cross-cultural reliability and construct validity [44]. The personality prompting consists of three
components: a specified personality trait, survey instructions, and the survey items. This structure
ensures that no other elements influence the LLMs’ expressions beyond the intended personality trait.</p>
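The three-component prompt structure described above can be sketched as follows; the helper name and the abbreviated item list are illustrative, not the authors' code:

```python
# Illustrative sketch of the three-component personality prompt:
# a specified trait, survey instructions, and the survey items.

TRAITS = ["Extraversion", "Agreeableness", "Conscientiousness",
          "Neuroticism", "Openness to Experience"]

def build_prompt(level, trait, items):
    """Assemble a personality prompt for one simulation.

    level: "High" or "Low"; trait: one of TRAITS;
    items: the 50 IPIP survey statements (abbreviated here).
    """
    assert level in ("High", "Low") and trait in TRAITS
    header = f"You have {level} {trait}."
    instructions = ("Rate each statement from 1 (Disagree) to 5 (Agree) "
                    "based on how accurately it describes you.")
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(items))
    return f"{header}\n{instructions}\n{numbered}"

# Example with two illustrative Extraversion items:
prompt = build_prompt("Low", "Extraversion",
                      ["Am the life of the party.", "Don't talk a lot."])
```

The fixed "You have [High/Low] [Trait]." header matches the prompt reported in the paper; the survey-instruction wording is an assumption.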
      <p>The personality prompt is: "You have [High/Low]
[Extraversion/Agreeableness/Conscientiousness/Neuroticism/Openness to Experience]." Each item is scored from 1 (Disagree) to 5 (Agree), with 10
items corresponding to each trait. We ran 100 simulations for 2 configuration levels (High and Low)
across 5 personality traits and 6 LLMs, resulting in a total of 6,000 simulations. Each simulation is scored
accordingly based on the IPIP Big Five survey scoring scheme. The evaluation focuses on the trait items
and the trait score corresponding to the simulated personality trait. The trait score is rescaled from a
total of 50 to a 5-point scale to align with the item-level scoring. For the low-trait configuration, the
misalignment score for each trait item is calculated by summing the number of misaligned responses,
where a score of 3 or higher is considered misaligned, and dividing by the 100 simulations. We consider
scores of 1 and 2 as aligned in the low-trait configuration to add flexibility that better reflects human
interpretation of low trait expression. For the high-trait configuration, the misalignment score is
calculated similarly, except a score of 3 or lower is considered misaligned. Reverse-scored items are
properly handled by converting their scores before alignment evaluation. The items are then sorted
based on their misalignment scores to identify the single most misaligned item per trait
and configuration.</p>
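The scoring procedure above can be sketched in a few lines; function names are illustrative, and the scores are 1–5 item responses collected over the 100 simulations per condition:

```python
# Sketch of the alignment evaluation described above.

def reverse_score(score):
    """Map a reverse-keyed item back onto the shared scale (1<->5, 2<->4)."""
    return 6 - score

def item_misalignment(scores, config):
    """Fraction of simulations in which an item response is misaligned.

    Low configuration: 3 or higher counts as misaligned (1 and 2 are
    treated as aligned). High configuration: 3 or lower counts as
    misaligned. `scores` are already reverse-converted where needed.
    """
    if config == "low":
        misaligned = sum(1 for s in scores if s >= 3)
    else:  # "high"
        misaligned = sum(1 for s in scores if s <= 3)
    return misaligned / len(scores)

def rescale_trait_score(total):
    """Rescale a 10-item trait total (max 50) to the 5-point item scale."""
    return total / 10
```

For instance, item responses of [1, 2, 3, 4, 5] yield a 60% misalignment score under the low configuration (the 3, 4, and 5 count as misaligned), and a trait total of 49 rescales to 4.9.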
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Extraversion</title>
        <p>GPT-4o Mini demonstrates the strongest personality alignment across both low and high Extraversion
configurations, achieving a score of 1.15 for low Extraversion and 4.93 for high Extraversion, and when
prompted with a restrained or expressive trait its responses consistently match the expected orientation
with no instance where a response contradicts the intended behavior. Gemini 2.0 Flash Lite follows
closely, scoring 1.17 for low Extraversion and 5.00 for high Extraversion; when prompted with low
Extraversion we expect responses that avoid signaling comfort in social settings, but instead it shows
a readiness for social ease on “Feel comfortable around people” with misalignment score 40%, while
under high Extraversion we expect responses that openly signal social engagement and it fully delivers
without any conflicting behavior. Gemma 2 maintains strong alignment with 1.23 for low Extraversion
and 4.90 for high Extraversion, and when prompted with each trait its responses precisely reflect the
expected restrained or expressive stance with no single item showing conflicting behavior.</p>
        <p>Mistral NeMo, with a low Extraversion score of 1.64, shows that when prompted for restrained
responses we expect indications of discomfort in social contexts, yet it instead signals strong comfort
on “Feel comfortable around people” with misalignment score 92%, and with high Extraversion at 4.63,
where we expect expressive behavior and active engagement, it unexpectedly signals reluctance to
draw attention on “Don’t like to draw attention to myself” with misalignment score 13%. Llama 3.2,
with a low Extraversion score of 1.66, also diverges when prompted with low Extraversion, where
we expect avoidance of social ease but instead see comfort on “Feel comfortable around people” with
misalignment score 37%, and with high Extraversion at 4.54, where we expect active participation, it
instead produces reserved behavior on “Don’t talk a lot” with misalignment score 27%. Claude 3 Haiku
records the weakest alignment with a low Extraversion score of 2.19 and a high Extraversion score of
4.32, and when prompted with low Extraversion we expect reserved responses but it produces a direct
signal of comfort in social situations on “Feel comfortable around people” with misalignment score
100%, while under high Extraversion we expect expressive responses but instead see reserved behavior
on “Am quiet around strangers” with misalignment score 26%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Agreeableness</title>
        <p>GPT-4o Mini demonstrates the strongest personality alignment across both low and high Agreeableness
configurations, achieving a score of 1.21 for low Agreeableness and 5.00 for high Agreeableness; when
prompted with low Agreeableness we expect abrasive responses, and its output aligns almost perfectly, with “Insult people” showing a
misalignment score of only 1%, while under high Agreeableness we expect cooperative responses and it fully
delivers with no conflicting behavior. Gemma 2 follows with a score of 1.33 for low Agreeableness and
5.00 for high Agreeableness; when prompted with low Agreeableness we expect sharp unkindness but
instead it completely reverses that orientation on “Insult people” with misalignment score 100%, while
under high Agreeableness we expect caring cooperative responses and it maintains perfect consistency
with no misalignment. Llama 3.2 records a score of 1.83 for low Agreeableness and 4.42 for high
Agreeableness; when prompted with low Agreeableness we expect indifference toward others but it
unexpectedly signals engagement on “Am not really interested in others” with misalignment score 53%,
and under high Agreeableness we expect concern for others yet it signals detachment on “Feel little
concern for others” with misalignment score 83%.</p>
        <p>Gemini 2.0 Flash Lite shows a score of 2.14 for low Agreeableness and 5.00 for high Agreeableness;
when prompted with low Agreeableness we expect clear antagonistic responses but instead it avoids
that stance on “Insult people” with misalignment score 95%, while under high Agreeableness we expect
harmonious cooperation and it fully aligns with no contradictory behavior. Mistral NeMo follows with
2.16 for low Agreeableness and 4.89 for high Agreeableness; when prompted with low Agreeableness
we expect direct unkind responses but it softens that expectation on “Insult people” with misalignment
score 77%, and under high Agreeableness we expect consistent kindness which it largely provides with
no item-level conflicts. Claude 3 Haiku exhibits the weakest personality alignment with a score of 3.39
for low Agreeableness and 4.77 for high Agreeableness; when prompted with low Agreeableness we
expect hostile responses but it instead produces almost entirely opposite cooperative behavior on “Insult
people” with misalignment score 99%, while under high Agreeableness we expect caring responses but
it shows some variance despite an overall high trait score.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Conscientiousness</title>
        <p>GPT-4o Mini demonstrates the strongest personality alignment across both low and high
Conscientiousness configurations, achieving a score of 1.00 for low Conscientiousness and 5.00 for high
Conscientiousness, and when prompted with either a disorganized or highly organized trait its
responses consistently match the expected orientation with no instance where a response contradicts the
intended behavior. Gemini 2.0 Flash Lite follows closely, scoring 1.41 for low Conscientiousness and
5.00 for high Conscientiousness; when prompted with low Conscientiousness we expect responses that
signal carelessness and lack of order, and all responses adhere to this without contradiction, while under
high Conscientiousness we expect highly organized behavior and it fully delivers with no conflicting
response. Mistral NeMo, with a low Conscientiousness score of 1.66, shows that when prompted for
low Conscientiousness we expect responses reflecting disorder, yet on “Make a mess of things” with
misalignment score 31% it instead signals unexpected orderliness, while with high Conscientiousness
at 4.73 we expect consistently organized behavior and all responses follow that expectation.</p>
        <p>Gemma 2, with a low Conscientiousness score of 1.88, reveals that when prompted for low
Conscientiousness we expect a willingness to shirk duties, yet on “Shirk my duties” with misalignment score
100% it unexpectedly signals full responsibility, while with high Conscientiousness at 5.00 we expect
organized responses and this alignment is fully maintained. Llama 3.2, with a low Conscientiousness
score of 2.15, shows when prompted for low Conscientiousness we expect unpreparedness, yet on “Am
always prepared” with misalignment score 61% it signals the opposite, and with high Conscientiousness
at 4.38 we expect careful organization but on “Leave my belongings around” with misalignment score
52% it instead signals careless behavior. Claude 3 Haiku records the weakest alignment with a low
Conscientiousness score of 3.36, where we expect inattentiveness but on “Pay attention to details” with
misalignment score 100% it signals full attention, and with high Conscientiousness at 4.58 we expect
organized responses yet on “Often forget to put things back in their proper place” with misalignment
score 18% it signals unexpected forgetfulness.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Neuroticism</title>
        <p>Gemma 2 achieves the strongest personality alignment across both low and high Neuroticism
configurations, scoring 1.00 for low Neuroticism and 4.98 for high Neuroticism, and when prompted with low
Neuroticism we expect responses indicating emotional stability without signs of worry, and indeed
every response adheres precisely with no conflicting behavior, while under high Neuroticism we expect
responses signaling emotional volatility and again see no deviation. Gemini 2.0 Flash Lite follows
closely, with a score of 1.07 for low Neuroticism and 4.97 for high Neuroticism; when prompted with
low Neuroticism we expect responses avoiding worry or tension and all responses match without
contradiction, and when prompted with high Neuroticism we expect signs of emotional fluctuation and
again observe full alignment. GPT-4o Mini also demonstrates exceptional alignment, scoring 1.18 for
low Neuroticism and 4.98 for high Neuroticism, with responses under low Neuroticism consistently
avoiding worry as expected and under high Neuroticism consistently exhibiting the expected signs of
emotional unease with no single item diverging.</p>
        <p>Llama 3.2, with a low Neuroticism score of 1.91, shows misalignment under low Neuroticism where
we expect responses indicating calmness but instead see a response indicating worry on “Worry about
things” with misalignment score 49%, and under high Neuroticism at 4.35, where we expect overt
emotional instability, it unexpectedly produces a response downplaying such instability on “Often feel
blue” with misalignment score 51%. Mistral NeMo, with a low Neuroticism score of 1.92, similarly
misaligns under low Neuroticism by signaling worry when calm behavior is expected on “Worry about
things” with misalignment score 51%, and under high Neuroticism at 4.74, where heightened emotional
expression is expected, it instead shows partial composure on “Often feel blue” with misalignment score
19%. Claude 3 Haiku shows the weakest alignment, with a low Neuroticism score of 2.07, where instead
of the expected calm stance it signals concern on “Worry about things” with misalignment score 42%,
and under high Neuroticism at 3.86, where we expect signs of emotional turbulence, it reverses the trait
by signaling calmness on “Am relaxed most of the time” with misalignment score 93%.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Openness to Experience</title>
        <p>Gemini 2.0 Flash Lite demonstrates the strongest personality alignment when prompted with low
Openness to Experience, achieving a score of 1.72, and with high Openness to Experience, achieving
4.93; yet when configured for low Openness to Experience we expect restrained cognitive responses but
instead see a direct signal of cognitive agility on “Am quick to understand things” with misalignment
score 100%, while under high Openness to Experience we expect imaginative and receptive responses
and it delivers without any conflicting behavior. GPT-4o Mini follows with a low Openness to Experience
score of 2.00 and a high Openness to Experience score of 4.76; when prompted with low Openness
to Experience we expect difficulty with rapid abstract grasp, but its response unexpectedly signals
readiness on “Am quick to understand things” with misalignment score 98%, while under high Openness
to Experience it produces responses aligned with the expected imaginative configuration and no
misaligned items. Gemma 2 records 2.27 for low Openness to Experience and 4.75 for high Openness to
Experience; when prompted with low Openness to Experience we expect indications of struggling with
abstraction but instead see fluent understanding on “Have difficulty understanding abstract ideas” with
misalignment score 100%, whereas its high Openness to Experience responses remain consistent with
the expected trait.</p>
        <p>Mistral NeMo, with 2.98 for low Openness to Experience and 4.81 for high Openness to Experience,
shows that when configured for low Openness to Experience we expect limited verbal range yet it
signals extensive lexical ability on “Have a rich vocabulary” with misalignment score 100%, while under
high Openness to Experience it sustains precise alignment without any conflicting response. Llama 3.2,
at 2.53 for low Openness to Experience and 4.28 for high Openness to Experience, demonstrates that
when prompted with low Openness to Experience we expect minimal cognitive agility but instead see
signs of rapid understanding on “Am quick to understand things” with misalignment score 89%, and
when prompted with high Openness to Experience we expect enthusiasm for abstraction yet it signals
disinterest on “Am not interested in abstract ideas” with misalignment score 50%. Claude 3 Haiku, with
the weakest alignment at 3.83 for low Openness to Experience and 4.59 for high Openness to Experience,
reveals that when configured for low Openness to Experience we expect limited vocabulary but instead
observe expansive expression on “Have a rich vocabulary” with misalignment score 100%, and when
configured for high Openness to Experience it produces responses largely aligned with the intended
trait without severe divergence.</p>
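      <p>To make the item-level numbers above concrete, the following sketch scores a prompted LLM's
survey answers. It is an illustration rather than the authors' exact procedure: the 5-point Likert scale,
the "+"/"-" item keying, and the midpoint rule for flagging a misaligned item are assumptions.</p>

```python
# Sketch of item-level alignment scoring (assumptions: a 5-point Likert
# scale, "+" items keyed toward the trait, "-" items reverse-keyed, and
# "misaligned" meaning the keyed score lands on the wrong side of the
# scale midpoint for the prompted configuration).

def keyed_score(response: int, key: str) -> int:
    """Reverse-score negatively keyed items on a 1-5 scale."""
    return response if key == "+" else 6 - response

def trait_mean(responses, keys):
    """Mean keyed score across all items of one trait."""
    scored = [keyed_score(r, k) for r, k in zip(responses, keys)]
    return sum(scored) / len(scored)

def misaligned_items(responses, keys, config):
    """Indices of items that contradict the prompted high/low configuration."""
    bad = []
    for i, (r, k) in enumerate(zip(responses, keys)):
        s = keyed_score(r, k)
        if (config == "high" and s < 3) or (config == "low" and s > 3):
            bad.append(i)
    return bad

# A model prompted for LOW Openness that nevertheless strongly agrees (5)
# with a positively keyed item such as "Have a rich vocabulary":
responses = [5, 2, 4, 4]          # raw Likert answers
keys = ["+", "+", "-", "-"]       # item keying
print(trait_mean(responses, keys))               # 2.75
print(misaligned_items(responses, keys, "low"))  # [0]
```

      <p>Under such a rule, a single contradictory item (here, strong agreement with “Have a rich
vocabulary” under a low configuration) is flagged even when the overall trait mean still lands on the
intended side of the scale.</p>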
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>We investigated whether LLMs precisely express the Big Five personality traits when explicitly prompted,
revealing substantial variation in trait alignment across different LLMs and configurations. GPT-4o
Mini showed the strongest and most consistent alignment, accurately reflecting both high and low trait
expressions across all five dimensions with virtually no conflicting responses. Gemma 2 and Gemini 2.0
Flash Lite also demonstrated strong personality alignment, particularly under high-trait conditions, but
showed notable inconsistencies when simulating low Agreeableness and low Openness to Experience,
often reverting to prosocial or cognitively fluent outputs. In contrast, Mistral NeMo, Llama 3.2, and
Claude 3 Haiku struggled more significantly, particularly in low-trait scenarios such as low Extraversion,
low Conscientiousness, and low Neuroticism, frequently exhibiting expressions that contradicted the
intended personality trait. Claude 3 Haiku showed the weakest overall alignment, with high rates of
misaligned items across multiple traits. LLMs consistently struggled with low-trait prompts, often
defaulting to socially desirable or emotionally stable responses, limiting accurate reflection of restrained
or volatile personality traits.</p>
      <p>The findings extend the existing body of work by demonstrating that while LLMs are increasingly
capable of expressing personality traits through prompt-based conditioning, their ability to express the
full spectrum of traits, particularly low-trait configurations, remains uneven and LLM dependent. Prior
research established that LLMs can express personality through structured prompts and psychometric
frameworks, but our results add critical nuance by showing that this expression is more reliable for
traits that align with socially desirable or cognitively fluent behavior, such as high Agreeableness
or high Openness. In contrast, traits associated with emotional volatility, disengagement, or
nonnormative responses, such as low Neuroticism or low Extraversion, are more difficult for many LLMs
to exhibit accurately. This reflects a consistent tendency among LLMs to generate emotionally steady,
prosocial, or cognitively coherent responses, which can interfere with efforts to elicit less typical or
socially dispreferred personality expressions. Our findings build on prior literature by offering a direct
comparison of personality alignment across LLMs using explicit trait-based prompting, highlighting
where expression succeeds and where it falters. This contributes empirical clarity to ongoing questions
about the boundaries of prompt-driven personality expression.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>AI agents are becoming increasingly important partners in complex, data-rich environments where
human teams alone can no longer keep pace. As a result, there is growing interest in deploying AI agents
that display human-like personality traits to support more authentic interaction with human teammates.
Yet it remains unclear whether personality traits prompted in LLMs are consistently expressed in ways
that reflect the intended characteristics. This paper evaluated the alignment of Big Five personality
prompting across six widely used LLMs to determine whether the traits expressed reflect the intended
configurations. We found that personality alignment varies across LLMs and trait levels. GPT-4o Mini,
Gemma 2, and Gemini 2.0 Flash Lite demonstrated strong alignment in high trait configurations, while
Llama 3.2, Mistral NeMo, and Claude 3 Haiku exhibited notable misalignment at specific trait items,
particularly in low trait settings. These findings reveal an important LLM alignment gap between
personality prompting and personality expression. Researchers, corporations, and agencies deploying
personality-prompted AI agents should consider whether these agents genuinely reflect the intended
overall traits and specific trait characteristics. They may encounter surprising, unintended personality
expressions in their AI agents.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Grammar and spelling
check, Paraphrase and reword. After using this tool/service, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work is supported by the Defense Advanced Research Projects Agency (DARPA) contracts
HR00112490410, HR00112490408, and HR0011-24-3-0325. The views and conclusions contained in this
document are those of the authors and should not be interpreted as representing the official policies,
either expressed or implied, of the U.S. Government.</p>
      <p>[35] S. Mao, X. Wang, M. Wang, Y. Jiang, P. Xie, F. Huang, N. Zhang, Editing personality for large
language models, in: CCF International Conference on Natural Language Processing and Chinese
Computing, Springer, 2024, pp. 241–254.</p>
      <p>[36] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).</p>
      <p>[37] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur,
A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).</p>
      <p>[38] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand,
G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,
T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.</p>
      <p>[39] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,
et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv
preprint arXiv:2403.05530 (2024).</p>
      <p>[40] G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard,
B. Shahriari, A. Ramé, et al., Gemma 2: Improving open language models at a practical size, arXiv
preprint arXiv:2408.00118 (2024).</p>
      <p>[41] Anthropic, Model card for the claude 3 model family: Opus, sonnet, haiku,
https://www.anthropic.com/index/claude-3-model-card, 2024. Accessed: 2025-08-02.</p>
      <p>[42] M. B. Donnellan, F. L. Oswald, B. M. Baird, R. E. Lucas, The mini-ipip scales: tiny-yet-effective
measures of the big five factors of personality, Psychological Assessment 18 (2006) 192.</p>
      <p>[43] E. Topolewska, E. Skimina, W. Strus, J. Cieciuch, T. Rowiński, The short ipip-bfm-20 questionnaire
for measuring the big five, Roczniki Psychologiczne 17 (2014) 385–402.</p>
      <p>[44] P. J. Kajonius, Cross-cultural personality differences between East Asia and Northern Europe in
ipip-neo, International Journal of Personality Psychology 3 (2017) 1–7.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Human-AI Teaming: State-of-the-Art and Research Needs</article-title>
          ,
          <source>National Academies of Sciences, Engineering and Medicine</source>
          , Washington DC (
          <year>2022</year>
          )
          <fpage>26355</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Masterman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Besen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sawtell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <article-title>The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2404.11584</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Raghu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Unraveling human-ai teaming: A review and outlook</article-title>
          ,
          <source>arXiv preprint arXiv:2504.05755</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gervasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sequeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Marion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bakst</surname>
          </string-name>
          , H. Gent,
          <article-title>Ai as collaborative partner: Rethinking human-ai teaming for the real world</article-title>
          ,
          <source>in: Proceedings of the AAAI Symposium Series</source>
          , volume
          <volume>5</volume>
          ,
          <year>2025</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Iftikhar</surname>
          </string-name>
          , Y.-T. Chiu,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Caudwell</surname>
          </string-name>
          ,
          <article-title>Human-agent team dynamics: A review and future research opportunities</article-title>
          ,
          <source>IEEE Transactions on Engineering Management</source>
          <volume>71</volume>
          (
          <year>2023</year>
          )
          <fpage>10139</fpage>
          -
          <lpage>10154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>McNeese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Musick</surname>
          </string-name>
          ,
          <article-title>"An ideal human": expectations of ai teammates in human-ai teaming</article-title>
          ,
          <source>Proceedings of the ACM on Human-Computer Interaction</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O. F.</given-names>
            <surname>Kernberg</surname>
          </string-name>
          , What is personality?,
          <source>Journal of personality disorders 30</source>
          (
          <year>2016</year>
          )
          <fpage>145</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. P.</given-names>
            <surname>John</surname>
          </string-name>
          ,
          <article-title>An introduction to the five-factor model and its applications</article-title>
          ,
          <source>Journal of personality 60</source>
          (
          <year>1992</year>
          )
          <fpage>175</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jolijn Hendriks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perugini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Angleitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , F. De Fruyt,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hřebíčková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kreitler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Murakami</surname>
          </string-name>
          , D. Bratko, et al.,
          <article-title>The five-factor personality inventory: cross-cultural generalizability across 13 countries</article-title>
          ,
          <source>European journal of personality 17</source>
          (
          <year>2003</year>
          )
          <fpage>347</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Asendorpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wilpers</surname>
          </string-name>
          ,
          <article-title>Personality effects on social relationships</article-title>
          .,
          <source>Journal of personality and social psychology 74</source>
          (
          <year>1998</year>
          )
          <fpage>1531</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jolić Marjanović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Krstić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Stepanović</given-names>
            <surname>Ilić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Videnović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Altaras</surname>
          </string-name>
          <string-name>
            <surname>Dimitrijević</surname>
          </string-name>
          ,
          <article-title>The big five and collaborative problem solving: a narrative systematic review</article-title>
          ,
          <source>European Journal of Personality</source>
          <volume>38</volume>
          (
          <year>2024</year>
          )
          <fpage>457</fpage>
          -
          <lpage>475</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F. P.</given-names>
            <surname>Morgeson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Reider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Campion</surname>
          </string-name>
          ,
          <article-title>Selecting individuals in team settings: The importance of social skills, personality characteristics, and teamwork knowledge</article-title>
          ,
          <source>Personnel psychology 58</source>
          (
          <year>2005</year>
          )
          <fpage>583</fpage>
          -
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. Ait</given-names>
            <surname>Baha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>El Hajji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Es-Saady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fadili</surname>
          </string-name>
          ,
          <article-title>The power of personalization: A systematic review of personality-adaptive chatbots</article-title>
          ,
          <source>SN Computer Science</source>
          <volume>4</volume>
          (
          <year>2023</year>
          )
          <fpage>661</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sutcliffe</surname>
          </string-name>
          ,
          <article-title>A survey of personality, persona, and profile in conversational agents and chatbots</article-title>
          ,
          <source>arXiv preprint arXiv:2401.00609</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>M. X. Zhou</surname>
            , G. Mark,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Trusting virtual agents: The efect of personality</article-title>
          ,
          <source>ACM Transactions on Interactive Intelligent Systems (TiiS) 9</source>
          (
          <issue>2019</issue>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanijja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Thapliyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>What affects the usage of artificial conversational agents? an agent personality and love theory perspective</article-title>
          ,
          <source>Computers in Human Behavior</source>
          <volume>145</volume>
          (
          <year>2023</year>
          )
          <fpage>107788</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Brito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Dollis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S. F. B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Sousa</surname>
          </string-name>
          , et al.,
          <article-title>Integrating personality into digital humans: A review of llm-driven approaches for virtual reality</article-title>
          ,
          <source>arXiv preprint arXiv:2503.16457</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nugent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cleland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <article-title>Human digital twin: A survey</article-title>
          ,
          <source>Journal of Cloud Computing</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>131</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Breazeal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kabbara</surname>
          </string-name>
          ,
          <article-title>Personallm: Investigating the ability of large language models to express personality traits</article-title>
          ,
          <source>arXiv preprint arXiv:2305.02547</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhandari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Naseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nasim</surname>
          </string-name>
          ,
          <article-title>Evaluating personality traits in large language models: Insights from psychological questionnaires</article-title>
          ,
          <source>in: Companion Proceedings of the ACM on Web Conference</source>
          <year>2025</year>
          ,
          <year>2025</year>
          , pp.
          <fpage>868</fpage>
          -
          <lpage>872</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          , L. Sun,
          <article-title>Quantifying ai psychology: A psychometrics benchmark for large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2406.17675</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Wu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Humanizing llms: A survey of psychological measurements with tools, datasets, and human-agent applications</article-title>
          ,
          <source>arXiv preprint arXiv:2505.00049</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Shang,
          <article-title>Humanity in ai: Detecting the personality of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2410.08545</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rossberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <article-title>Who is gpt-3? an exploration of personality, values and demographics</article-title>
          ,
          <source>arXiv preprint arXiv:2209.14338</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tommaso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hegazy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lemay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abukalam</surname>
          </string-name>
          , I. Rish, G. Dumas,
          <article-title>Llms and personalities: Inconsistencies across scales</article-title>
          , in: NeurIPS 2024 Workshop on Behavioral Machine Learning, 2024.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Mpi: Evaluating and inducing personality in pre-trained language models</article-title>
          ,
          <source>arXiv preprint arXiv:2206.07550</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I.</given-names>
            <surname>Frisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giulianelli</surname>
          </string-name>
          ,
          <article-title>Llm agents in interaction: Measuring personality consistency and linguistic alignment in interacting populations of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.02896</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Ones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Evaluating the ability of large language models to emulate personality</article-title>
          ,
          <source>Scientific Reports 15</source>
          (
          <year>2025</year>
          )
          <fpage>519</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>G.</given-names>
            <surname>Serapio-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Safdari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Crepy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdulhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Faust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matarić</surname>
          </string-name>
          ,
          <article-title>Personality traits in large language models</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>La Cava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tagarelli</surname>
          </string-name>
          ,
          <article-title>Open models, closed minds? on agents capabilities in mimicking human personalities through open large language models</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>39</volume>
          ,
          <year>2025</year>
          , pp.
          <fpage>1355</fpage>
          -
          <lpage>1363</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Do llms possess a personality? making the mbti test an amazing evaluation for large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.16180</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , J.-t. Huang,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Leng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews</article-title>
          ,
          <source>arXiv preprint arXiv:2310.17976</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Can llm "self-report"?: Evaluating the validity of self-report scales in measuring personality design in llm-based chatbots</article-title>
          ,
          <source>arXiv preprint arXiv:2412.00207</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>G.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Identifying and manipulating the personality traits of language models</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10276</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>