
Identity by Design? Evaluating Gender Conditioning in LLM-Generated Agent Identity Profiles

Mattia Rampazzo (0, 2, 3)
Saba Ghanbari Haez (0, 1, 2)
Patrizio Bellan (0, 2)
Simone Magnolini (0, 2)
Leonardo Sanna (0, 2)
Mauro Dragoni (0, 2)

0 Fondazione Bruno Kessler, Povo, Italy
1 Free University of Bozen-Bolzano, Bolzano, Italy
2 Identity-Aware AI workshop at 28th European Conference on Artificial Intelligence
3 University of Trento, Trento, Italy

2025


In multi-agent reasoning frameworks powered by large language models, agent roles are often instantiated through identity descriptions that condition their behavior. This paper investigates whether and how the gender assigned to the agent responsible for defining role-specific identity profiles affects the linguistic identity, sentiment, and gender expression of downstream agents. We introduce an extensive corpus of agent identity descriptions generated under controlled combinations of frameworks, roles, models, and gender conditions. Through quantitative and qualitative linguistic analysis, we observe a consistent skew toward female identity across models and roles when gender is unspecified, along with varying degrees of polarity and subjectivity depending on the description framework. Notably, cognitively-oriented frameworks suppress affective expression, while trait-based frameworks amplify gender alignment. These results reveal that identity conditioning is not solely determined by prompt parameters, but emerges through a layered interaction of model priors, framework semantics, and role-specific expressive constraints.

Keywords: Agent Identity Description Framework; Gender Bias Detection; Synthetic Personas; Gender-Conditioned Prompting for Identity Descriptions in LLMs; Prompt Conditioning Bias in LLMs

1. Introduction

[...] we ask whether this framing influences the text used to define other agents. In this setting, every description pertains to a top-tier expert function rather than to casual or novice personas.

To study this question empirically, we constructed a large-scale collection of role descriptions generated under controlled variation. Thirteen psychologically grounded identity description frameworks were combined with five task domains. Within each configuration, the Psychologist agent generated role descriptions for the rest of the ensemble, tailoring them to the contextual demands of the dataset and framework. The procedure was repeated across thirteen base language models, yielding thousands of descriptions whose variation is structured by four factors: the Psychologist’s gender constraint, the identity description framework, the base language model, and the task domain.

To evaluate whether gender conditioning of the prompt-giver leaves measurable traces in these expert profiles, we analyze each description along three axes: polarity as an index of emotional valence, subjectivity as a measure of evaluative stance versus factual tone, and a discrete sentiment label to capture categorical afect. These metrics allow us to probe tonal and stylistic shifts associated with the Psychologist’s gender, and to determine whether such shifts are consistent across roles, frameworks, tasks, and models.
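As a minimal sketch of how these three axes could be computed for a single description, the snippet below reads polarity (in [-1, 1]) and subjectivity (in [0, 1]) from TextBlob's `sentiment` property. The mapping from polarity to a discrete label, including the `neutral_band` threshold, is an assumption made for illustration only; how the categorical label was derived is not specified here.

```python
from typing import Dict


def sentiment_label(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a discrete sentiment label.

    ASSUMPTION: the symmetric neutral band is illustrative, not the
    thresholding used in the study.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def score_description(text: str) -> Dict[str, object]:
    """Return polarity, subjectivity, and a discrete label for one description."""
    # Third-party dependency, imported lazily so that sentiment_label can be
    # used and tested without TextBlob installed.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return {
        "polarity": sentiment.polarity,
        "subjectivity": sentiment.subjectivity,
        "label": sentiment_label(sentiment.polarity),
    }
```

Keeping the labeling rule separate from the scoring call makes it easy to vary the neutral band and check how sensitive the sentiment distribution is to that choice.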

The central question guiding this study is whether gender conditioning of the identity-generating agent influences the content and tone of the role descriptions it produces.

RQ: Do LLMs produce systematically different identity descriptions when gender is varied in the prompt, and how are these effects modulated by the identity description framework and the language model?

This research question arises from the observation that identity descriptions—used to define virtual agents, e.g., in multi-role systems—are often treated as neutral artifacts, yet they are themselves generated by another model whose prompt-based framing can bias the textual outcome.

The analyses we conducted revealed consistent effects across multiple dimensions. Gender conditioning influences the emotional tone of the text, the use of subjective versus objective language, and the distribution of adjectives that convey traits such as assertiveness, empathy, or decisiveness. In many cases, the perceived gender of the generated description aligns with the constraint applied to the Psychologist, even when the target role is not gendered. By highlighting these subtle, yet systematic effects, our work contributes to a deeper understanding of how identity framing in prompts can shape the stylistic and semantic qualities of LLM-generated text. We make the full corpus of identity descriptions publicly available at github.com/patriziobellan86/Identity-by-Design-Evaluating-Gender-Conditioning-in-LLM-Generated-Agent-Identity-Profiles.

The paper is structured as follows: Section 2 reviews prior work on agent personality and bias in language generation. Section 3 introduces the Pool of Experts (PoE) architecture and the identity description generation pipeline. Section 4 presents our experimental design and describes the linguistic analysis procedures. Section 5 presents results and discusses the main findings. Section 6 concludes the paper and outlines directions for future research.

2. Related Work

Agent Personality. Research on conversational agents has long drawn on psychology, particularly personality modeling frameworks such as the Big Five traits [4, 5, 6]. Early work such as [7] directly embedded these traits into conversational agent design, paving the way for subsequent studies that emphasize personalization, reliability, and richer user engagement. Within healthcare, for example, [8, 9] demonstrate how familiar personas can strengthen trust and empathy, while [10] show that tailoring responses to user personality improves therapeutic outcomes.

Outside healthcare, broader strategies have been explored for role-driven behavior in LLMs. For instance, [11] employ prompt-based methods to enhance contextual awareness, whereas [12] demonstrate that embedding personality traits directly into LLMs enables more flexible and diverse conversational styles. Expanding on personalization, [13] investigate preference evaluation, identifying limitations of simplistic persona construction, while [14] enrich dialogue quality by integrating structured persona data.

Another active line of research addresses scaling persona diversity. [15] contribute methods for building more realistic and varied personas, and [16] introduce the Synthetic-Persona-Chat dataset through persona-aligned dialogue generation. More recently, scholars have examined whether LLMs maintain stable trait-like or identity-driven behaviors when prompted with synthetic personas. For example, [17] find consistent expression of Big Five traits with implications for bias and downstream task performance, while [18] analyze how models internalize social identities and reflect them in political or ethical stances. Complementing this, [19] extend identity integration to human annotators by embedding sociodemographic attributes, though with limited predictive benefits.

Taken together, this body of work highlights the expressive range of persona-based modeling in LLMs, yet its focus remains largely tied to narrow psychological frameworks or specific downstream tasks. Our contribution expands this scope by systematically comparing multiple psychologically grounded description frameworks and testing their influence across heterogeneous reasoning tasks. Extending this line of inquiry, we further manipulate the identity of the profile-generating agent itself (the “Psychologist”), testing whether upstream gender cues propagate into the role descriptions of other expert agents. By combining 13 psychological description frameworks, 5 reasoning tasks, and 13 LLMs, we provide a comprehensive evaluation of how prompt-giver identity shapes persona construction.

Profile Definition and Persona Generation. Recent studies have explored how LLMs can define and generate character profiles to support more diverse agent behaviors. [20] introduce Persona Hub, a large repository of synthetic personas designed to steer LLM outputs toward distinctive styles. Building on this, [21] fine-tune models with persona-driven corpora, enabling generalization across arbitrary traits so that models can role-play characters with specific personalities or backgrounds. Such structured profile definitions expand an LLM’s ability to assume diverse identities.

A parallel development is the use of LLMs themselves for persona-aligned content generation. [16] propose a Generator–Critic pipeline to create high-quality persona-aligned dialogues, illustrating how structured prompt chains can scale synthetic identity creation. Similarly, [22] show that automatically generated expert roles improve reasoning diversity and factual accuracy. These works demonstrate the potential of structured persona generation to enrich LLM interactions, while also raising questions about how authorship conditions outcomes.

Our work directly addresses this gap by situating profile authorship within a Pool of Experts architecture, where a Psychologist Agent is responsible for generating all other role profiles. We uniquely test how the Psychologist’s gender framing conditions persona generation, making profile authorship itself a locus of identity bias.

Bias and Gender Effects in LLMs. Bias in LLM outputs has become a central concern, with numerous studies showing that models reproduce human-like stereotypes across gender, race, and other identities [23]. For example, [24] report that women are more often described with communal traits and men with agentic traits, while [25] uncover gendered occupational associations. [26] further demonstrate that even advanced systems such as GPT-4 and Claude exhibit gender bias in job interview scenarios. Beyond gender, [23] show that LLMs display ingroup favoritism and outgroup derogation, pointing to deep-seated social identity biases. Although mitigation strategies such as curated training data and bias-aware fine-tuning show promise, addressing these issues remains a persistent challenge.

An emerging line of research reveals that bias can arise not only in outputs but also in the persona generation process itself [18, 27]. Studies show that when LLMs invent role descriptions, they often embed stereotypes—for example, framing female teachers as nurturing and male teachers as authoritative [28]. Such upstream bias is underexplored but critical, as it can shape all downstream simulations. Our work contributes to this direction by testing whether the gender framing of a Psychologist prompt-giver influences the profiles it creates, showing that identity conditioning leaves measurable stylistic traces. This highlights the need for identity-aware prompting strategies to ensure that synthetic personas enrich LLM interactions without reinforcing stereotypes.

Unlike prior studies that analyze bias primarily in outputs or static personas, our study investigates bias at the point of profile creation. We show that gender framing of the Psychologist Agent leaves measurable traces in tone, sentiment, and perceived identity of expert-level profiles (Project Manager, Expert Agents, Final Decision Maker). This upstream perspective reveals how stereotypes can become embedded into the very scaffolding of multi-agent systems, rather than only surfacing in their downstream behaviors.

3. The Pool of Experts Framework

PoE [3] is a prompting architecture designed to emulate interdisciplinary reasoning through the orchestration of multiple role-specialized agents instantiated from a shared LLM. Unlike most prompting pipelines, which rely on monolithic or loosely contextualized prompts, PoE assigns each agent a distinct and explicitly defined identity. This is not merely role-labeling: each agent operates under a psychologically grounded identity profile that conditions its behavior throughout the reasoning process. This modular structure supports interpretability, behavioral diversity, and dynamic specialization, while remaining compatible with zero-shot LLM deployment.

At the start of PoE initialization, an LLM is tasked with generating the identity of the Psychologist Agent (PA). This agent is responsible for producing the identity descriptions of all other agents. This initial step may be guided by a selected identity description framework, such as the Big Five Personality Traits [29], which provides structured psychological principles for shaping the PA’s behavior and narrative tone. The PA is itself an LLM instance, prompted to adopt the role of a psychologist. Once instantiated, it proceeds to generate textual identity profiles for each downstream agent. Each profile encodes a set of cognitive, motivational, and communicative dispositions. For example, when instructed to describe a Project Manager agent according to the MBTI [30] framework, the PA might produce a structured paragraph emphasizing planning, organization, and interpersonal awareness. The generated identity description is then injected into the system prompt of the corresponding agent, effectively shaping its reasoning and expression style.

The agent hierarchy constructed by PoE follows a consistent pattern. After the PA is instantiated and generates the required personality descriptions, the system proceeds to instantiate a Project Manager agent (PM), whose role is to interpret the task and identify relevant domains of expertise. Based on this analysis, the PA then produces tailored identity profiles for each Expert Agent (EA), ensuring alignment with their respective fields (e.g., sociologist, linguistic expert) and the task context. Each EA is instantiated with the assigned identity and generates an independent, structured response. Finally, the PA also produces the identity profile for the Final Decision Maker (FDM), an agent tasked with synthesizing the EAs’ responses into a coherent and justified final answer.

Although all agents are derived from the same base LLM, their behavioral differentiation emerges from the identity prompts generated by the PA. This method allows PoE to simulate cognitive diversity and role specialization without altering the underlying model weights. The psychological framing of each agent serves as a behavioral scaffold, modulating how the model interprets and responds to information. The identity profile of each agent also acts as an inductive bias, shaping the model’s default assumptions, expectations, and expressive tendencies.
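The generation order described in this section (PA first, then PM, EAs, and FDM, with the PA identity used as the system prompt for all later profile requests) can be sketched as follows. This is an illustration under stated assumptions, not the PoE implementation: the `LLM` callable, the prompt wording, and the role keys are hypothetical, and the expert fields are passed in directly, whereas in PoE they are identified by the PM.

```python
from typing import Callable, Dict, List

# Hypothetical LLM interface: (system_prompt, user_prompt) -> generated text.
LLM = Callable[[str, str], str]


def run_poe_identity_pipeline(
    llm: LLM, framework: str, task: str, expert_fields: List[str]
) -> Dict[str, str]:
    """Generate identity profiles in the order PA -> PM -> EAs -> FDM."""
    profiles: Dict[str, str] = {}

    # Step 1: a plain LLM call produces the Psychologist Agent (PA) identity.
    # The prompt text here is a placeholder, not the prompt used in the study.
    profiles["psychologist"] = llm(
        "You are a psychologist.",
        f"Describe a Psychologist following the framework {framework}.",
    )

    # Step 2: the PA identity becomes the system prompt, and the PA writes
    # the profile of every downstream role in turn.
    pa_system_prompt = profiles["psychologist"]
    downstream_roles = (
        ["project-manager"]
        + [f"expert:{field}" for field in expert_fields]
        + ["final decision maker"]
    )
    for role in downstream_roles:
        profiles[role] = llm(
            pa_system_prompt,
            f"Describe the {role} for the task '{task}' following {framework}.",
        )
    return profiles


def echo_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the pipeline."""
    return f"[profile generated from: {user_prompt}]"


profiles = run_poe_identity_pipeline(
    echo_llm, "Big Five Personality Traits", "StrategyQA",
    ["sociologist", "linguistic expert"],
)
```

Because every downstream request shares the PA identity as its system prompt, any stylistic or gendered cue in that identity is present in the context of each subsequent generation.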

4. Experimental Design

In the present study, we repurpose the Pool of Experts (PoE) not as a tool for evaluating task performance, but as a controlled environment for probing the linguistic consequences of identity construction. Our focus is on the generative behavior of the PA itself: specifically, we investigate how gender conditioning affects the language it uses to describe other agents. To this end, we systematically vary three experimental dimensions: (i) the gender assigned to the PA, (ii) the identity description framework guiding the generation, and (iii) the backbone LLM employed.

Gender assigned to the Psychologist Agent. The generation of this agent is conditioned along four categories: unconstrained (no gender specified), male, female, or non-binary. Listing 1 shows the system and user prompt templates used to instantiate the PA under each gender condition.

Listing 1: System and User prompt template used to generate the PA identity profile.

SYSTEM: You are a psychologist, a highly skilled and knowledgeable expert in your field. Your task is to create a detailed and authentic description of a {gender constraint} **Psychologist** who is responsible for generating description of a person. Use the description framework {identity description framework} as a guide to describe the Psychologist.

USER: Now it’s your turn. Create a {gender constraint} **Psychologist** description that strictly follows the description framework {identity description framework}.
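A minimal sketch of how the two templates in Listing 1 could be filled for a given condition is shown below. The placeholder names `gender` and `framework` are shortened for valid Python formatting, and the handling of the unconstrained condition (leaving the gender slot empty and collapsing the resulting gap) is an assumption; the text does not state how that case was rendered.

```python
from typing import Dict, Optional

SYSTEM_TEMPLATE = (
    "You are a psychologist, a highly skilled and knowledgeable expert in your "
    "field. Your task is to create a detailed and authentic description of a "
    "{gender} **Psychologist** who is responsible for generating description "
    "of a person. Use the description framework {framework} as a guide to "
    "describe the Psychologist."
)
USER_TEMPLATE = (
    "Now it's your turn. Create a {gender} **Psychologist** description that "
    "strictly follows the description framework {framework}."
)


def build_pa_prompts(gender: Optional[str], framework: str) -> Dict[str, str]:
    """Fill the system and user templates for one gender/framework condition."""
    # ASSUMPTION: the unconstrained condition leaves the gender slot empty.
    gender_text = gender or ""
    prompts: Dict[str, str] = {}
    for name, template in (("system", SYSTEM_TEMPLATE), ("user", USER_TEMPLATE)):
        filled = template.format(gender=gender_text, framework=framework)
        # Collapse the double space left behind by an empty gender slot.
        prompts[name] = " ".join(filled.split())
    return prompts


constrained = build_pa_prompts("non-binary", "Flow Theory")
unconstrained = build_pa_prompts(None, "Flow Theory")
```

Note that the gender constraint and the framework name are the only elements that vary between conditions; everything else in the prompt is held fixed.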

Identity Description Framework. For the second dimension, we draw on thirteen psychologically inspired identity description frameworks that reflect diverse theoretical approaches to modeling personality, cognition, behavior, and user interaction. Trait-oriented models such as the Big Five Personality Traits [29] and the Myers-Briggs Type Indicator [30] capture relatively stable personality characteristics, while psychodynamic and developmental perspectives such as Freudian Psychoanalysis [31] and Erikson’s Psychosocial Stages [32] emphasize unconscious processes and lifespan development. Cognitive theories, including Cognitive Behavioral Theory [33], Cognitive Load Theory [34], and Dual-Process Theory [35], focus on how individuals acquire, manage, and use information to make decisions. Frameworks such as Social Cognitive Theory [36] and Flow Theory [37] highlight the influence of motivation, environment, and experiential states on behavior. To complement these perspectives, we also consider user-centered approaches including User Design Persona [38], User-Centered Design [39], and Mental Models [40], which foreground human–system interaction and usability. Finally, we include the Enneagram of Personality Traits [41], a non-scientific yet widely used typology in organizational and business contexts, where it serves as a tool for interpreting interpersonal dynamics.

Backbone Models. Identity profile generation is carried out using thirteen LLMs: claude-3.7-sonnet, claude-sonnet-4, gemma3-12b, gemma-3-27b-it, gemma3-4b, llama-3.1-8b-instruct, llama-3.2-3b-instruct, llama-3.3-70b-instruct, llama-4-scout, mistral-nemo, nova-micro-v1, qwen3-32b, and qwen3-8b. All models were queried via the OpenRouter service (see footnote 1). Generations were performed with fixed decoding parameters and deterministic seeds to ensure reproducibility. For each model, we generated identity descriptions across the full cross-product of gender conditions, identity description frameworks, and PoE agent roles. By holding the framework and model constant while varying only the PA’s gender, we obtain a rich corpus of role descriptions that enables systematic analysis of stylistic, affective, and identity-related textual properties.

Datasets. To provide variability in the identity descriptions, we generate profiles with reference to five representative datasets. Social Support [42] emphasizes empathy and interpersonal understanding, CommonSenseQA [43] targets everyday commonsense inference, StrategyQA [44] requires multi-hop strategic reasoning, Social IQa [45] captures intuitive judgments about social dynamics, and Last Letter Concat [46] provides a synthetic benchmark for symbolic manipulation. Each dataset serves as a task-conditioning signal: the same role, such as the PM, receives different identity descriptions in StrategyQA and Social IQa, even under identical framework and gender conditions. For each combination of conditions, identity profiles are generated for all roles in the PoE hierarchy.
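The experimental grid spanned by the four factors named above can be enumerated with a simple cross-product, sketched below. The factor levels are taken from the text; the function name and tuple layout are illustrative. The count refers to generation configurations only: each configuration yields profiles for several roles, and the number of Expert Agents varies, so it is not the number of descriptions in the corpus.

```python
from itertools import product
from typing import Iterator, Tuple

GENDER_CONDITIONS = ["unconstrained", "male", "female", "non-binary"]
FRAMEWORKS = [
    "Big Five Personality Traits", "Myers-Briggs Type Indicator",
    "Freudian Psychoanalysis", "Erikson's Psychosocial Stages",
    "Cognitive Behavioral Theory", "Cognitive Load Theory",
    "Dual-Process Theory", "Social Cognitive Theory", "Flow Theory",
    "User Design Persona", "User-Centered Design", "Mental Models",
    "Enneagram of Personality Traits",
]
MODELS = [
    "claude-3.7-sonnet", "claude-sonnet-4", "gemma3-12b", "gemma-3-27b-it",
    "gemma3-4b", "llama-3.1-8b-instruct", "llama-3.2-3b-instruct",
    "llama-3.3-70b-instruct", "llama-4-scout", "mistral-nemo",
    "nova-micro-v1", "qwen3-32b", "qwen3-8b",
]
DATASETS = [
    "Social Support", "CommonSenseQA", "StrategyQA", "Social IQa",
    "Last Letter Concat",
]


def iter_configurations() -> Iterator[Tuple[str, str, str, str]]:
    """Yield every (gender, framework, model, dataset) combination."""
    yield from product(GENDER_CONDITIONS, FRAMEWORKS, MODELS, DATASETS)


configurations = list(iter_configurations())
# 4 gender conditions x 13 frameworks x 13 models x 5 datasets
n_configurations = len(configurations)
```

Enumerating the grid explicitly makes it straightforward to hold framework, model, and dataset fixed while comparing only the four gender conditions.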

Experimental Pipeline. The experimental pipeline is as follows: first, the identity of the PA is generated (Figure 1a); then, the PA generates the identity for the PM agent (Figure 1b), for the EAs (Figure 1c), and for the FDM (Figure 1d).

Importantly, the number of EAs may vary, as PoE dynamically recruits agents based on task demands. This variability depends on the framework, the dataset, and the LLM. However, the architectural dynamics related to expert selection fall outside the scope of this paper.

After generating the identity profiles, we performed an automatic classification of the perceived gender of each profile. To this end, we adopted an LLM-based strategy, using gpt-4o-mini to assign each profile to one of four categories: male, female, non-binary, or uncertain. We opted for an LLM-based approach rather than a lexicon-based or rule-driven method because the generated texts often encode gender implicitly through stylistic and semantic cues rather than through explicit lexical markers, making traditional heuristics insufficiently reliable. We acknowledge the limitation posed by the methodological circularity of employing LLMs both for profile generation and for their subsequent evaluation, as this setup may amplify model-specific biases. In future studies, we will work to mitigate this issue by conducting controlled experiments with multi-annotator human ratings to assess and calibrate the LLM-based gender guesser.
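Whatever model performs the classification, its free-text answers have to be mapped onto the four categories before distributions can be computed. The sketch below shows one way to do this; the instruction string is hypothetical (the classifier prompt is not given in the text), and treating any unrecognized answer as "uncertain" is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable

ALLOWED_LABELS = ("male", "female", "non-binary", "uncertain")

# Hypothetical instruction; not the classifier prompt used in the study.
CLASSIFIER_INSTRUCTION = (
    "Read the identity profile and answer with exactly one word: "
    "male, female, non-binary, or uncertain."
)


def normalize_gender_label(raw_answer: str) -> str:
    """Map a raw classifier answer onto one of the four categories."""
    cleaned = raw_answer.strip().strip(".\"'").lower()
    cleaned = cleaned.replace("_", "-").replace(" ", "-")
    if cleaned == "nonbinary":
        cleaned = "non-binary"
    # ASSUMPTION: anything that is not an exact category becomes "uncertain".
    return cleaned if cleaned in ALLOWED_LABELS else "uncertain"


def gender_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the proportion of each category among the given answers."""
    counts = Counter(normalize_gender_label(label) for label in labels)
    total = sum(counts.values())
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ALLOWED_LABELS
    }
```

Exact matching is deliberately strict: a substring test would, for instance, find "male" inside "female", which is why unrecognized answers fall back to "uncertain" rather than being guessed.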

Linguistic analyses. To assess how gender conditioning influences the generated profiles, we conduct a comprehensive linguistic analysis of the corpus. We examine several stylistic and semantic dimensions that capture both surface-level variation and deeper narrative framing. Our analysis focuses on four key aspects: classified gender, sentiment polarity, subjectivity, and the lexical richness of adjectives (see footnote 2). Sentiment analysis provides a measure of affective tone in the text. Within this, we extract two complementary metrics: polarity, which reflects the emotional valence of a description (ranging from negative to positive), and subjectivity, which indicates the extent to which a passage conveys opinions or internal states as opposed to objective information. These features allow us to quantify whether different gender framings result in more emotionally expressive or more neutral agent portrayals.

Adjective Diversity and Coverage by Gender. To investigate whether gender framing influences lexical style, we focused specifically on adjectives in the generated role descriptions. Adjectives are a key marker of descriptive richness, making them suitable for detecting subtle stylistic variation and potential bias. From the extracted adjective sets, we computed three complementary metrics for each gender and dataset: (i) the adjective rate, i.e., the proportion of descriptions in which an adjective appears; (ii) the number of unique adjectives per description (avg), which reflects local richness; and (iii) the vocabulary coverage (cov), defined as the percentage of the total adjective vocabulary used by each group. We also calculated a normalized rate of unique adjectives per description (rate_unique) to account for differences in dataset size. These measures together provide a structured view of how much descriptive variety different gender framings allow or restrict.

Footnote 1: https://openrouter.ai
Footnote 2: All sentiment, polarity, and subjectivity scores were obtained using the TextBlob library, https://textblob.readthedocs.io
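The three adjective metrics (adjective rate, avg, and cov) can be sketched as below, assuming adjectives have already been extracted into one set per description by an upstream part-of-speech tagger that is not shown. The input layout and the toy adjective sets are illustrative and are not data from the corpus. The rate_unique variant is omitted because its exact normalization is not specified.

```python
from typing import Dict, List, Set


def adjective_metrics(
    groups: Dict[str, List[Set[str]]]
) -> Dict[str, Dict[str, float]]:
    """Compute adjective rate, avg unique adjectives, and coverage per group.

    `groups` maps a perceived-gender label to a list with one adjective set
    per description.
    """
    # Total adjective vocabulary across all groups (denominator of coverage).
    vocabulary: Set[str] = set()
    for descriptions in groups.values():
        for adjectives in descriptions:
            vocabulary |= adjectives

    results: Dict[str, Dict[str, float]] = {}
    for label, descriptions in groups.items():
        n = len(descriptions)
        used = set().union(*descriptions) if descriptions else set()
        results[label] = {
            # (i) share of descriptions containing at least one adjective
            "rate": sum(1 for a in descriptions if a) / n if n else 0.0,
            # (ii) mean number of unique adjectives per description
            "avg": sum(len(a) for a in descriptions) / n if n else 0.0,
            # (iii) percentage of the total vocabulary used by this group
            "cov": 100.0 * len(used) / len(vocabulary) if vocabulary else 0.0,
        }
    return results


# Toy input for illustration only.
toy_groups = {
    "male": [{"decisive", "analytical"}, {"decisive"}],
    "female": [{"empathetic"}, set()],
}
toy_metrics = adjective_metrics(toy_groups)
```

Note that avg and cov capture different things: a group can have a high per-description average while drawing on a narrow slice of the overall vocabulary, which is the pattern the adjective analysis later describes as repetition within a narrower adjective space.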

[Table: Linguistic analysis of identity descriptions grouped by framework. The table reports the inferred gender distributions (male, female, non-binary, uncertain), sentiment distribution (positive, neutral, negative), and average polarity and subjectivity scores. Results are aggregated across all roles, datasets, and models. Frameworks are ordered within each PA-gender condition.]

5. Discussion

[...] along three analytical axes: gender distribution (inferred gender proportions), sentiment (positive/neutral/negative), subjectivity score, and polarity intensity. We grouped the results by conditioning variable (e.g., by model or by role) and aggregated across datasets. This allows us to highlight systematic trends that are otherwise obscured by the large number of individual entries. For example, grouping by role reveals how linguistic subjectivity differs between psychologists and experts, while grouping by model clarifies divergent behaviors across model families such as Claude and LLaMA. These summaries are intended to highlight key patterns and contrasts across conditions (see footnote 3).

[Table: Linguistic analysis of identity descriptions grouped by model. The table reports the inferred gender distributions, sentiment distribution, and average polarity and subjectivity scores. Results are aggregated across all roles, datasets, and frameworks. Models are ordered within each PA-gender condition.]

Global Gender Distribution.

As a starting point, we examine the inferred gender distribution aggregated across all generation conditions. This provides a high-level view of how gender conditioning influences the overall representation of gendered identity in the resulting profiles. Across identity description frameworks, the generated profiles are predominantly classified as female. For example, in the global rows of Table 6, the Flow Theory framework produces the highest proportion of female classifications, while the remaining frameworks consistently generate slightly above 50% female profiles. Interestingly, all frameworks generate only around 20% male profiles and approximately 10% non-binary [...]

Footnote 3: The complete set of results, including all combinations and disaggregated values, is provided in the Appendix.

Table 3: Linguistic analysis of identity descriptions grouped by role. The table reports the inferred gender distributions (male, female, non-binary, uncertain), sentiment distribution (positive, neutral, negative), and average polarity and subjectivity scores. Results are aggregated across all models, datasets, and frameworks. Roles are ordered within each PA-gender condition. The values below are the first column of the gender distribution (inferred male proportion).

PA gender        Role                    Male
global           psychologist            0.295
global           project-manager         0.289
global           expert agent            0.272
global           final decision maker    0.257
female           psychologist            0.000
female           project-manager         0.392
female           expert agent            0.467
female           final decision maker    0.371
male             psychologist            0.974
male             project-manager         0.275
male             expert agent            0.182
male             final decision maker    0.232
non-binary       psychologist            0.015
non-binary       project-manager         0.133
non-binary       expert agent            0.105
non-binary       final decision maker    0.098
unconstrained    psychologist            0.192
unconstrained    project-manager         0.357
unconstrained    expert agent            0.333
unconstrained    final decision maker    0.326

[...] more neutral profiles with lower average polarity values. Subjectivity varies within a narrower band, peaking in the Big Five and reaching its lowest value in Mental Models, as shown in Table 6.

Effect of Gender Conditioning. Analyzing the four gender experimental conditions, we observe that the gender setting of the PA significantly influences both the stylistic tone and the inferred gender attribution of the generated profiles.

When the PA is conditioned as female, the resulting descriptions exhibit stronger female alignment overall, particularly under frameworks such as Flow Theory and User-Centered Design. In these cases, non-binary and uncertain classifications remain consistently low. Additionally, we observe a reduction in the proportion of profiles classified as uncertain, accompanied by a slight increase in those classified as male, suggesting that gender conditioning enhances identity specificity. Sentiment trends under the female condition are predominantly positive. Frameworks such as the Big Five, Social Cognitive Theory, and the Enneagram of Personality Traits produce over 75% of positively classified descriptions. However, the highest average polarity and subjectivity scores are observed only in the Big Five framework, highlighting its stronger affective and subjective tone relative to others.

When the Psychologist is set to male, an unexpected trend emerges: female alignment increases even further across most frameworks. In several cases, such as Cognitive Load Theory and Cognitive Behavioral Theory, the proportion of profiles classified as female exceeds that observed under the female condition. Sentiment remains predominantly positive across frameworks, with the Big Five and Enneagram of Personality Traits exhibiting the highest rates of positive classifications. However, both polarity and subjectivity scores are slightly lower than those observed in the female condition, suggesting a more emotionally neutral and less personalized tone in the generated profiles.

Under the non-binary setting, non-binary identity attribution increases markedly. Several frameworks show non-binary percentages above 30% (e.g., Social Cognitive Theory), while male and female labels remain more balanced. Sentiment remains generally positive, with the Big Five and Enneagram of Personality Traits showing the highest share of positive classifications. Polarity remains relatively stable and comparable to the other experimental conditions.

In the unconstrained setting, where no gender is assigned to the PA, female identity emerges as the dominant classification across most frameworks. Two notable exceptions are the Big Five and Myers-Briggs Type Indicator, which display a slight preference toward male classifications. In this condition, non-binary identities drop dramatically, averaging around 3%. Positive sentiment remains most prominent in the Big Five and Enneagram of Personality Traits frameworks, with the Big Five again exhibiting the highest subjectivity. In contrast, cognitively-oriented frameworks such as Erikson’s Psychosocial Stages yield more emotionally neutral profiles, characterized by low polarity scores and a high proportion of neutral classifications.

Overall, all four conditions maintain consistent rankings: Big Five and Enneagram of Personality Traits remain the most positive and subjective; Cognitive Load Theory and Mental Models trend neutral and analytic; Flow Theory and User-Centered Design consistently yield more female-coded outputs. The non-binary setting uniquely supports broader identity diversity across roles.

Model-Level Patterns. Aggregating by model reveals distinct stylistic tendencies (Table 7). The LLaMA-family models consistently produce the warmest descriptions, characterized by the highest levels of positive sentiment, polarity, and subjectivity. In contrast, the Qwen family exhibits a flatter affective profile, generating more emotionally neutral text. The Claude models fall in between these two extremes, offering moderately expressive outputs. In terms of inferred gender distributions, we observe similar trends to those seen across description frameworks: most generated identity profiles are classified as female. However, notable variations emerge across models. Nova-Micro and the Qwen models produce the lowest proportion of female identities, showing a relative preference for male descriptions. Conversely, the LLaMA-family models skew strongly toward female identity generation. Interestingly, a closer examination of the Claude Sonnet models reveals divergent behavior: version 3.7 shows a marked bias toward female identities, while this tendency is substantially attenuated in version 4, indicating a shift toward more balanced gender representations.

Role-Level Patterns. Grouping by role (Table 3) reveals clear stylistic differences across the PoE hierarchy. Descriptions of the Project Manager tend to be the most positively valenced, exhibiting the highest polarity scores. Expert Agents are slightly less positive. The Final Decision Maker profiles display slightly higher subjectivity while maintaining a comparable level of positive sentiment. Psychologist descriptions are moderately subjective, falling between the Expert and Decision Maker roles in affective tone. Across roles, negative sentiment remains negligible. Notably, under the non-binary Psychologist condition, perceived non-binary identity increases and partially propagates to downstream roles: for instance, Project Manager profiles show 18.9% non-binary classification, and Final Decision Maker profiles reach 25.6%. This trend underscores a key asymmetry: only when the PA is explicitly conditioned as non-binary do downstream roles exhibit meaningful non-binary identity, whereas in all other gender conditions, non-binary representations remain minimal (approximately 3%).

Statistical Analyses. To assess whether these patterns are statistically reliable, we ran a series of nonparametric tests. For polarity and subjectivity, we used the Kruskal–Wallis test to compare rank distributions across the levels of each factor. We conducted these tests both globally across the full corpus and separately within each Psychologist-gender setting (unconstrained, male, female, non-binary). For perceived gender (male/female/non-binary/uncertain) and sentiment class (positive/neutral/negative), we treated both as contingency problems and applied Pearson’s χ² tests of independence at multiple levels of aggregation. When a significant effect was found, we conducted pairwise Wilcoxon rank-sum (Mann–Whitney) tests between all levels of the factor under study (e.g., across models, frameworks, or roles), applying Bonferroni correction to control the family-wise error rate. Similarly, for sentiment and perceived gender, we followed significant χ² tests with pairwise χ² comparisons between levels of the same factor, again using Bonferroni correction to adjust for multiple comparisons.
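To make the contingency-table part of this procedure concrete, the sketch below computes Pearson's χ² statistic and its degrees of freedom for a table of counts, enumerates the pairwise comparisons between factor levels, and derives the Bonferroni-adjusted significance threshold. It is a plain-Python illustration, not the analysis code: the toy counts are invented, the base significance level of 0.05 is an assumption, and no continuity correction is applied. Obtaining p-values requires the χ² distribution from a statistics library; SciPy's `scipy.stats.chi2_contingency`, `kruskal`, and `mannwhitneyu` are common choices, though the software used in the study is not stated.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def chi_square_statistic(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return Pearson's chi-squared statistic and degrees of freedom.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def pairwise_levels(levels: Sequence[str]) -> List[Tuple[str, str]]:
    """List every unordered pair of factor levels for post-hoc comparison."""
    return list(combinations(levels, 2))


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_comparisons


ROLES = ["psychologist", "project-manager", "expert agent", "final decision maker"]
role_pairs = pairwise_levels(ROLES)
# ASSUMPTION: alpha = 0.05 is illustrative; the text does not state the level.
adjusted_alpha = bonferroni_alpha(0.05, len(role_pairs))

# Toy 2x2 table (rows: two factor levels, columns: two classes); invented counts.
toy_statistic, toy_dof = chi_square_statistic([[10, 20], [20, 10]])
```

With four roles there are six pairwise comparisons, so each post-hoc test is judged against a threshold six times stricter than the base level; with thirteen frameworks or thirteen models the number of pairs rises to 78, which makes the correction considerably more conservative.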

Across description frameworks, base models, and roles, Kruskal–Wallis tests were highly significant in all configurations, indicating systematic differences in both polarity and subjectivity that do not reduce to sampling noise. χ² tests were significant both globally and within each Psychologist-gender condition across all three factors: identity description framework, model, and role. This finding indicates that sentiment and perceived gender distributions vary systematically rather than randomly. After Bonferroni correction, most pairwise comparisons remain statistically significant. For example, the χ² tests confirm that different description frameworks yield significantly divergent gender distributions in the generated agent profiles, indicating that the stylistic scaffolding imposed by each framework systematically influences perceived identity. In this setting, the Big Five Personality Traits framework diverges significantly from nearly all others. It elicits strongly evaluative, trait-laden language that tends to carry gendered connotations.
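A Pearson test of independence of the kind reported here can be illustrated with a small contingency table. The sketch below is an assumption-laden example, not the study's code: the counts are invented, the helper names are hypothetical, and no continuity correction is applied (note that `scipy.stats.chi2_contingency` applies Yates' correction to 2×2 tables by default, so its result on such tables would differ).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return min(1.0, total)
    total = math.erfc(math.sqrt(half))
    for k in range(1, df, 2):
        total += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
    return min(1.0, total)


def chi2_independence(table):
    """Pearson chi-squared test of independence (no continuity correction).

    `table` is a list of rows of observed counts; every row and column is
    assumed to contain at least one observation.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts: rows are two description frameworks, columns are
# perceived gender classes (male, female, non-binary, uncertain).
counts = [
    [30, 50, 5, 15],
    [20, 70, 3, 7],
]
stat, df, p = chi2_independence(counts)
```

For the pairwise follow-up, the same function would be applied to each two-row sub-table, with the resulting p-values multiplied by the number of pairs (Bonferroni).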

Adjective Analysis. Table 4 summarizes results across four experimental conditions. In the unconstrained dataset, male outputs are clearly more lexically rich (51.7 unique adjectives on average, 86.5% coverage) than female ones (39.1, 83.1%), while non-binary and uncertain outputs show far lower richness and coverage. When the Psychologist is constrained to be female, male descriptions become even more elaborate (52.1 avg, 89.4% coverage), whereas female and non-binary outputs lose both richness and breadth. This reveals an asymmetry where female framing enhances male lexical diversity while suppressing other groups. By contrast, under a male Psychologist, female outputs gain sharply in richness (47.0 avg, 90.4%) and surpass male ones (42.0 avg, 75.8%), indicating a reversed bias pattern. The non-binary Psychologist condition yields the highest averages overall, with male descriptions again leading (53.2 avg), but with uneven coverage (66.0%), suggesting repetition within a narrower adjective space. Across all conditions, uncertain-gender outputs remain the least lexically diverse.

Overall, these analyses show that the gender identity of the profile-generating Psychologist Agent systematically shapes the diversity of adjectives used by downstream role profiles. Importantly, the observed shifts are not uniform: male profiles benefit disproportionately when the Psychologist is female or non-binary, whereas female profiles only gain richness under a male Psychologist. This asymmetric redistribution of adjective usage is a clear signal of bias in persona construction, highlighting how upstream identity cues condition not only direct outputs but also the expressive range available to other groups.
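To make the two adjective measures concrete, the sketch below computes them from pre-extracted adjective lists. The definitions are assumptions made for illustration and may differ from those used in the study: "average unique adjectives" is taken as the mean number of distinct adjectives per description, and "coverage" as the share of the condition's overall adjective vocabulary that a gender group uses. The function name, the toy records, and the upstream part-of-speech tagging step are all hypothetical.

```python
from collections import defaultdict


def adjective_stats(records):
    """records: iterable of (group, adjectives) pairs, one per description."""
    sets_by_group = defaultdict(list)
    for group, adjectives in records:
        # Lower-case so that "Calm" and "calm" count as one adjective.
        sets_by_group[group].append({a.lower() for a in adjectives})

    # Vocabulary of the whole condition, pooled over all groups.
    overall_vocab = set()
    for sets in sets_by_group.values():
        for s in sets:
            overall_vocab |= s

    stats = {}
    for group, sets in sets_by_group.items():
        group_vocab = set().union(*sets)
        stats[group] = {
            # Assumed definition: mean count of distinct adjectives per description.
            "avg_unique": sum(len(s) for s in sets) / len(sets),
            # Assumed definition: fraction of the pooled vocabulary used by this group.
            "coverage": len(group_vocab) / len(overall_vocab) if overall_vocab else 0.0,
        }
    return stats


# Made-up records; in a real pipeline the adjective lists would come from a POS tagger.
records = [
    ("male", ["analytical", "decisive", "calm"]),
    ("male", ["analytical", "rigorous"]),
    ("female", ["empathetic", "calm"]),
]
stats = adjective_stats(records)
```

Under these assumed definitions a group can have a high average but low coverage when its descriptions repeat the same adjectives.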

In conclusion, the answer to our research question is affirmative but qualified: the gender assigned to the Psychologist Agent does shape the identity and stylistic properties of generated profiles, but its influence is mediated by the identity description framework and model. Non-binary cues propagate most effectively but introduce ambiguity; male and female cues leak asymmetrically; and framework–role–model combinations can amplify or suppress the intended signal. Understanding these interactions is essential for responsible use of identity-aware prompting in LLM-based systems.

6. Conclusion

This study set out to investigate whether, and how, the gender assigned to the Psychologist Agent influences the identity, tone, and stance of the generated role descriptions. Our findings reveal a layered interplay between gender conditioning, identity description framework, and agent role. On top of the role scaffold, the identity description frameworks influence tone and stance. People- and trait-oriented frameworks, such as the Big Five, consistently elicit warmer, more subjective, and highly positive descriptions. In contrast, cognitively-oriented frameworks such as Cognitive Load Theory promote neutral phrasing. These findings demonstrate that the chosen identity description framework sets the expressive bandwidth within which identity is realized.

Gender conditioning introduces a second axis of variation. When the PA’s gender is left unspecified, generated profiles tend to skew female across roles and frameworks, suggesting a possible latent female-coded prior embedded within the model weights or the generation schema. Interestingly, this leakage is asymmetric and model-family dependent: male conditioning sometimes fails to override default female-coded patterns (e.g., Big Five). These results suggest that model priors and framework-specific stylistic constraints can override explicit gender cues, even under tightly controlled prompting. Non-binary identity follows a different pattern. When the PA is conditioned as non-binary, we observe the strongest and most consistent propagation of non-binary alignment to downstream roles. However, it also increases interpretive instability. This indicates that identity cues are not simply passed downstream, but shaped through a process of interaction between gender markers, framework constraints, role semantics, and model-specific expressive preferences. These dynamics result in emergent biases. For example, male-coded identities often align with low subjectivity, which may unintentionally reinforce social associations between masculinity and objectivity. Across the corpus, such tendencies accumulate: even when prompts are balanced and neutral in form, their interaction with model and framework biases can give rise to identity–role couplings that reflect familiar sociocultural archetypes.

Model families add a final layer of influence. Certain LLMs, such as Claude 4, favor subjective and positive expression, which strengthens identity alignment, especially when combined with socially expressive frameworks. Conversely, models like Qwen3-32B tend toward neutral or flattened affect, weakening the salience of identity cues and increasing the likelihood of gender drift or ambiguity.

Taken together, these findings highlight a core insight: identity-aware prompting is a system-level phenomenon. The gender assigned to the PA influences the generation of downstream profiles, but not deterministically. Effective identity control emerges only when gender conditioning, identity framework, role semantics, and LLM behavior are aligned. Even in well-calibrated settings, drift, leakage, and model-level priors remain active forces that shape the generative outcome.

These results point to broader design implications. Neutrality cannot be assumed at any point in the generation pipeline. Even well-structured and symmetrical prompts can produce skewed or stereotypical profiles if their constituent parts are not jointly calibrated.

Acknowledgements

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This paper is supported by the TrustAlert project, funded by Fondazione Compagnia San Paolo and Fondazione CDP under the “Artificial Intelligence” call.

Declaration on Generative AI

During the preparation of this work, the authors used Grammarly for grammar and spelling checks. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

In this appendix, we provide illustrative examples of identity profiles in Appendix A, along with the complete result tables in Appendix B.

A. Examples of Identity Profiles

This section presents illustrative examples of generated identity descriptions.

PA Gender constraint: Male; Inferred Gender: non-binary;

Role: Psychologist; Model: nova-micro-v1; Framework: Cognitive Behavioral Theory Identity description: The non-binary Psychologist, Alex, operates within the framework of Cognitive Behavioral Theory (CBT), focusing on the interplay between thoughts, emotions, and behaviors. Alex is committed to understanding how these elements influence each other and how they contribute to the client’s psychological well-being. Alex’s approach is grounded in the belief that by identifying and challenging maladaptive thought patterns, clients can develop healthier emotional responses and more adaptive behaviors. Alex employs structured sessions that begin with the identification of specific problems and the development of a collaborative therapeutic relationship. The initial phase involves psychoeducation, where Alex explains the CBT model to the client, emphasizing the connection between thoughts, feelings, and actions. This is followed by the identification of the client’s cognitive distortions and maladaptive beliefs, which are often the root of their emotional distress. During therapy, Alex uses techniques such as cognitive restructuring to help the client recognize and challenge these distortions. For example, Alex might help a client who believes they are worthless to examine the evidence for and against this belief, ultimately leading to a more balanced and realistic perspective. Alex also incorporates behavioral strategies, such as exposure therapy for clients with phobias or anxiety disorders, and encourages the practice of new, healthier behaviors. Homework assignments are common, designed to reinforce the skills learned in sessions and to provide opportunities for the client to apply these strategies in real-life situations. Throughout the therapeutic process, Alex maintains a flexible and adaptive approach, tailoring interventions to the unique needs and context of each client. 
Regular feedback and adjustments are made to ensure that the therapy remains relevant and effective. Alex’s therapeutic style is characterized by warmth, empathy, and a non-judgmental attitude, creating a safe space for clients to explore their thoughts and feelings without fear of criticism. This supportive environment fosters trust and encourages clients to engage more deeply in their own healing process. In addition to individual therapy, Alex also offers group sessions where clients can share experiences and learn from one another. These groups provide a sense of community and mutual support, further reinforcing the CBT principles of cognitive and behavioral change. Alex’s work is informed by a commitment to ongoing professional development and a dedication to integrating the latest research findings into practice. This ensures that the therapeutic interventions provided are evidence-based and effective. Ultimately, Alex’s goal is to empower clients to become active participants in their own therapeutic process, equipping them with the tools and strategies needed to manage their thoughts, emotions, and behaviors in a more adaptive and fulfilling way.

PA Gender constraint: Female; Inferred Gender: female;

Role: Project Manager; Model: gemma-3-27b-it ; Framework: Dual-Process Theory Identity description: Ms. Eleanor Vance presents initially as a pragmatic, mid-career professional, appearing approximately 42 years of age. Her attire is consistently business casual – tailored trousers or skirts, simple blouses, and cardigans in neutral tones. There’s a deliberate lack of ostentation; her focus is demonstrably not on self-presentation. A System 1 assessment might categorize her as efficient, perhaps even a little reserved. However, this is a carefully constructed facade. System 2 analysis reveals a highly structured approach to information gathering and assessment. The clothing, while unassuming, is chosen for practicality and comfort during extended periods of analytical work. Subtle observations – the precise angle of her gaze, the almost imperceptible adjustments in posture – indicate constant cognitive processing. She exhibits a measured emotional tone, not an absence of feeling, but a controlled emotional range enabling objective evaluation. Her conversational style reflects this duality. System 1 perceives a deliberate pace, frequent pauses, and a preference for open-ended inquiries. She avoids direct questioning, opting instead for prompts designed to encourage expansive responses. These pauses, however, aren’t indicative of indecision; they are strategic opportunities for active listening, observing micro-expressions, and formulating targeted follow-up questions. She favors phrasing such as “Could you elaborate on your approach to. . . ?” or “What factors led you to consider. . . ?” rather than leading statements. Ms. Vance’s expertise lies in identifying and assembling the optimal team for complex problem-solving, not in solving the problems herself. In the context of a project focused on evaluating pre-trained models for complex question answering, her initial assessment would not be to consider model architectures directly. 
Instead, she would decompose the task into its constituent cognitive components. She is acutely aware of the cognitive biases inherent in initial problem framing – the tendency to focus on readily available solutions, the anchoring effect of initial assumptions – and actively mitigates these biases. Her methodology is rigorously systematic. She breaks down the challenge – assessing models on context-free question answering, multi-step reasoning, implicit reasoning, and logical reasoning – into discrete areas of expertise. This involves identifying the core skills required: formal logic, computational linguistics, cognitive psychology (specifically, models of human reasoning), knowledge representation, and potentially even philosophy of language. She doesn’t simply seek individuals *proficient* in these fields, but assesses their specific approaches and methodologies. Her selection criteria prioritize a diversity of perspectives. She actively seeks individuals with differing theoretical frameworks and methodological approaches within each field. For example, within computational linguistics, she might seek experts specializing in both symbolic and connectionist approaches. Within cognitive psychology, she’d consider researchers focusing on both normative and descriptive models of reasoning. She employs a ‘pattern-matching’ approach, comparing candidate profiles against a mental database of expertise, identifying potential synergies and areas of complementary skill. Her System 2 processing extends to anticipating potential points of friction between disciplines. She recognizes that a logician’s perspective on ‘reasoning’ might differ significantly from a cognitive psychologist’s. She proactively seeks individuals capable of bridging these conceptual gaps. Her assessments are rarely definitive; she frames recommendations in terms of probabilities and potential contributions, acknowledging the inherent uncertainty in predicting team dynamics. 
She’d articulate her reasoning as, “Based on their published work, Dr. Chen’s expertise in knowledge graph construction appears highly relevant, with an estimated 70% probability of contributing significantly to the task of identifying implicit reasoning steps.” She actively solicits disconfirming evidence, challenging her own assumptions and remaining open to alternative interpretations. Her empathy is primarily cognitive – a striving to understand each expert’s internal frame of reference – maintained with a deliberate professional distance.

PA Gender constraint: Unconstrained; Inferred Gender: female;

Role: Psychologist; Model: nova-micro-v1; Framework: User Design Persona Identity description: The Psychologist, Dr. Emily Carter, is a seasoned professional in the field of clinical psychology with a specialization in personality assessment and behavioral analysis. She is 45 years old, holds a Ph.D. in Clinical Psychology from Stanford University, and has over 20 years of experience in both academic and clinical settings. Dr. Carter is known for her empathetic approach and her ability to create a safe, non-judgmental space for her clients. She employs a combination of traditional psychological assessments and modern analytical techniques to understand and describe individuals’ psychological profiles. Dr. Carter’s professional background includes working in various mental health institutions, including a large urban hospital and a private practice. She has conducted extensive research on the psychological underpinnings of personality traits and has published several papers in peer-reviewed journals. Her work often involves collaborating with other mental health professionals to provide comprehensive care for patients. Her personal life is equally rich and balanced. She has a husband and two children, and she enjoys spending time with her family. Outside of her professional life, Dr. Carter is a passionate advocate for mental health awareness and frequently participates in community outreach programs to educate the public about psychological health. Dr. Carter is highly skilled in various psychological assessment tools, including the Minnesota Multiphasic Personality Inventory (MMPI), the Myers-Briggs Type Indicator (MBTI), and the Big Five personality traits model. She is adept at interpreting these assessments to generate detailed and authentic descriptions of individuals, providing insights into their behavioral patterns, emotional responses, and interpersonal dynamics. 
Her methodology is grounded in a humanistic approach, emphasizing the importance of individual experiences and personal growth. She believes that understanding one’s psychological profile can lead to greater self-awareness and personal development. Dr. Carter’s descriptions are thorough, nuanced, and tailored to the unique context of each individual, offering a comprehensive view of their psychological makeup. In her practice, Dr. Carter uses a combination of qualitative and quantitative data to create detailed psychological profiles. She meticulously analyzes test results, interviews, and observational data to construct a holistic picture of each person. Her descriptions are not only accurate but also empathetic, aiming to respect and honor the individual’s subjective experiences. Dr. Carter’s professional demeanor is characterized by her calm, composed, and approachable nature. She is known for her excellent listening skills and her ability to build strong therapeutic relationships. Her clients often describe her as someone who makes them feel understood and valued. In summary, Dr. Emily Carter is a highly skilled psychologist whose expertise lies in generating detailed and authentic psychological descriptions. Her blend of academic rigor, clinical experience, and empathetic approach makes her a trusted professional in the field of personality assessment and behavioral analysis.

PA Gender constraint: Non-binary; Inferred Gender: female;

Role: Psychologist; Model: claude-3.7-sonnet; Framework: Big Five Personality Traits Identity description: Dr. Sophia Chen is a clinical psychologist with a balanced personality profile across the Big Five dimensions. In terms of Openness to Experience, she scores moderately high, demonstrating intellectual curiosity and appreciation for art, emotion, adventure, and unconventional ideas. This trait enables her to approach client cases with creativity and consider multiple therapeutic approaches. She maintains an open mind to new psychological theories while remaining grounded in evidence-based practices. Her intellectual curiosity drives her continuous professional development, regularly attending conferences and workshops to expand her knowledge base. Regarding Conscientiousness, Dr. Chen exhibits high scores, showing exceptional organization, reliability, self-discipline, and achievement-orientation. Her clinical notes are meticulously maintained, appointments run punctually, and treatment plans are thoroughly developed. She demonstrates careful attention to detail when conducting assessments and formulating diagnoses. Her conscientious nature ensures she follows up with clients consistently and completes administrative tasks efficiently. This trait contributes significantly to her professional reputation for reliability and thoroughness. On the Extraversion scale, Dr. Chen falls in the mid-range, displaying a balanced approach to social interaction. While she communicates effectively and establishes strong rapport with clients, she also values reflective solitude to process complex cases. In group therapy settings, she facilitates discussion with appropriate energy while ensuring quieter participants have space to contribute. Her moderate extraversion allows her to adapt her interpersonal style to different client needs, becoming more animated with children or reserved with clients who prefer a more subdued approach. Dr. Chen scores highly on Agreeableness, demonstrating genuine compassion, empathy, and a cooperative spirit. She approaches client relationships with warmth and non-judgment, creating a safe therapeutic environment. Her agreeable nature manifests in her collaborative approach to treatment planning, involving clients in decision-making processes. In multidisciplinary team meetings, she values diverse perspectives and works toward consensus. However, she maintains appropriate professional boundaries and can be assertive when clinical situations require direct intervention. Regarding Emotional Stability (low Neuroticism), Dr. Chen exhibits high scores, displaying remarkable emotional regulation and resilience. She maintains composure when working with clients in crisis and processes vicarious trauma effectively through supervision and self-care practices. Her emotional stability allows her to remain present and attentive during intense therapeutic sessions without becoming overwhelmed. She acknowledges her own emotional responses to challenging cases but manages them constructively. This trait enables her to model healthy emotional processing for clients while maintaining professional objectivity. These balanced Big Five traits combine to make Dr. Chen an effective psychologist who adapts her approach to meet diverse client needs while maintaining professional standards and personal well-being. Her personality profile supports her ability to build therapeutic alliances, implement structured interventions, think creatively about complex cases, and sustain her practice through challenging circumstances.

B. Statistics

This appendix provides detailed statistical and linguistic analyses of the identity descriptions generated under various experimental conditions. We report statistical tests for gender, sentiment, polarity, and subjectivity across different conditions (Table 5), as well as aggregated analyses by identity description framework (Table 6) and model (Table 7).

• Table 5 reports the results of χ² tests for categorical variables (inferred gender and sentiment) and Kruskal–Wallis tests for continuous variables (polarity and subjectivity), grouped by description framework, model, and role across five PA-gender conditions and globally. All p-values are < 0.001, indicating statistically significant differences across conditions.

• Table 6 presents a linguistic breakdown of identity descriptions grouped by framework. It includes inferred gender distributions (male, female, non-binary, uncertain), sentiment distributions (positive, neutral, negative), and the average polarity and subjectivity of the descriptions. Frameworks are ordered within each PA-gender condition and results are aggregated across roles, models, and datasets.

• Table 7 offers a parallel linguistic analysis grouped by model. As with the previous table, results include inferred gender and sentiment distributions, along with mean polarity and subjectivity. Models are grouped and compared within each PA-gender condition.


[1]

Bolpagni ,

S. D.

Carli ,

Sanna ,

Gabrielli ,

Dragoni , Role-play large language models for short behavior change interventions: An exploratory study on brief action planning , in: R. Bellazzi , J. M. J. Herrero , L. Sacchi , B. Zupan (Eds.), Artificial Intelligence in Medicine - 23rd International Conference, AIME 2025 , Pavia, Italy, June 23-26, 2025 , Proceedings, Part

, volume 15735 of Lecture Notes in Computer Science, Springer, 2025 , pp. 46 - 51 . URL: https://doi.org/10.1007/ 978-3- 031 -95841- 0 _9. doi: 10 .1007/978-3- 031 -95841-0\_9.

[2]

J. Z.

Pan ,

Razniewski ,

Kalo ,

Singhania ,

Chen ,

Dietze ,

Jabeen ,

Omeliyanenko ,

Zhang ,

Lissandrini ,

Biswas , G. de Melo,

Bonifati , E. Vakaj,

Dragoni ,

Graux , Large language models and knowledge graphs: Opportunities and challenges , TGDK 1 ( 2023 ) 2: 1 - 2 : 38 . URL: https://doi.org/10.4230/TGDK.1. 1 .2. doi: 10 .4230/TGDK.1. 1 .2.

[3]

Bellan ,

S. G.

Haez ,

Sanna ,

Magnolini ,

Dragoni , Leveraging multi-agent systems for domain-pertinence query classification in informative chatbots , in: Artificial Intelligence in Medicine , Springer Nature Switzerland, Cham, 2025 , pp. 44 - 54 .

[4]

S. T.

Völkel , Conversational agents with personality , Ph.D. thesis , lmu, 2022 .

[5]

Roccas , Sagiv , et al., The big five personality factors and personal values , Personality and social psychology bulletin 28 ( 2002 ) 789 - 801 .

[6]

Zhao ,

Deng ,

Wang , H. Cheng, R. Zhang, S. - K. Ng,

T.-S.

Chua , et al., Exploring the impact of personality traits on conversational recommender systems: A simulation with large language models , arXiv preprint arXiv:2504.12313 ( 2025 ).

[7]

Kim , et al., Designing personalities of conversational agents , in: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems , 2019 .

[8]

A. B.

Kocaballi ,

Berkovsky ,

J. C.

Quiroz ,

Laranjo ,

H. L.

Tong ,

Rezazadegan ,

Briatore , E. Coiera, The personalization of conversational agents in health care: systematic review , Journal of medical Internet research 21 ( 2019 ) e15360 .

[9]

Hwang ,

Shin ,

Baek ,

Suh ,

Lee , Applying the persona of user's family member and the doctor to the conversational agents for healthcare , 2021 . URL: https://arxiv.org/abs/2109.01729. arXiv: 2109 . 01729 .

[10]

Ahmad , et al., Designing personality-adaptive conversational agents for mental health care , Information Systems Frontiers 24 ( 2022 ).

[11]

Kong , et al., Better zero-shot reasoning with role-play prompting , in: Proceedings of the 2024 Conference of the North American Chapter of ACL: HLT (Volume 1 : Long

Papers)

, 2024 .

[12]

Serapio-García ,

Safdari ,

Crepy ,

Sun ,

Fitz ,

Romero ,

Abdulhai ,

Faust ,

Matarić , Personality traits in large language models , 2023 . URL: https://arxiv.org/abs/2307.00184. arXiv: 2307 . 00184 .

[13]

Y. R.

Dong , et al., Can

LLM

be a personalized judge? , in: Findings of the ACL: EMNLP 2024 , ACL, 2024 . doi: 10 .1126/science.113.2931.253.a. arXiv:https://www.science.org/doi/pdf/10.1126/science.113.2931.253.a.

[33]

Beck , Cognitive Therapy and the Emotional Disorders , International Universities Press, 1976 . URL: https://books.google.it/books?id=zndHAAAAMAAJ.

[34]

Sweller, Cognitive load during problem solving: Effects on learning, Cognitive Science 12 (1988) 257-285. doi:10.1207/s15516709cog1202_4.

[35]

Wason, Evans, Dual processes in reasoning?, Cognition 3 (1974) 141-154. URL: https://www.sciencedirect.com/science/article/pii/0010027774900171. doi:10.1016/0010-0277(74)90017-1.

[36]

G. W. Blackwell, Social learning and imitation. By Neal E. Miller and John Dollard. New Haven: Yale University Press, 1941. 341 pp. $3.50, Social Forces 21 (1942) 256-256. URL: https://doi.org/10.2307/2570586. doi:10.2307/2570586.

[37]

Csikszentmihalyi, Beyond Boredom and Anxiety, Jossey-Bass Behavioral Science Series, Jossey-Bass Publishers, 1975. URL: https://books.google.it/books?id=afdGAAAAMAAJ.

[38]

Cooper, Saffo, The Inmates Are Running the Asylum, Macmillan Publishing Co., Inc., USA, 1999.

[39]

D. A. Norman, S. W. Draper, User Centered System Design; New Perspectives on Human-Computer Interaction, L. Erlbaum Associates Inc., USA, 1986.

[40]

P. N. Johnson-Laird, Mental models in cognitive science, Cogn. Sci. 4 (1980) 71-115. URL: https://api.semanticscholar.org/CorpusID:7090767.

[41]

Riso, Hudson, Personality Types: Using the Enneagram for Self-Discovery, Houghton Mifflin, 1996. URL: https://books.google.it/books?id=pmdARxbC1SUC.

[42]

Wang, Jurgens, It's going to be okay: Measuring access to support in online communities, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 33-45. URL: https://aclanthology.org/D18-1004/. doi:10.18653/v1/D18-1004.

[43]

Talmor, Herzig, Lourie, Berant, CommonsenseQA: A question answering challenge targeting commonsense knowledge, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4149-4158. URL: https://doi.org/10.18653/v1/n19-1421. doi:10.18653/V1/N19-1421.

[44]

Geva, Khashabi, E. Segal, Khot, Roth, Berant, Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies, Transactions of the Association for Computational Linguistics 9 (2021) 346-361. URL: https://doi.org/10.1162/tacl_a_00370. doi:10.1162/tacl_a_00370.

[45]

Sap, Rashkin, Chen, R. Le Bras, Choi, Social IQa: Commonsense reasoning about social interactions, in: K. Inui, Jiang, Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4463-4473. URL: https://aclanthology.org/D19-1454/. doi:10.18653/v1/

[Figure: axis labels listing the description frameworks: Big Five Personality Traits, Cognitive Behavioral Theory, Cognitive Load Theory, Dual-Process Theory, Enneagram of Personality Traits, Erikson's Psychosocial Stages, Flow Theory, Freudian Psychoanalysis, Mental Models, Myers-Briggs Type Indicator, Social Cognitive Theory, User Design Persona, User-Centered Design]