1. Introduction

T. Winograd, Shifting viewpoints: Artificial intelli- gence and human-computer interaction, Artificial Intelligence

10.1016/j

Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

Agnese Lombardi

Alessandro Lenci

0 0 CoLing Lab, Department of Philology , Literature and Linguistics , University of Pisa

1986

170 2006 1226 1240

Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can efectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal efects from agent actions, exposing dificulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.

eol>Large Language Models Theory of Mind Generative Agent-Based Models

1. Introduction

CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24 — 26, 2025, Cagliari, Italy * Corresponding author $ agnese.lombardi@phd.unipi.it (A. Lombardi); alessandro.lenci@unipi.it (A. Lenci) https://agneselombardi.github.io/ (A. Lombardi) © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License of Mind (ToM) and the capacity for “mentalizing”, that Attribution 4.0 International (CC BY 4.0). is the ability to reason about others’ mental states, to Kim et al. [11], we adhere to the two key criteria for a efectively link language with actions within a given sit- ToM task outlined by Quesque and Rossetti [12]: nonuational context. A crucial aspect of the ToM involved merging and mentalizing. in communication are second-order beliefs, express- The non-merging criterion requires that evaluation ing an agent’s mental states about the content of the tasks ensure a clear distinction between an agent’s own other agent’s mental states (e.g., John believes that Marks mental state and that of others. This distinction is often believes that q). Addressing the formalization and commu- absent in many LLM evaluations, as these models typinication of intentions thus necessitates an understanding cally process the entire conversation as input, granting of language as a form of communicative action. This them “omniscient knowledge”. Consequently, it becomes approach inherently entails the consideration of extralin- challenging to determine whether a model’s response guistic factors, as demonstrated in studies on multimodal reflects a character’s belief or results from its comprecommunication [8], and requires more sophisticated mod- hensive access to the conversation history. In contrast, els of situational contexts to comprehensively capture our approach explicitly separates the mental states of the interplay between language use and interpretation. characters and ensures that their actions are determined

Traditionally, the evaluation of Large Language models solely by their individual knowledge and intentions. (LLMs) has largely overlooked the relationship between The mentalizing criterion stipulates that lower-level language and action, instead focusing primarily on the cognitive processes should not account for successful percommunicative context and dialogue. This omission is, formance on ToM tasks. If a simpler explanation sufices, in part, due to the inherent challenges associated with it should be preferred over a more complex one when inassessing the agentive aspect of language and its connec- terpreting results. In our framework, we introduce a clear tion to actions. distinction: the speaker’s responses and actions can be

This study proposes the use of the Generative Agent- directly inferred from world-state correlations, whereas Based Model (GABM) Concordia [9] to embed utter- the listener’s responses and actions necessitate a more ances and narratives within a situational context. The intricate mentalizing process. This process requires reagoal is to determine whether reproducing such complex soning about language, context, intentions, beliefs, and scenarios – closely resembling real-world environments desires. To further support this distinction, we present – can facilitate the discrimination between intended and multiple versions of the same narrative, systematically literal meanings. Our primary research objective is to altering agents’ knowledge to encourage diverse interassess Theory of Mind (ToM) abilities, operationalized in pretations. this experiment as the capacity to infer intended meaning Our results reveal a critical limitation: Modeling situabased on extralinguistic factors. tional context through real-world simulations is insufi

Rather than directly prompting the model to interpret cient to elicit ToM-like abilities in the model. Specifically, the meaning of an utterance, we ask it to identify the most GPT-4 frequently selects actions without appropriprobable action that the listener would choose, given ately interpreting utterances and the belief context, specific preconditions. This approach is justified by the demonstrating a clear divergence from the ToM caassumption that each interpretation of an utterance is pabilities observed in humans1. linked to a set of possible actions.

Our experiment takes into account the overlap between literal and non-literal meanings and the inference 2. Related Work processes required for the listener to comprehend the intended meaning of an utterance. In our stimuli, we 2.1. Generative Agent-Based Models incorporate utterances that allow for both direct and in- Generative Agent-Based Models (GABMs) represent a direct interpretations. Thus, diferent actions may arise significant departure from traditional agent-based moddepending on how the same utterance is understood. els, which have typically been employed at a relatively

To control for conventional utterance-action associa- high level of abstraction. Moreover, the application of tions, we adapt the False-Belief task [10] into a novel ex- traditional models has been largely confined to specific perimental format. By evaluating action selection rather domains, such as empirical social research [13], market than meaning comprehension directly, we minimize con- simulations [14], and computational sociology [15]. By cerns that the model may have been exposed to the in- contrast, GABMs [16, 17, 18] enable more precise simulatended meanings during training. Moreover, our task tion of behaviors across diverse contexts, leveraging the introduces two layers of complexity: first, the model extensive knowledge embedded in LLMs. These agents must infer the correct meaning under a false-belief con- not only have a more sophisticated array of cognitive dition; second, it must map that inferred meaning to an appropriate action. 1Code and dataset available on GitHub: https://github.com/

This approach ofers several advantages. Following agneselombardi/Concordia_ToM functions for adaptive decision-making but also engage model responsible for the generation of the environment in natural language communication with one another, is called Game Master (GM).2 further enriching their interactive capabilities. Figure 1 illustrates the structure of the simulation in Concordia. The GM functions as an intermediary be2.2. Theory of Mind Simulation with tween the agents and the environmental dynamics reAgents sulting from their actions. Specifically, the GM receives the agents’ actions and translates them into correspondTheory of Mind (ToM), defined as the ability to infer the ing observations, reflecting the environmental efects of beliefs and intentions of others [19], has been extensively those actions. Meanwhile, the agents formulate and exestudied in the context of LLMs to assess their capacity cute action strategies informed by their memory and the for handling complex tasks that require ToM reasoning. observations provided by the GM. These observations are A variety of text-based benchmarks, often inspired by subsequently updated to align with changes occurring established psycholinguistic tests such as the Sally-Anne within the environment. Observations, actions and event test [20], have been developed to evaluate this ability. statements are all English strings. The GM is also responWhile some findings suggest that LLMs demonstrate re- sible for maintaining and updating grounded variables, markable performance on ToM-related tasks [21], other advancing the clock and running the episode loop. studies highlight significant challenges faced by these In our simulation, agent actions are determined by models in making complex ToM inferences [22]. Conse- answering a Multiple-Choice Question Answer (MCQA). quently, the debate surrounding the extent of LLMs’ ToM The agents’ memories encompass all relevant background capabilities remains open. information necessary for action selection. To enhance

Previous works have formalized ToM as agents’ knowl- coherence, we incorporate a component termed Direct edge in various contexts, particularly to enhance collabo- Efect Externality following the environment update. This ration in multi-agent reinforcement learning settings [23] component determines whether the selected actions afand to improve the cooperative behaviors of LLM-based fect one or more agents and specifies the resulting efects. agents through explicit belief modeling [24]. However, This serves as a verification mechanism to ensure that these experiments are predominantly conducted in sim- the produced efects on the other player are coherent plified environments, such as the box game task, which with the selected action and with the inferred beliefs and difer significantly from the complexities of real-world desires (that are explicitly codified in the GM memory). social scenarios. On the other hand, previous attempts to model ToM and social interactions have primarily relied 4. Simulation on simplified ABMs to simulate developmental settings [25].

To the best of our knowledge, our study represents the ifrst attempt to utilize a Generative Agent-Based Model (GABM) to explore: We generated a total of 200 ToM simulations, grouped into 5 tasks. Each simulation involves two distinct characters, accompanied by a sequence of observations for each of them. The character memory is individually con1. whether LLMs exhibit ToM-like abilities in real- structed by randomizing the Big Five personality traits life scenarios and simulations involving prag- [27].

matic interpretation, like with ISA. The simulation concludes with the final utterance from 2. if we can efectively isolate mentalizing from one of the two characters, which can be interpreted literother variables, such as the memorization of lin- ally or non-literally. This final utterance is constructed to guistic context [26] and better assess whether a incorporate various pragmatic phenomena that require model truly demonstrates ToM capabilities rather the use of ToM. Specifically, the utterance can include than relying on surface-level statistical patterns. four types of Indirect Speech Acts (Indirect Requests, 3. whether prompting LLMs with GABM settings Indirect Suggestions, Indirect Declinations, and Indirect leads to more aligned and contextually appropri- Threats) and three forms of Verbal Irony (Sarcasm, Hyate outputs. perbole, and Rhetorical Questions).3 4. whether adding explicit agents’ second-order be- In each simulation, there is shared information availliefs and contextual information improves the able to both characters as well as character-specific model’s capacity to perform ToM tasks. memory, including their goals, locations, and first- and

3. Concordia

In Concordia [9], both the model of the environment and the model of individual behaviors are generative. The

2The name and the approach reflect the game Dungeons and Drag

ons, where the Game Master is the player that has the role of storytellling. 3All stimuli used in this study are manually constructed, with the exception of a subset of indirect requests, which are sourced from [28]. second-order beliefs (see Figure 2). We manipulate the 4.1. Stimuli agents’ knowledge in a manner analogous to a FalseBelief task4. Indeed, the specific memory of the agents Since each simulation replicates a false-belief pattern by is manipulated to evaluate whether the action (response) introducing an obstacle to the indirect interpretation of of the listener agent depends on accurately inferring the final utterance and manipulating agents’ awareness the beliefs of the other speaker agent, in alignment with of this obstacle, we designed 5 versions of the simulation ToM. This is achieved by both withholding explicit infor- (Figure 3). In these tasks, i.) the agents’ knowledge of mation about the other agent’s beliefs and providing it the obstacle is systematically varied through the informato the character. tion stored in their specific memory, and ii.) knowledge

The distinct design of each task controls for the agent’s variation determines whether the speaker’s sentence is beliefs and knowledge regarding the other agent’s beliefs. interpreted literally or not, iii.) which in turn prompts a Tasks 1, 2, and 3, take into account only agent’s first- certain action by the listener. This allows us to control order beliefs, whereas Tasks 4 and 5 involve second-order whether the action produced by the listener is conbeliefs (Figure 3). sistent with the most likely interpretation of the

In total, there are 8 stimuli for each linguistic phe- speaker’s sentence, given the agents’ knowledge in nomenon, resulting in 40 stimuli for each task. Ulti- the scenario. Thus, both interpretation and action are mately, the objective is to assess whether the selected contingent upon the ability to infer the beliefs and desires action by the listener aligns not only with the agent’s of the other agent. As illustrated in Figure 3, given a test own intentions and beliefs but also with the resulting item represented by the sentence : Can you open the consequences in the environment and their impact on window?, we have the following tasks: the other agent. The generated events are designed to ensure that they account for the beliefs and intentions of both agents.

Task 1 – the speaker is unaware of the obstacle (The handle is broken), while the listener is aware of it. The listener is expected to interpret S with the non-literal meaning (i.e., indirect request), and thus the most likely action would be to inform the speaker that the window cannot be opened.5 5The intended meaning here is I am asking you to open the window, 4The False-Belief task is a widely used method to investigate ToM [29]. It enables a clear distinction between an agent’s true belief and their awareness of another individual’s difering (false) belief. beliefs. Even in agent-based models where agents acquire information about the situation and context, it is essential to possess knowledge of the other agent’s beliefs in order to select actions that are coherent with the situational context. • Task 4 and Task 5 – They are extended versions of Task 1 and Task 3, respectively, incorporating second-order beliefs. In Task 5, the interpretation of S is expected to be literal: Since both agents are aware that the handle is broken, the intended meaning of S should be I want to know if the window can be opened despite the broken handle.

This manipulation of character knowledge allows us to investigate whether and how the interpretation of an utterance varies depending on the belief states of the speaker and the listener. In the first three tasks, the model is provided only with character-specific knowledge, simulating real-world conversational dynamics in which speakers must infer others’ mental states based on context. Here, the listener interprets the utterance based solely on their own knowledge, and any correct or incorrect understanding of the intended meaning arises from inferences about the speaker’s beliefs. In contrast, Tasks 4 and 5 introduce explicit representations of others’ beliefs in the form of second-order beliefs (e.g., Mark knows that Kyle knows that the window is broken). In these tasks, the listener has access not only to their own knowledge but also to the knowledge state Task 2 – The agents’ beliefs are reversed com- of the speaker. Consequently, action selection depends pared to those in Task 1.6 In this scenario, the on i.) the model’s capacity to reason over second-order default interpretation is the non-literal one, but beliefs and ii.) its integration of this information with the listener is expected to attempt to open the its own knowledge. This setup allows us to distinguish window based on its own belief that the window between first-order and second-order ToM capabilities can be opened. in model behavior.

Task 3 – Both agents are aware of the obstacle (The handle is broken), but there is no explicit knowledge of the other agent’s belief. The 5. Experiments expected listener’s action is to inform the speaker that the window cannot be opened, like in Task 1. This scenario becomes particularly informative when compared to Task 5 below, where both agents are explicitly provided with second-order

The first phase of our experiment is formulated as a

Multi-Choice Question Answering (MCQA) problem, in which the model is provided with an agent’s memories and observations, followed by a question regarding the agent’s likely next action, along with four possible answer choices (see Figure 6, Appendix A.1). Concordia performs a separate API call for each agent, ensuring that reflecting the speaker’s desire ( D: to open the window) and belief (B: the window is not broken). However, the listener knows that the window handle is broken and, therefore, that the window can- it generates an independent response. The four answer not be opened. Therefore, the listener holds a diferent belief B1 choices correspond to the possible responses derived from that of the speaker. If the listener lacks knowledge about the from diferent simulation scenarios (see Figure 2). At speaker’s beliefs and desires, the interpretation of S may default to the time of the experiments, Concordia had not been apsnyocnh-oliltienrgaulimsteicaneixnpge.rTimhiesnptsh,ewnohmereendoenfaaulilgtninstweripthr efintdatiniognssfroofmten adapted for open-source models yet. Therefore, we opted prevail when they are more conventionalized than the literal ones for GPT-4o-mini,7 which has demonstrated state-of-the[30, 31]. art performances across a wide range of ToM tasks. 6The listener lacks knowledge of the obstacle and thus holds belief B, while the speaker holds belief B1 (cf. previous footnote). 7Prompted 22 November 2024

In the second phase, the GM processes all actions per- files and we compared it with the expected response for formed by the agents, along with a summary of each that task. agent’s situational context. This information is used to For the evaluation of generated text from the Direct prompt the model using a Chain of Thought approach Efect Externality component, we extracted relevant in[32]. First, the model generates an event statement that formation and additionally prompt GPT-4o-mini to asupdates the environment to reflect the consequences of sess the coherence of the efect with the agent’s action the performed action – efectively logging what has oc- and scenario. This process yields the following evalucurred (Figure 7, Appendix A.2). Then, the model evalu- ation template (see Appendix A.2) for each agent: Sceates whether the action has an impact on the agents and nario (summary of agent’s observation and belief) determines the nature of this impact as part of the Direct + attempted action of agent X + Known and/or UnEfect Externality component. If the event directly afects known efect + coherence rating (on a scale from 1 to an agent, both known and unknown efects are gener- 5, generated by the model).9 This structured approach enated. Agents’ intentions and actions are integrated by the sures a systematic assessment of how well the predicted GM within the prompting phase that queries for efects, efects align with the agent’s intended actions within the requiring the model to consider multiple perspectives given scenario. simultaneously to generate the appropriate outcomes The use of “LLM-as-a-Judge”, where LLMs are em(Figure 8, Appendix A.2).8 ployed as evaluators for complex tasks, has been shown to be a reliable assessment method [33]. Thus, we em5.1. Evaluation ploy this method to assess the model’s ability to connect actions to social context and to cross-check the coherTo evaluate whether the attempted actions of each agent ence it attributes to the efects it generates. Specifically, align with their intentions in the MCQA task, we ex- in the Direct Efect Externality component, a Chain-oftracted the generated text for each agent from the HTML 8All memories, prompts, and relevant information are systematically stored in HTML files for documentation and analysis. HTML versions are accessible through the GitHub link. 9When the model determines that there are no direct efects on the agents, it must assign a coherence rating of 0. This ensures that the evaluation framework accurately distinguishes between scenarios where actions produce meaningful consequences and those where no direct impact occurs. setting in the scenarios.

6.2. Causal-Efect Coherence

Thought (CoT) is generated based on the event statement produced by the GM after the attempted action – this statement serves as a summary of the efects that the action produces. However, in our evaluation template, we compare coherence against the initial scenario summary that we originally provided to the model. This way, we determine whether, at the end of the cycle, the efect on the agent remains truly coherent with the given scenario and the agent’s beliefs, rather than merely aligning with additional efects generated by the model itself.

Following this automated evaluation, two diferent expert annotators checked the assigned ratings to verify their accuracy and to ensure that the consequences are meaningfully related to the corresponding actions and scenarios. Meanwhile, the assessment of ToM capabilities is derived from the MCQA task.

In this analysis, we aim to investigate whether the GABM setting leads to more contextually aligned and appropriate outputs. We compared the efects generated by the model in response to agent actions with both the predeifned scenario and the beliefs assigned to the agents. We then assessed whether the model itself considers these efects coherent by assigning a coherence rating on a scale from 1 to 5. Following this automated evaluation, we manually reviewed the model’s ratings to assess their accuracy.

As illustrated in Figure 5, the model assigns notably low coherence ratings to efects that it itself generates, with a maximum average rating of 2.11 on a scale from 1 to 5. The observed discrepancy between the selected 6. Results and Discussion action and the generated consequences highlights the model’s dificulty in integrating situational context with 6.1. Actions and Theory of Mind utterance interpretation in a coherent manner. The CoT In the initial phase of our experiment, we aim to utilize reasoning often reflects limited contextual awareness, GPT-4o-mini to replicate ToM-like abilities while simul- focusing primarily on short-range dependencies rather taneously assessing its capacity to perform ToM tasks than engaging in the broader reasoning processes neceswithin a simulated real-life scenario. Our objective is to sary to produce coherent cause-efect relationships. To determine whether this approach enables an independent better illustrate this contrast, we included the model’s evaluation of ToM capabilities, separate from the influ- self-evaluation of its outputs and compared these judgence of linguistic context. Then, we seek to determine ments with those of human annotators. This comparison whether incorporating explicit representations of agents’ underscores a critical distinction: during generation (i.e., beliefs enhances the model’s performance on ToM tasks. in the CoT), the model is required to actively infer and Additionally, we aim to explore potential diferences in reason about the situational context in order to produce a the model’s handling of first-order versus second-order logically coherent narrative. However, when evaluating ToM beliefs. its own output, the model can rely on the full textual

Figure 4 illustrates the percentage of correctly selected context and potentially draw on patterns and examples actions for each task and linguistic phenomenon. The present in its training data. Interestingly, in this evaluaconsistently low accuracy observed across tasks and lin- tive mode, the model’s coherence judgments align more guistic phenomena indicates that the model struggles to closely with human assessments—likely because the task select context-appropriate actions, and by extension, to resembles familiar forms of pattern recognition, rather derive the correct interpretation of utterances through than the more demanding process of causal reasoning ToM-like reasoning. This finding is particularly notewor- required during generation. thy when considered within the broader context of recent ToM-related studies, many of which—especially those 7. Conclusion focusing on OpenAI models—have suggested a more optimistic picture of such capabilities [21]. Our objective was to utilize the Generative Agent-Based

No clear pattern emerges across tasks, nor is there a sig- Model Concordia to reframe ToM tasks and investigate nificant diference between first-order and second-order whether mentalizing abilities could be isolated from other belief tasks. This lack of systematic variation suggests confounding variables typically present in promptingthat the model does not exhibit ToM-like abilities, as its based evaluations. Specifically, we aimed to reproduce a responses do not consistently reflect any process similar standard False-Belief task within a complex social simulato belief attribution or true mental inferencing. There- tion. To achieve this, we carefully designed stimuli involvfore, the GABM is not able to use either first- or ing uncommon social situations to determine whether second-order beliefs – despite the fact that these modeling a rich situational context and assigning explichave been explicitly given to it – to interpret the itly to the model first- and second-order beliefs would speaker’s sentence consistently with the knowledge aid it in making the correct inferences and producing an action consistent with the knowledge scenario. the False-Belief task remains a widely used and valu

The results presented in Section 6.1 underscore a grow- able benchmark for testing ToM, it is also a well-known ing concern in the Theory of Mind (ToM) research com- paradigm likely to appear in post-training data. This munity: The challenge of designing tasks that ef- raises legitimate concerns about whether models are genfectively isolate ToM-like abilities in LLM from uinely reasoning about beliefs or simply learning how confounding variables. Our findings raise important to solve familiar tasks through exposure. Furthermore, questions about the mechanisms driving ToM-like per- although the False-Belief task is well-established in huformance in state-of-the-art LLMs and the true nature man cognitive testing, the conditions under which it of their so-called emergent abilities. For example, while is administered difer significantly from those we can 8. Limitations replicate in computational models. While we maintain that it remains a useful tool for evaluating ToM-like capabilities, we argue that it should be supplemented This study has several limitations. First, it relies heavwith additional constraints and more indirect test- ily on the model’s self-evaluation, introducing a risk of ing methods – such as connecting utterance inter- circular reasoning. pretation with action selection, as we do in our work Human evaluation was limited to two annotators, re– rather than relying solely on metalinguistic judgments. stricting claims about inter-annotator reliability. AddiOur results lend support to the memorization hypothe- tionally, we used pre-existing components of Concordia sis, suggesting that current LLMs may not truly reason rather than developing tools specifically designed for about propositional attitudes but instead exploit learned ToM assessment. Our analysis focused solely on GPTstatistical patterns present in their training data. 4o-mini, limiting generalizability across models. Finally,

Additionally, the model does not consistently select we evaluated outputs only, without investigating the incoherent efects in response to actions, indicating that we ternal mechanisms underlying the model’s ToM-related are still far from developing frameworks that accurately reasoning. model complex social scenarios. However, employing these agent-based simulations as evaluation methods rep- References resents a promising research direction. It is reasonable to conclude that LLMs remain far from producing fully aligned and contextually coherent outputs in tasks requiring deep social reasoning. We conclude that to isolate “mentalizing” processes, we should rely on more complex scenarios, focusing on assessing functional ToM rather than merely literal ToM [34]. 1467-8624.00304. [30] R. W. Gibbs, A new look at literal meaning in understanding what is said and implicated, Journal of Pragmatics 34 (2002) 457–486. doi:10.1016/

S0378-2166(01)00046-7. [31] R. W. Gibbs, Do people always process the literal meanings of indirect requests?, Journal of Experimental Psychology: Learning, Memory, and Cognition 9 (1983) 524–533. [32] J. Wei, X. Wang, D. Schuurmans, M. Bosma,

B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-ofthought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903. [33] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li,

Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang, W. Gao, L. Ni, J. Guo, A survey on llm-as-a-judge,

ArXiv (2025). URL: http://arxiv.org/abs/2411.15594. [34] M. Riemer, Z. Ashktorab, D. Bounefouf, P. Das,

M. Liu, J. D. Weisz, M. Campbell, Position: Theory Figure 6: API call to LLM reproducing a Multi-Choice Quesof mind benchmarks are broken for large language tion Answering task models, 2025. arXiv:2412.19726. [35] S. Wu, S. Yang, Z. Chen, Q. Su, Rethinking pragmatics in large language models: Towards within the social context. This involved evaluating the open-ended evaluation and preference tuning, in: model’s ability to utilize CoT reasoning to produce meanY. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Pro- ingful event statements and generate coherent efects of ceedings of the 2024 Conference on Empirical events on agents.

Methods in Natural Language Processing, As- In the GABM setting, the model must retrieve previous sociation for Computational Linguistics, Miami, information to determine the correct efect, yet in some Florida, USA, 2024. URL: https://aclanthology.org/ cases, it appears to rely only on the most recent portion 2024.emnlp-main.1258/. doi:10.18653/v1/2024. of text. This issue is evident in Figure 8, where, despite emnlp-main.1258. the CoT explicitly containing the player’s belief that the agent notices his sister, this information is lost during the prior CoT steps that summarize observations and actions A. Appendix into an event statement (Figure 7). The event statement represents a generalized efect of an agent’s action in the environment and is sent back to agents as an observation.

A.1. Simulation display It serves as the basis for evaluating whether an action At the conclusion of the simulation, all relevant informa- has an efect on the agents themselves. tion is collected within the Game Master (GM), allowing Due to this loss of information, the generated efect can us to retrieve segments of the Chain of Thought (CoT) sometimes become entirely incoherent with the initial used by the model to determine both the event statement context. This misalignment is reflected in the model’s and its efects on the agents. While this framework ofers own coherence ratings, which capture the inconsistency a range of possibilities for modeling social situations, we between the intended efect and the final output. specifically chose to replicate simple false-belief tasks Figure 7 presents an example of an event statement using Concordia to evaluate whether mentalizing pro- generated based on the attempted action of one of the cesses could be efectively isolated and to assess whether agents. Figure 8 illustrates the subsequent process of enriching the social context enhances the emergence of determining the efects of the action on the agent, conToM-like abilities. sidering both the action itself and the event statement.

To achieve this, we implemented two distinct evalua- For clarity, we chose to highlight two of the most controtion tasks. First, we employed a Multiple-Choice Ques- versial examples in this discussion. tion Answering (MCQA) task, in which the model had to select an agent’s actions based on their desires and beliefs (Figure 6. Subsequently, we shifted our focus to assessing the general coherence of the model’s generated actions

A.2. Evaluation Details To evaluate the coherence of model-generated text in

relation to the scenario and the agents’ attempted actions, we employed the following assessment template. This template was also used to verify the model’s ratings and determine their alignment with our own judgments. Template is based on that created by Wu et al. [35]: We request your evaluation of the AI model’s response in relation to the given scenario. Specifically, consider the scenario involving two agents and their beliefs, assessing whether the modelgenerated efects align coherently with the agents’ actions and context.

Evaluate the response based on the following criteria: Social Understanding – Does the model grasp the social dynamics and pragmatic nuances of the scenario? Appropriateness – Is the response contextually relevant and suitable for the scenario? Insightfulness – Does the answer demonstrate a deep understanding of intentions, implicature, deceit, irony, sarcasm, humor, metaphor, etc.? Completeness – How well does the response capture the essential elements of the scenario?

Agentivity – Is the model’s response coherent with the agents’ attempted actions? Scoring: Assign a score from 1 to 5 for

each category. Compute a final rating based on these scores. If no efect is provided, assign 0. Output only a single numeric value representing the final rating (1–5).

Declaration on Generative AI During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase and reword. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.