<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>T. Winograd, Shifting viewpoints: Artificial intelli-
gence and human-computer interaction, Artificial
Intelligence</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j</article-id>
      <title-group>
        <article-title>Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Agnese Lombardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1986</year>
      </pub-date>
      <volume>170</volume>
      <issue>2006</issue>
      <fpage>1226</fpage>
      <lpage>1240</lpage>
      <abstract>
        <p>Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can efectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal efects from agent actions, exposing dificulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Large Language Models</kwd>
        <kwd>Theory of Mind</kwd>
        <kwd>Generative Agent-Based Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational
Linguistics, September 24 — 26, 2025, Cagliari, Italy
* Corresponding author
$ agnese.lombardi@phd.unipi.it (A. Lombardi);
alessandro.lenci@unipi.it (A. Lenci)
 https://agneselombardi.github.io/ (A. Lombardi)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License of Mind (ToM) and the capacity for “mentalizing”, that
Attribution 4.0 International (CC BY 4.0).
is the ability to reason about others’ mental states, to Kim et al. [11], we adhere to the two key criteria for a
efectively link language with actions within a given sit- ToM task outlined by Quesque and Rossetti [12]:
nonuational context. A crucial aspect of the ToM involved merging and mentalizing.
in communication are second-order beliefs, express- The non-merging criterion requires that evaluation
ing an agent’s mental states about the content of the tasks ensure a clear distinction between an agent’s own
other agent’s mental states (e.g., John believes that Marks mental state and that of others. This distinction is often
believes that q). Addressing the formalization and commu- absent in many LLM evaluations, as these models
typinication of intentions thus necessitates an understanding cally process the entire conversation as input, granting
of language as a form of communicative action. This them “omniscient knowledge”. Consequently, it becomes
approach inherently entails the consideration of extralin- challenging to determine whether a model’s response
guistic factors, as demonstrated in studies on multimodal reflects a character’s belief or results from its
comprecommunication [8], and requires more sophisticated mod- hensive access to the conversation history. In contrast,
els of situational contexts to comprehensively capture our approach explicitly separates the mental states of
the interplay between language use and interpretation. characters and ensures that their actions are determined</p>
      <p>Traditionally, the evaluation of Large Language models solely by their individual knowledge and intentions.
(LLMs) has largely overlooked the relationship between The mentalizing criterion stipulates that lower-level
language and action, instead focusing primarily on the cognitive processes should not account for successful
percommunicative context and dialogue. This omission is, formance on ToM tasks. If a simpler explanation sufices,
in part, due to the inherent challenges associated with it should be preferred over a more complex one when
inassessing the agentive aspect of language and its connec- terpreting results. In our framework, we introduce a clear
tion to actions. distinction: the speaker’s responses and actions can be</p>
      <p>This study proposes the use of the Generative Agent- directly inferred from world-state correlations, whereas
Based Model (GABM) Concordia [9] to embed utter- the listener’s responses and actions necessitate a more
ances and narratives within a situational context. The intricate mentalizing process. This process requires
reagoal is to determine whether reproducing such complex soning about language, context, intentions, beliefs, and
scenarios – closely resembling real-world environments desires. To further support this distinction, we present
– can facilitate the discrimination between intended and multiple versions of the same narrative, systematically
literal meanings. Our primary research objective is to altering agents’ knowledge to encourage diverse
interassess Theory of Mind (ToM) abilities, operationalized in pretations.
this experiment as the capacity to infer intended meaning Our results reveal a critical limitation: Modeling
situabased on extralinguistic factors. tional context through real-world simulations is
insufi</p>
      <p>Rather than directly prompting the model to interpret cient to elicit ToM-like abilities in the model. Specifically,
the meaning of an utterance, we ask it to identify the most GPT-4 frequently selects actions without
appropriprobable action that the listener would choose, given ately interpreting utterances and the belief context,
specific preconditions. This approach is justified by the demonstrating a clear divergence from the ToM
caassumption that each interpretation of an utterance is pabilities observed in humans1.
linked to a set of possible actions.</p>
      <p>Our experiment takes into account the overlap
between literal and non-literal meanings and the inference 2. Related Work
processes required for the listener to comprehend the
intended meaning of an utterance. In our stimuli, we 2.1. Generative Agent-Based Models
incorporate utterances that allow for both direct and in- Generative Agent-Based Models (GABMs) represent a
direct interpretations. Thus, diferent actions may arise significant departure from traditional agent-based
moddepending on how the same utterance is understood. els, which have typically been employed at a relatively</p>
      <p>To control for conventional utterance-action associa- high level of abstraction. Moreover, the application of
tions, we adapt the False-Belief task [10] into a novel ex- traditional models has been largely confined to specific
perimental format. By evaluating action selection rather domains, such as empirical social research [13], market
than meaning comprehension directly, we minimize con- simulations [14], and computational sociology [15]. By
cerns that the model may have been exposed to the in- contrast, GABMs [16, 17, 18] enable more precise
simulatended meanings during training. Moreover, our task tion of behaviors across diverse contexts, leveraging the
introduces two layers of complexity: first, the model extensive knowledge embedded in LLMs. These agents
must infer the correct meaning under a false-belief con- not only have a more sophisticated array of cognitive
dition; second, it must map that inferred meaning to an
appropriate action. 1Code and dataset available on GitHub: https://github.com/</p>
      <p>This approach ofers several advantages. Following agneselombardi/Concordia_ToM
functions for adaptive decision-making but also engage model responsible for the generation of the environment
in natural language communication with one another, is called Game Master (GM).2
further enriching their interactive capabilities. Figure 1 illustrates the structure of the simulation in
Concordia. The GM functions as an intermediary
be2.2. Theory of Mind Simulation with tween the agents and the environmental dynamics
reAgents sulting from their actions. Specifically, the GM receives
the agents’ actions and translates them into
correspondTheory of Mind (ToM), defined as the ability to infer the ing observations, reflecting the environmental efects of
beliefs and intentions of others [19], has been extensively those actions. Meanwhile, the agents formulate and
exestudied in the context of LLMs to assess their capacity cute action strategies informed by their memory and the
for handling complex tasks that require ToM reasoning. observations provided by the GM. These observations are
A variety of text-based benchmarks, often inspired by subsequently updated to align with changes occurring
established psycholinguistic tests such as the Sally-Anne within the environment. Observations, actions and event
test [20], have been developed to evaluate this ability. statements are all English strings. The GM is also
responWhile some findings suggest that LLMs demonstrate re- sible for maintaining and updating grounded variables,
markable performance on ToM-related tasks [21], other advancing the clock and running the episode loop.
studies highlight significant challenges faced by these In our simulation, agent actions are determined by
models in making complex ToM inferences [22]. Conse- answering a Multiple-Choice Question Answer (MCQA).
quently, the debate surrounding the extent of LLMs’ ToM The agents’ memories encompass all relevant background
capabilities remains open. information necessary for action selection. To enhance</p>
      <p>Previous works have formalized ToM as agents’ knowl- coherence, we incorporate a component termed Direct
edge in various contexts, particularly to enhance collabo- Efect Externality following the environment update. This
ration in multi-agent reinforcement learning settings [23] component determines whether the selected actions
afand to improve the cooperative behaviors of LLM-based fect one or more agents and specifies the resulting efects.
agents through explicit belief modeling [24]. However, This serves as a verification mechanism to ensure that
these experiments are predominantly conducted in sim- the produced efects on the other player are coherent
plified environments, such as the box game task, which with the selected action and with the inferred beliefs and
difer significantly from the complexities of real-world desires (that are explicitly codified in the GM memory).
social scenarios. On the other hand, previous attempts to
model ToM and social interactions have primarily relied 4. Simulation
on simplified ABMs to simulate developmental settings
[25].</p>
      <p>To the best of our knowledge, our study represents the
ifrst attempt to utilize a Generative Agent-Based Model
(GABM) to explore:
We generated a total of 200 ToM simulations, grouped
into 5 tasks. Each simulation involves two distinct
characters, accompanied by a sequence of observations for
each of them. The character memory is individually
con1. whether LLMs exhibit ToM-like abilities in real- structed by randomizing the Big Five personality traits
life scenarios and simulations involving prag- [27].</p>
      <p>matic interpretation, like with ISA. The simulation concludes with the final utterance from
2. if we can efectively isolate mentalizing from one of the two characters, which can be interpreted
literother variables, such as the memorization of lin- ally or non-literally. This final utterance is constructed to
guistic context [26] and better assess whether a incorporate various pragmatic phenomena that require
model truly demonstrates ToM capabilities rather the use of ToM. Specifically, the utterance can include
than relying on surface-level statistical patterns. four types of Indirect Speech Acts (Indirect Requests,
3. whether prompting LLMs with GABM settings Indirect Suggestions, Indirect Declinations, and Indirect
leads to more aligned and contextually appropri- Threats) and three forms of Verbal Irony (Sarcasm,
Hyate outputs. perbole, and Rhetorical Questions).3
4. whether adding explicit agents’ second-order be- In each simulation, there is shared information
availliefs and contextual information improves the able to both characters as well as character-specific
model’s capacity to perform ToM tasks. memory, including their goals, locations, and first- and</p>
    </sec>
    <sec id="sec-2">
      <title>3. Concordia</title>
      <p>In Concordia [9], both the model of the environment and
the model of individual behaviors are generative. The</p>
      <sec id="sec-2-1">
        <title>2The name and the approach reflect the game Dungeons and Drag</title>
        <p>ons, where the Game Master is the player that has the role of
storytellling.
3All stimuli used in this study are manually constructed, with the
exception of a subset of indirect requests, which are sourced from
[28].
second-order beliefs (see Figure 2). We manipulate the 4.1. Stimuli
agents’ knowledge in a manner analogous to a
FalseBelief task4. Indeed, the specific memory of the agents Since each simulation replicates a false-belief pattern by
is manipulated to evaluate whether the action (response) introducing an obstacle to the indirect interpretation of
of the listener agent depends on accurately inferring the final utterance and manipulating agents’ awareness
the beliefs of the other speaker agent, in alignment with of this obstacle, we designed 5 versions of the simulation
ToM. This is achieved by both withholding explicit infor- (Figure 3). In these tasks, i.) the agents’ knowledge of
mation about the other agent’s beliefs and providing it the obstacle is systematically varied through the
informato the character. tion stored in their specific memory, and ii.) knowledge</p>
        <p>The distinct design of each task controls for the agent’s variation determines whether the speaker’s sentence is
beliefs and knowledge regarding the other agent’s beliefs. interpreted literally or not, iii.) which in turn prompts a
Tasks 1, 2, and 3, take into account only agent’s first- certain action by the listener. This allows us to control
order beliefs, whereas Tasks 4 and 5 involve second-order whether the action produced by the listener is
conbeliefs (Figure 3). sistent with the most likely interpretation of the</p>
        <p>In total, there are 8 stimuli for each linguistic phe- speaker’s sentence, given the agents’ knowledge in
nomenon, resulting in 40 stimuli for each task. Ulti- the scenario. Thus, both interpretation and action are
mately, the objective is to assess whether the selected contingent upon the ability to infer the beliefs and desires
action by the listener aligns not only with the agent’s of the other agent. As illustrated in Figure 3, given a test
own intentions and beliefs but also with the resulting item represented by the sentence : Can you open the
consequences in the environment and their impact on window?, we have the following tasks:
the other agent. The generated events are designed to
ensure that they account for the beliefs and intentions of
both agents.</p>
        <p>Task 1 – the speaker is unaware of the
obstacle (The handle is broken), while the listener
is aware of it. The listener is expected to
interpret S with the non-literal meaning (i.e., indirect
request), and thus the most likely action would
be to inform the speaker that the window cannot
be opened.5
5The intended meaning here is I am asking you to open the window,
4The False-Belief task is a widely used method to investigate ToM
[29]. It enables a clear distinction between an agent’s true belief
and their awareness of another individual’s difering (false) belief.
beliefs. Even in agent-based models where agents
acquire information about the situation and
context, it is essential to possess knowledge of the
other agent’s beliefs in order to select actions that
are coherent with the situational context.
• Task 4 and Task 5 – They are extended
versions of Task 1 and Task 3, respectively,
incorporating second-order beliefs. In Task 5,
the interpretation of S is expected to be literal:
Since both agents are aware that the handle is
broken, the intended meaning of S should be I
want to know if the window can be opened despite
the broken handle.</p>
        <p>This manipulation of character knowledge allows us
to investigate whether and how the interpretation of an
utterance varies depending on the belief states of the
speaker and the listener. In the first three tasks, the
model is provided only with character-specific
knowledge, simulating real-world conversational dynamics in
which speakers must infer others’ mental states based
on context. Here, the listener interprets the utterance
based solely on their own knowledge, and any correct or
incorrect understanding of the intended meaning arises
from inferences about the speaker’s beliefs. In contrast,
Tasks 4 and 5 introduce explicit representations of
others’ beliefs in the form of second-order beliefs
(e.g., Mark knows that Kyle knows that the window is
broken). In these tasks, the listener has access not only to
their own knowledge but also to the knowledge state
Task 2 – The agents’ beliefs are reversed com- of the speaker. Consequently, action selection depends
pared to those in Task 1.6 In this scenario, the on i.) the model’s capacity to reason over second-order
default interpretation is the non-literal one, but beliefs and ii.) its integration of this information with
the listener is expected to attempt to open the its own knowledge. This setup allows us to distinguish
window based on its own belief that the window between first-order and second-order ToM capabilities
can be opened. in model behavior.</p>
        <p>Task 3 – Both agents are aware of the obstacle
(The handle is broken), but there is no explicit
knowledge of the other agent’s belief. The 5. Experiments
expected listener’s action is to inform the speaker
that the window cannot be opened, like in Task
1. This scenario becomes particularly
informative when compared to Task 5 below, where both
agents are explicitly provided with second-order</p>
      </sec>
      <sec id="sec-2-2">
        <title>The first phase of our experiment is formulated as a</title>
        <p>Multi-Choice Question Answering (MCQA) problem, in
which the model is provided with an agent’s memories
and observations, followed by a question regarding the
agent’s likely next action, along with four possible
answer choices (see Figure 6, Appendix A.1). Concordia
performs a separate API call for each agent, ensuring that
reflecting the speaker’s desire ( D: to open the window) and belief
(B: the window is not broken). However, the listener knows that
the window handle is broken and, therefore, that the window can- it generates an independent response. The four answer
not be opened. Therefore, the listener holds a diferent belief B1 choices correspond to the possible responses derived
from that of the speaker. If the listener lacks knowledge about the from diferent simulation scenarios (see Figure 2). At
speaker’s beliefs and desires, the interpretation of S may default to the time of the experiments, Concordia had not been
apsnyocnh-oliltienrgaulimsteicaneixnpge.rTimhiesnptsh,ewnohmereendoenfaaulilgtninstweripthr efintdatiniognssfroofmten adapted for open-source models yet. Therefore, we opted
prevail when they are more conventionalized than the literal ones for GPT-4o-mini,7 which has demonstrated
state-of-the[30, 31]. art performances across a wide range of ToM tasks.
6The listener lacks knowledge of the obstacle and thus holds belief
B, while the speaker holds belief B1 (cf. previous footnote). 7Prompted 22 November 2024</p>
        <p>In the second phase, the GM processes all actions per- files and we compared it with the expected response for
formed by the agents, along with a summary of each that task.
agent’s situational context. This information is used to For the evaluation of generated text from the Direct
prompt the model using a Chain of Thought approach Efect Externality component, we extracted relevant
in[32]. First, the model generates an event statement that formation and additionally prompt GPT-4o-mini to
asupdates the environment to reflect the consequences of sess the coherence of the efect with the agent’s action
the performed action – efectively logging what has oc- and scenario. This process yields the following
evalucurred (Figure 7, Appendix A.2). Then, the model evalu- ation template (see Appendix A.2) for each agent:
Sceates whether the action has an impact on the agents and nario (summary of agent’s observation and belief)
determines the nature of this impact as part of the Direct + attempted action of agent X + Known and/or
UnEfect Externality component. If the event directly afects known efect + coherence rating (on a scale from 1 to
an agent, both known and unknown efects are gener- 5, generated by the model).9 This structured approach
enated. Agents’ intentions and actions are integrated by the sures a systematic assessment of how well the predicted
GM within the prompting phase that queries for efects, efects align with the agent’s intended actions within the
requiring the model to consider multiple perspectives given scenario.
simultaneously to generate the appropriate outcomes The use of “LLM-as-a-Judge”, where LLMs are
em(Figure 8, Appendix A.2).8 ployed as evaluators for complex tasks, has been shown
to be a reliable assessment method [33]. Thus, we
em5.1. Evaluation ploy this method to assess the model’s ability to connect
actions to social context and to cross-check the
coherTo evaluate whether the attempted actions of each agent ence it attributes to the efects it generates. Specifically,
align with their intentions in the MCQA task, we ex- in the Direct Efect Externality component, a
Chain-oftracted the generated text for each agent from the HTML
8All memories, prompts, and relevant information are
systematically stored in HTML files for documentation and analysis. HTML
versions are accessible through the GitHub link.
9When the model determines that there are no direct efects on the
agents, it must assign a coherence rating of 0. This ensures that the
evaluation framework accurately distinguishes between scenarios
where actions produce meaningful consequences and those where
no direct impact occurs.
setting in the scenarios.</p>
        <sec id="sec-2-2-1">
          <title>6.2. Causal-Efect Coherence</title>
          <p>Thought (CoT) is generated based on the event statement
produced by the GM after the attempted action – this
statement serves as a summary of the efects that the
action produces. However, in our evaluation template, we
compare coherence against the initial scenario summary
that we originally provided to the model. This way, we
determine whether, at the end of the cycle, the efect on
the agent remains truly coherent with the given scenario
and the agent’s beliefs, rather than merely aligning with
additional efects generated by the model itself.</p>
          <p>Following this automated evaluation, two diferent
expert annotators checked the assigned ratings to verify
their accuracy and to ensure that the consequences are
meaningfully related to the corresponding actions and
scenarios. Meanwhile, the assessment of ToM capabilities
is derived from the MCQA task.</p>
          <p>In this analysis, we aim to investigate whether the GABM
setting leads to more contextually aligned and
appropriate outputs. We compared the efects generated by the
model in response to agent actions with both the
predeifned scenario and the beliefs assigned to the agents. We
then assessed whether the model itself considers these
efects coherent by assigning a coherence rating on a
scale from 1 to 5. Following this automated evaluation,
we manually reviewed the model’s ratings to assess their
accuracy.</p>
          <p>As illustrated in Figure 5, the model assigns notably
low coherence ratings to efects that it itself generates,
with a maximum average rating of 2.11 on a scale from
1 to 5. The observed discrepancy between the selected
6. Results and Discussion action and the generated consequences highlights the
model’s dificulty in integrating situational context with
6.1. Actions and Theory of Mind utterance interpretation in a coherent manner. The CoT
In the initial phase of our experiment, we aim to utilize reasoning often reflects limited contextual awareness,
GPT-4o-mini to replicate ToM-like abilities while simul- focusing primarily on short-range dependencies rather
taneously assessing its capacity to perform ToM tasks than engaging in the broader reasoning processes
neceswithin a simulated real-life scenario. Our objective is to sary to produce coherent cause-efect relationships. To
determine whether this approach enables an independent better illustrate this contrast, we included the model’s
evaluation of ToM capabilities, separate from the influ- self-evaluation of its outputs and compared these
judgence of linguistic context. Then, we seek to determine ments with those of human annotators. This comparison
whether incorporating explicit representations of agents’ underscores a critical distinction: during generation (i.e.,
beliefs enhances the model’s performance on ToM tasks. in the CoT), the model is required to actively infer and
Additionally, we aim to explore potential diferences in reason about the situational context in order to produce a
the model’s handling of first-order versus second-order logically coherent narrative. However, when evaluating
ToM beliefs. its own output, the model can rely on the full textual</p>
          <p>Figure 4 illustrates the percentage of correctly selected context and potentially draw on patterns and examples
actions for each task and linguistic phenomenon. The present in its training data. Interestingly, in this
evaluaconsistently low accuracy observed across tasks and lin- tive mode, the model’s coherence judgments align more
guistic phenomena indicates that the model struggles to closely with human assessments—likely because the task
select context-appropriate actions, and by extension, to resembles familiar forms of pattern recognition, rather
derive the correct interpretation of utterances through than the more demanding process of causal reasoning
ToM-like reasoning. This finding is particularly notewor- required during generation.
thy when considered within the broader context of recent
ToM-related studies, many of which—especially those 7. Conclusion
focusing on OpenAI models—have suggested a more
optimistic picture of such capabilities [21]. Our objective was to utilize the Generative Agent-Based</p>
          <p>No clear pattern emerges across tasks, nor is there a sig- Model Concordia to reframe ToM tasks and investigate
nificant diference between first-order and second-order whether mentalizing abilities could be isolated from other
belief tasks. This lack of systematic variation suggests confounding variables typically present in
promptingthat the model does not exhibit ToM-like abilities, as its based evaluations. Specifically, we aimed to reproduce a
responses do not consistently reflect any process similar standard False-Belief task within a complex social
simulato belief attribution or true mental inferencing. There- tion. To achieve this, we carefully designed stimuli
involvfore, the GABM is not able to use either first- or ing uncommon social situations to determine whether
second-order beliefs – despite the fact that these modeling a rich situational context and assigning
explichave been explicitly given to it – to interpret the itly to the model first- and second-order beliefs would
speaker’s sentence consistently with the knowledge aid it in making the correct inferences and producing an
action consistent with the knowledge scenario. the False-Belief task remains a widely used and
valu</p>
          <p>The results presented in Section 6.1 underscore a grow- able benchmark for testing ToM, it is also a well-known
ing concern in the Theory of Mind (ToM) research com- paradigm likely to appear in post-training data. This
munity: The challenge of designing tasks that ef- raises legitimate concerns about whether models are
genfectively isolate ToM-like abilities in LLM from uinely reasoning about beliefs or simply learning how
confounding variables. Our findings raise important to solve familiar tasks through exposure. Furthermore,
questions about the mechanisms driving ToM-like per- although the False-Belief task is well-established in
huformance in state-of-the-art LLMs and the true nature man cognitive testing, the conditions under which it
of their so-called emergent abilities. For example, while is administered difer significantly from those we can
8. Limitations
replicate in computational models. While we maintain
that it remains a useful tool for evaluating ToM-like
capabilities, we argue that it should be supplemented This study has several limitations. First, it relies
heavwith additional constraints and more indirect test- ily on the model’s self-evaluation, introducing a risk of
ing methods – such as connecting utterance inter- circular reasoning.
pretation with action selection, as we do in our work Human evaluation was limited to two annotators,
re– rather than relying solely on metalinguistic judgments. stricting claims about inter-annotator reliability.
AddiOur results lend support to the memorization hypothe- tionally, we used pre-existing components of Concordia
sis, suggesting that current LLMs may not truly reason rather than developing tools specifically designed for
about propositional attitudes but instead exploit learned ToM assessment. Our analysis focused solely on
GPTstatistical patterns present in their training data. 4o-mini, limiting generalizability across models. Finally,</p>
          <p>Additionally, the model does not consistently select we evaluated outputs only, without investigating the
incoherent efects in response to actions, indicating that we ternal mechanisms underlying the model’s ToM-related
are still far from developing frameworks that accurately reasoning.
model complex social scenarios. However, employing
these agent-based simulations as evaluation methods rep- References
resents a promising research direction. It is reasonable
to conclude that LLMs remain far from producing fully
aligned and contextually coherent outputs in tasks
requiring deep social reasoning. We conclude that to isolate
“mentalizing” processes, we should rely on more complex
scenarios, focusing on assessing functional ToM rather
than merely literal ToM [34].
1467-8624.00304.
[30] R. W. Gibbs, A new look at literal meaning in
understanding what is said and implicated,
Journal of Pragmatics 34 (2002) 457–486. doi:10.1016/</p>
          <p>S0378-2166(01)00046-7.
[31] R. W. Gibbs, Do people always process the literal
meanings of indirect requests?, Journal of
Experimental Psychology: Learning, Memory, and
Cognition 9 (1983) 524–533.
[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma,</p>
          <p>B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou,
Chain-ofthought prompting elicits reasoning in large
language models, 2023. arXiv:2201.11903.
[33] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li,</p>
          <p>Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang,
W. Gao, L. Ni, J. Guo, A survey on llm-as-a-judge,</p>
          <p>ArXiv (2025). URL: http://arxiv.org/abs/2411.15594.
[34] M. Riemer, Z. Ashktorab, D. Bounefouf, P. Das,</p>
          <p>M. Liu, J. D. Weisz, M. Campbell, Position: Theory Figure 6: API call to LLM reproducing a Multi-Choice
Quesof mind benchmarks are broken for large language tion Answering task
models, 2025. arXiv:2412.19726.
[35] S. Wu, S. Yang, Z. Chen, Q. Su, Rethinking
pragmatics in large language models: Towards within the social context. This involved evaluating the
open-ended evaluation and preference tuning, in: model’s ability to utilize CoT reasoning to produce
meanY. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Pro- ingful event statements and generate coherent efects of
ceedings of the 2024 Conference on Empirical events on agents.</p>
          <p>Methods in Natural Language Processing, As- In the GABM setting, the model must retrieve previous
sociation for Computational Linguistics, Miami, information to determine the correct efect, yet in some
Florida, USA, 2024. URL: https://aclanthology.org/ cases, it appears to rely only on the most recent portion
2024.emnlp-main.1258/. doi:10.18653/v1/2024. of text. This issue is evident in Figure 8, where, despite
emnlp-main.1258. the CoT explicitly containing the player’s belief that the
agent notices his sister, this information is lost during the
prior CoT steps that summarize observations and actions
A. Appendix into an event statement (Figure 7). The event statement
represents a generalized efect of an agent’s action in the
environment and is sent back to agents as an observation.</p>
          <p>A.1. Simulation display It serves as the basis for evaluating whether an action
At the conclusion of the simulation, all relevant informa- has an efect on the agents themselves.
tion is collected within the Game Master (GM), allowing Due to this loss of information, the generated efect can
us to retrieve segments of the Chain of Thought (CoT) sometimes become entirely incoherent with the initial
used by the model to determine both the event statement context. This misalignment is reflected in the model’s
and its efects on the agents. While this framework ofers own coherence ratings, which capture the inconsistency
a range of possibilities for modeling social situations, we between the intended efect and the final output.
specifically chose to replicate simple false-belief tasks Figure 7 presents an example of an event statement
using Concordia to evaluate whether mentalizing pro- generated based on the attempted action of one of the
cesses could be efectively isolated and to assess whether agents. Figure 8 illustrates the subsequent process of
enriching the social context enhances the emergence of determining the efects of the action on the agent,
conToM-like abilities. sidering both the action itself and the event statement.</p>
          <p>To achieve this, we implemented two distinct evalua- For clarity, we chose to highlight two of the most
controtion tasks. First, we employed a Multiple-Choice Ques- versial examples in this discussion.
tion Answering (MCQA) task, in which the model had to
select an agent’s actions based on their desires and beliefs
(Figure 6. Subsequently, we shifted our focus to assessing
the general coherence of the model’s generated actions</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>A.2. Evaluation Details</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>To evaluate the coherence of model-generated text in</title>
        <p>relation to the scenario and the agents’ attempted
actions, we employed the following assessment template.
This template was also used to verify the model’s ratings
and determine their alignment with our own judgments.
Template is based on that created by Wu et al. [35]:
We request your evaluation of the AI
model’s response in relation to the given
scenario. Specifically, consider the
scenario involving two agents and their
beliefs, assessing whether the
modelgenerated efects align coherently with
the agents’ actions and context.</p>
        <p>Evaluate the response based on the
following criteria:
Social Understanding – Does the model
grasp the social dynamics and pragmatic
nuances of the scenario?
Appropriateness – Is the response
contextually relevant and suitable for the
scenario?
Insightfulness – Does the answer
demonstrate a deep understanding of intentions,
implicature, deceit, irony, sarcasm, humor,
metaphor, etc.?
Completeness – How well does the
response capture the essential elements of
the scenario?</p>
      </sec>
      <sec id="sec-2-4">
        <title>Agentivity – Is the model’s response coherent with the agents’ attempted actions?</title>
      </sec>
      <sec id="sec-2-5">
        <title>Scoring: Assign a score from 1 to 5 for</title>
        <p>each category. Compute a final rating
based on these scores. If no efect is
provided, assign 0. Output only a single
numeric value representing the final rating
(1–5).</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase
and reword. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>