<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experimental designs and facets of evidence for computational theory of mind</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joel Michelson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deepayan Sanyal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Ainooson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuan Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maithilee Kunda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vanderbilt University Department of Computer Science</institution>
          ,
          <addr-line>Nashville TN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The competitive feeding paradigm is one of several experimental setups intended to test whether nonverbal subjects possess skills related to Theory of Mind. Competitive feeding focuses on the relationship between seeing and knowing. In this paper, we describe a highly customizable implementation of the competitive feeding paradigm for computational agents in a gridworld environment. We explore various modifications to the setup, including shared rewards, alternate sequences of timed events, and asymmetrical values, that allow us to replicate a wide breadth of tests designed to study the social cognition skills of humans and animals. Finally, we describe how this paradigm can be expanded upon and used as a benchmark test to investigate social reasoning in artificially intelligent models.</p>
      </abstract>
      <kwd-group>
        <kwd>Theory of mind</kwd>
        <kwd>machine learning</kwd>
        <kwd>social cognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One critical element of social cognition research is Theory of Mind (ToM), described originally
by Premack and Woodruff in 1978 as a “system of inferences” regarding the mental states of
others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Specifically, mental states, which are unobservable, may only be inferred to both
exist and relate to observable data. Because of their subjective nature, ToM skills and the
mechanisms that produce them—in humans and other animals—are not thoroughly understood.
Their detection and measurement have been, and remain, the subject of a lengthy ongoing debate.
      </p>
      <p>A well-studied example of potential ToM reasoning in the animal kingdom is that of Western
scrub-jays, which instinctively cache their food to save it for later. They tend to re-cache their food
if they believe their behavior was observed by a competitor, who might try to pilfer the hidden
prize. In doing so, they keep track of which individual witnesses are privy to information about
different cache sites [2]. At first glance, such sophisticated behavior seems to imply that the jays
are capable of inferring other competitors’ mental states. On closer inspection, however, it
becomes apparent that directly observable information, e.g. “Polly’s head was oriented towards
this particular cache in the past”, is sufficient for a successful re-caching strategy, without need
for any mentalization, e.g. “Polly knows there is food here”. Jays also display some degree of
successful transfer between the roles of hiding and seeking: birds that have been thieves are
more likely to re-cache their food when observed by competitors [3]. Could this pattern be
evidence for experience projection? Although the behavior is interesting, this observation also
fails to provide strong evidence for any reasoning about competitors’ internal states, as it can
be explained by ToM-free models [4].</p>
      <p>While most literature on ToM focuses on humans and non-human animals, there exists a
wealth of knowledge to be questioned, tested, and discovered in the realm of artificial intelligence.
Michelson et al. [5] highlight the need for a standardized battery of tests that can be used by many
to evaluate AI models’ theory of mind skills. They describe several criteria and desiderata that
make social cognition benchmark tests amenable to artificial intelligence researchers. Numerous
tests of animal cognition examine ToM and related skills, including the popular Sally Anne test
[6], knower guesser paradigm [7] [8], and competitive feeding paradigm [9]. This
paper covers the design, implementation, and use cases of one such test environment, inspired
by the competitive feeding paradigm, that serves as a foundation for such a test battery. The
specific contributions of this paper include:
<list list-type="bullet">
<list-item><p>A brief overview of the competitive feeding paradigm, a test framework designed to study
whether non-verbal animals understand concepts of seeing and knowing, as well as its
criticisms.</p></list-item>
<list-item><p>A detailed description of the Standoff environment, a gridworld framework for running
social cognition tests on computational agents.<sup>1</sup></p></list-item>
<list-item><p>Descriptions of how various specific modifications of competitive feeding under the
Standoff framework allow for the measurement of a breadth of skills beyond those
captured by competitive feeding.</p></list-item>
</list></p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Povinelli and Vonk [10] point out the failure of existing paradigms for testing social cognition:
these tests generally do not distinguish reasoning about observable behavior from reasoning
about unobservable mental states. Later, Penn and Povinelli provide a formalized definition of
ToM so that its presence in a subject can be more systematically measured and falsified [11].
They describe ToM as the presence of a function, <italic>f</italic><sub>ToM</sub>, which a cognitive agent (the subject)
may use to infer the mental state of another cognitive agent. As <italic>f</italic><sub>ToM</sub> is an inference, its output
must be based solely on the perceptual inputs available to the subject. This definition avoids any
specific interpretations of how <italic>f</italic><sub>ToM</sub> might be implemented or used. Compelling evidence of
<italic>f</italic><sub>ToM</sub> must be in the form of behavior that demonstrates “the necessity of an <italic>f</italic><sub>ToM</sub> in addition to
and distinct from the cognitive work that could have been performed without such a function.”</p>
      <p>The competitive feeding paradigm, which we describe in Section 2.1, is used by Penn and Povinelli
as a case study for its inability to detect ToM [11]. With a few modifications, however, a new
paradigm can be built that satisfies the requirements for proving and falsifying ToM hypotheses.</p>
      <sec id="sec-2-1">
        <title>2.1. The Competitive Feeding Paradigm</title>
        <p>The competitive feeding paradigm is a test setup designed to distinguish whether a non-verbal
subject will change its behavior to account for what it believes a conspecific knows, based on
evidence relating to what the conspecific sees [9]. The subject and one other participant must
have an established social hierarchy, with the subject being ‘subordinate’ to the other ‘dominant’
participant.<fn><label>1</label><p>The Standoff environment, along with instructions for generating the tests described in this paper, can be accessed
at http://github.com/aivaslab/standof</p></fn></p>
        <sec>
          <title>2.1.1. Setup</title>
          <p>The general setup of a competitive feeding test is as follows: The animals are kept in cages on
either side of a central room, the subject’s cage always opposite its one or more conspecifics’
cages. During “baiting” events, large and small food rewards, or treats, are placed or moved in
the central room. Although the placement of the treat is sometimes visible to the dominant,
after one or both baiting events occur, the dominant is no longer able to see the treat. Eventually,
both animals are released. Due to the nature of the social hierarchy, the subordinate will not
challenge the dominant if the two would attempt to reach the same treat. So, if the subject
believes the dominant will look for food in a particular location, we assume the subject will
avoid that location. The subject’s initial challenge, then, is determining where the dominant will
decide to go. Once released, the subject’s orientation or movement towards a treat is recorded.</p>
        </sec>
        <sec id="sec-2-1-1">
          <title>2.1.2. Baiting events</title>
          <p>During the baiting events, the dominant’s door might be partially open, allowing it to see the
baiting, or closed. By closing the dominant’s door at specified times, researchers create scenarios
in which it knows where the food is, it does not know, or it has a false belief about the food’s
presence or location (i.e. it knows where the food is initially, but is then unaware that the food
has been moved).</p>
          <p>By carefully observing what the dominant can and cannot see and then reasoning about
what the dominant knows, the subject might choose to alter its behavior to secure more food
for itself. For example, if the subject believes that the dominant does not know the larger food
pile’s location, the subject might try going there for a greater reward, when it would otherwise
leave the pile to the dominant.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.3. Variants</title>
          <p>Since its first use testing chimpanzees [9], multiple variants of competitive feeding have been
proposed, implemented, and run on various animal species. Hare et al. published a compelling
version of the test in 2001 featuring three experiments: “did”, “who”, and “which”, referring
to the subjects’ beliefs about whether conspecifics witness different baiting events [12]. “Did”
refers to the ability to distinguish whether an opponent did or did not observe an event, “who”
involves understanding who of multiple opponents observed an event, and “which” involves
understanding which of multiple baiting events an opponent observed. That test, and most
following it, compare the subject’s performance across at least four conditions: Informed,
Uninformed, Control Misinformed, and Misinformed. The names of these variants refer to
the dominant’s awareness of baiting events. In the former two setups, one baiting event takes
place, and the dominant is either aware or unaware of the food’s location. In the latter two, the
dominant is aware during one baiting event, but then is either aware or unaware of a second
in which the foods’ locations are swapped. The misinformed and control misinformed cases
can be likened to the Sally Anne test, as the subject is tasked with identifying the presence of a
change-of-location false belief.</p>
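          <p>As a rough illustration, the four conditions can be written as schedules of baiting events, each tagged with whether the dominant's door is open. The sketch below is not taken from the Standoff codebase: the names <monospace>Bait</monospace>, <monospace>CONDITIONS</monospace>, and <monospace>classify</monospace> are hypothetical, and it tracks only the large treat across two buckets.</p>
          <preformat>
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Bait:
    """One baiting event: the large treat ends up in bucket `large_at`."""
    large_at: int
    dominant_can_see: bool  # True when the dominant's door is open


def dominant_belief(events: List[Bait]) -> Optional[int]:
    """Bucket where the dominant last saw the large treat, or None."""
    belief = None
    for event in events:
        if event.dominant_can_see:
            belief = event.large_at
    return belief


def classify(events: List[Bait]) -> str:
    """Compare the dominant's belief with the treat's true final location."""
    belief = dominant_belief(events)
    if belief is None:
        return "no belief"
    return "true belief" if belief == events[-1].large_at else "false belief"


# Hypothetical schedules; the second event swaps the two treats.
CONDITIONS = {
    "informed": [Bait(0, True)],
    "uninformed": [Bait(0, False)],
    "control_misinformed": [Bait(0, True), Bait(1, True)],
    "misinformed": [Bait(0, True), Bait(1, False)],
}

for name, events in CONDITIONS.items():
    print(name, classify(events))
```
          </preformat>
          <p>Under these assumptions, only the misinformed schedule leaves the dominant with a false belief, which is the case the Sally Anne comparison refers to.</p>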
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Criticism</title>
        <p>Because our subject has access to its own mental state, it is of critical importance to falsify
the null hypothesis that it makes use of only its own mental state to determine its behavior.
Behaviors that could be explained as a learned response to superficial perceptual input, e.g. ‘her
eyes being pointed toward the food indicates that I should go somewhere else’, do not suffice.</p>
        <p>In general, ToM allows an agent to behave as though some other portion of the environment
(read: another embodied agent) is expected to behave in accordance with a false belief. To be
convinced that the agent has ToM, its behaviors under all alternate assumptions of truth values
(and beliefs about truth values) must be known and compared. Penn and Povinelli describe
two different alternatives to the competitive feeding paradigm that might aid in making such a
comparison.</p>
        <p>The first, called the opaque visor experiment, is a modification of a task described in [13]. The
opaque visor experiment involves explicit generalization from novel first-person experience to
third-person reasoning: the subject is given time to experiment with multiple visors, the opacity
of which is only apparent at close range, before being evaluated on the visor’s effect
on an experimenter at a distance. Due to its emphasis on few-shot learning, the opaque visor
experiment lies beyond the scope of this paper. The second, which motivates this work, adds a
handful of modifications and variants meant to control for alternate explanations in animals’
‘passing’ behavior to Hare et al.’s competitive feeding paradigm [12].</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Systematic Competitive Feeding</title>
          <p>The improvements Penn and Povinelli suggest for a systematic competitive feeding paradigm
(SCFP) are slightly more complex, but provide much more satisfying answers to questions of
what, exactly, the subjects believe. To allow for satisfactory presumption of agents’ behavior, they
describe a specific training regime featuring steps that must be passed successfully, representing
successful understanding of the test’s fundamental components.</p>
          <p>In Stage 1, subjects are trained in the absence of dominant competitors until they demonstrate
proper goal-seeking behavior. Next, in Stage 2, they are trained to compete with a conspecific
(as in all other competitive feeding tests) for food, and only those who successfully concede
food to dominants are allowed to continue. If our subjects pass the first two stages, we can be
certain that they understand the basics of how their reward can be maximized.</p>
          <p>Finally, several variants are presented as test conditions in Stage 3. In this version, there
are several buckets (food locations), and food is always placed in two of them during the
baiting events. Because the number of buckets is usually greater than 2, the SCFP makes no
cross-experiment distinction between the “did” and “which” cases of Hare et al. [12].</p>
          <p>Instead of the four common test variants described above, the SCFP uses at least eight
scenarios to comprehensively judge the subjects’ understanding: informed control, partially
uninformed, removed informed, removed uninformed, moved, replaced, misinformed, and
swapped. Like the four common competitive feeding variants, these scenarios differ from each
other only by schedules of obscuring, baiting, hiding, and releasing events, performed by the
experimenters. For full descriptions of each scenario, please refer to section 6b of Penn and
Povinelli 2007 [11].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. The Standoff Environment: A Gridworld Platform for Computational Theory of Mind Experiments</title>
      <p>The Standoff Environment is a multiagent gridworld environment implemented as a
partially observable Markov decision process using the PettingZoo API [14]. SuperSuit [15] wrappers
convert the environment’s inputs and outputs into formats that can interface directly with
off-the-shelf reinforcement learning implementations in Stable-Baselines3 [16] and RLlib [17]. Standoff
replicates all SCFP variants as described in [11], and, as we will see, is capable of testing for
ToM skills in a wide variety of settings.</p>
      <sec id="sec-4-1">
        <title>3.1. Agents</title>
        <p>Agents’ views are bird’s-eye representations of their surroundings. These views are either
egocentric, in which the agent’s body always appears in the same relative location, and
orientation is aligned with the agent’s current direction, or allocentric, in which the entire world
is displayed with a uniform coordinate system, but areas outside the agent’s perception are
masked. In both cases, our agents’ bird’s-eye perceptions are notably different from real animals’
first-person views, but we opt for the simpler and possibly easier perspective for the sake of
programmer friendliness. Agents’ action sets include movement of either of two kinds: directed
(forward, backward, rotate left, and rotate right) and cardinal (North, South, East, and West).</p>
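        <p>The two action sets and the masked allocentric view can be sketched in a few lines. This is an illustrative sketch rather than the Standoff implementation: the function names, the heading encoding, the mask symbol, and the choice that a cardinal move also turns the agent are all assumptions.</p>
        <preformat>
```python
# Headings indexed 0..3 as (dx, dy); y grows downward (an assumed convention).
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # north, east, south, west
CARDINAL = {"north": 0, "east": 1, "south": 2, "west": 3}


def step_cardinal(x, y, heading, action):
    """Move one cell in an absolute direction; assume the agent turns to face it."""
    new_heading = CARDINAL[action]
    dx, dy = HEADINGS[new_heading]
    return x + dx, y + dy, new_heading


def step_directed(x, y, heading, action):
    """Move or rotate relative to the agent's current heading."""
    if action == "rotate_left":
        return x, y, (heading - 1) % 4
    if action == "rotate_right":
        return x, y, (heading + 1) % 4
    if action not in ("forward", "backward"):
        raise ValueError("unknown action: " + action)
    dx, dy = HEADINGS[heading]
    sign = 1 if action == "forward" else -1
    return x + sign * dx, y + sign * dy, heading


def allocentric_view(grid, visible, mask="?"):
    """World-aligned view with every cell outside the agent's perception masked."""
    return [
        [cell if seen else mask for cell, seen in zip(row, seen_row)]
        for row, seen_row in zip(grid, visible)
    ]
```
        </preformat>
        <p>An egocentric view would additionally crop and rotate the grid so that the agent sits at a fixed cell facing a fixed direction; that step is omitted here. Wall collisions are also ignored.</p>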
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Puppets</title>
        <p>The Standoff environment supports multi-agent reinforcement learning, but its initial intent is
studying the behavior of a single subject. As a starting point, the subject’s conspecifics, be they
collaborators or competitors, are implemented as hard-coded puppets. These puppets behave
according to simple rulesets applied to their perceptions. Puppets appear identical to any other
agent (subject included) apart from optional visual features that distinguish their values (see
Section 4.6). Puppets have an explicit memory of relevant information that they witness (namely, treat
locations), as well as basic navigation skills. Through this dynamic implementation, changing
the sequence of information presented to a puppet causes predictable changes in its behavior.
Various independent variables and environmental parameters can be edited to create different
experimental conditions, to which puppets respond automatically. Puppets’ behavior can be
otherwise specified by the user to any degree of granularity, and they can even be controlled by
custom artificially intelligent models. Note that the puppets’ hard-coded behavior is intended
as a starting point in the absence of rational actors, though irrational behavior also warrants study.</p>
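        <p>A puppet of this kind could be approximated as follows. This is a minimal sketch under stated assumptions, not the environment's actual puppet code: the class name <monospace>Puppet</monospace>, its methods, and the rule "head for the largest remembered treat" are hypothetical stand-ins for the simple rulesets described above, and navigation is left out.</p>
        <preformat>
```python
class Puppet:
    """Hypothetical rule-based puppet that remembers treats it has witnessed."""

    def __init__(self):
        self.memory = {}  # box index -> treat size last seen there

    def observe(self, box, size, can_see):
        """Update memory only when the event is visible; size 0 means removal."""
        if not can_see:
            return
        if size == 0:
            self.memory.pop(box, None)
        else:
            self.memory[box] = size

    def choose_goal(self):
        """Head for the largest remembered treat, or None if nothing is known."""
        if not self.memory:
            return None
        return max(self.memory, key=self.memory.get)


puppet = Puppet()
puppet.observe(box=0, size=3, can_see=True)   # sees the large treat placed
puppet.observe(box=2, size=1, can_see=True)   # sees the small treat placed
puppet.observe(box=0, size=0, can_see=False)  # misses the large treat leaving box 0
puppet.observe(box=4, size=3, can_see=False)  # misses it arriving in box 4
print(puppet.choose_goal())
```
        </preformat>
        <p>Because the last two events were hidden, this puppet still heads for box 0. Reordering or unhiding events changes its goal without any change to its rules.</p>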
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Tutorial Stages</title>
        <p>All tests of social cognition are based on a number of assumptions about their subjects’ goals,
knowledge, and abilities. Animal subjects’ preferences for food are well-understood, and
fundamental knowledge—like that doors open and close, or how to navigate simple environments—can
generally be assumed without question or otherwise taught with repeated exposure.</p>
        <p>The Standoff environment makes use of numerous ‘tiles’ with various behaviors and
affordances that, at evaluation time, the subjects are assumed to understand. Curtains and boxes
conceal their contents, treats grant rewards, gates (both transparent and opaque) open and close
without warning, and other agents move about of their own volition. These ‘commonsense’
facts (along with many others) are established in the environment’s provided tutorial stages,
which expose a subject to various hardcoded and randomized settings so that it can explore the
rules of the world, which imitate those of the other Standoff conditions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Competitive Feeding</title>
        <p>As a starting point, we shall introduce the Standoff implementation of competitive feeding,
including all the systematic variants proposed by Penn and Povinelli [11]. In our computational
version of the test paradigm, we closely imitate many aspects of the competitive feeding design:
walls are opaque (occluded areas are masked by a special shadow color), gates may be opened
or closed (and opaque or transparent), and treats are baited according to the same schedules.</p>
        <p>Because the environment is a gridworld, many of these details are abstracted to a large degree.
Treats are objects that provide reward to reinforcement learning agents (and often terminate the
episode) when reached. The rewards granted by treats are dynamically determined following
certain rules to ensure predictable optimal behavior. For example, if there are <italic>n</italic> boxes, the ratio
of (positive) rewards between the larger and smaller treat must be greater than <italic>n</italic> : 1; otherwise,
strategies like “always approach the smaller treat” become valid strategies for achieving maximal
total reward under undesirable circumstances. Likewise, a small negative reward placed on
empty boxes reduces the expected value of random guesses. Treats may be ‘hidden’ in boxes,
which obscure vision of the treat from both the dominant and (conditionally) the subordinate.
Note that in the original competitive feeding paradigm design, the subordinate is always able to
observe the treats’ locations.</p>
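        <p>The reward constraints can be made concrete with a small numeric sketch. The values and function names below are illustrative assumptions, not the rewards used by the environment; the sketch also assumes exactly two baited boxes and no competitor.</p>
        <preformat>
```python
def make_rewards(n_boxes, small=1.0, margin=0.5, empty_penalty=-0.1):
    """Pick rewards so that large / small exceeds n_boxes (all values illustrative)."""
    large = n_boxes * small + margin
    return {"large": large, "small": small, "empty": empty_penalty}


def random_guess_value(rewards, n_boxes):
    """Expected reward of opening one box uniformly at random, with two treats baited."""
    n_empty = n_boxes - 2
    total = rewards["large"] + rewards["small"] + n_empty * rewards["empty"]
    return total / n_boxes


n = 5
with_penalty = make_rewards(n)
without_penalty = make_rewards(n, empty_penalty=0.0)

# The large-to-small ratio exceeds the number of boxes, as the rule requires.
assert with_penalty["large"] / with_penalty["small"] > n

print(random_guess_value(with_penalty, n), random_guess_value(without_penalty, n))
```
        </preformat>
        <p>With five boxes, the penalty on empty boxes lowers the expected value of a uniform random guess relative to the same scheme without a penalty.</p>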
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Independent Variables</title>
      <p>By modifying various small sets of environmental parameters, we may computationally imitate
many other social cognition experiments that have been performed on animals. Models’ skills
may be tested with a variety of lenses to gain insight into their fundamental capabilities and
weaknesses.</p>
      <p>All of the following variables can (and should) be investigated for transfer learning in a
standardized manner. Then, models of ToM may be evaluated for their generalization capacity
along various notable axes. As with scrub-jays, does experience with one role help an agent
understand another? The test setups may be studied as well to find the extent to which success
at a specific set of tests tends to predict other abilities.</p>
      <sec id="sec-5-1">
        <title>4.1. Agent priority</title>
        <p>In the standard competitive feeding tests, the subject is subordinate to its opponent, i.e. all else
equal, the subject is at a competitive disadvantage. In the Standoff environment, this effect is
achieved and signaled to the agents via treat locations; they are slightly closer to the dominant
agent. On its own, allowing the subject to take on the role of the dominant invokes a trivially
easy task, having an identical solution to Stage 1 of SCFP. In conjunction with other changes
that we shall discuss below—especially visible decisions (see 4.3)—a dominant subject proves
quite useful, as its decision can alter the subordinate puppet’s behavior.</p>
        <p>A transfer learning experiment of differing agent priorities is similar to the role-reversal
experiment in [18]. Their experiment is of a collaborative nature, so note that the same
experiment could be run under different conditions for anticipation valence (see 4.2) and reward
sharing.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Anticipation valence</title>
        <p>In the competitive feeding tests, the subject is expected to look for a treat in an area where
it believes the dominant will not visit. Leslie and Polizzi find a significant difference between
positive and negative desires, that is, looking where something is versus is not located, in
the context of Sally Anne tests in human children [19]. A minor change in the rules of the
Standoff environment inverts the negative valence in the competitive feeding paradigm: when
the dominant reaches a treat, the treat shall remain and its value for the subordinate shall be
increased to be maximal. Now, the task presented to the subject is arguably simpler: infer the
dominant’s goal, and adopt that goal as your own. There is no longer a need for extraneous
decision-making regarding selecting the best goal alternative once the dominant’s decision
has been identified. Many other social cognition experiments, including most that involve
collaboration, make use of positive anticipation.</p>
        <p>We signal valence using treats’ color for RGB inputs, and treats’ identity for rich inputs. If
the subject has the dominant priority (see 4.1), positive valence is achieved via reward sharing,
that is, the subject is rewarded for a subordinate puppet’s successful completion of the task.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Decision visibility</title>
        <p>While evaluating all competitive feeding tests, it is of critical importance that the dominant
(and, if the dominant has ToM, the subordinate) be given privacy while it decides which route
to take. Otherwise, one agent could use the behavior of the other to inform its decision—a
strategy that is clearly relevant to social cognition but interferes with our tests for attribution
of already-established beliefs.</p>
        <p>Allowing subordinate agents to make decisions while informed of other agents’ decisions
opens the possibility of testing imitation and emulation. By allowing the subordinate to view
the dominant’s decision before the decision is finalized, we can study the subordinate’s ability
to imitate (or avoid imitating, in the negative anticipation valence case). When the subordinate
and dominant have differences in their abilities (be they perceptual, mental, or action-related), imitation
may be directly compared with emulation. For example, a subject (occupying an empty room)
emulating a teacher (slowed by clutter) could navigate the room more efficiently than the
teacher, as opposed to an imitating subject who would inefficiently copy the teacher’s behavior.</p>
        <p>When a dominant subject’s decisions are visible, it might behave in a manner that strategically
influences the subordinate puppet’s decision. In the shared reward, positive anticipation version
of this test, the subject’s goal is to lead its conspecific to the treat. This altruistic variant,
especially in conjunction with multiple value alignments, roughly evokes the Yummy-Yucky
test described by [20], in which a subject is tasked with using knowledge of preferences to assist
an experimenter.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Population size</title>
        <p>An agent might solve the SCFP as defined by labeling events as ‘seen by opponent’ or ‘unseen
by opponent’. In this case, although an opponent’s perception must be correctly inferred, it
is unclear whether an <italic>f</italic><sub>ToM</sub> compartmentalizes the knowledge of a single opponent. In other
words, we might pass all SCFP tests while operating under the assumption that all embodied
opponents have a shared mental state. Note that this assumption could be correct in cases
where opponents communicate with each other. In order to rule out this hypothesis, we must
test for the “who” ability.</p>
        <p>By increasing the population of puppets (each having individual vision-obscuring events), the
subject may only find success by keeping track of who sees each baiting event. To accomplish
this effect, multiple puppets are initialized, each in a separate starting room. Any number
of puppets might be informed during the baiting events. During the release event, only one
of the puppets is able to leave its cage. To pass these scenarios, the subject must determine
whether or not the released puppet specifically was made privy to the pertinent information. In
scenarios with more than one baiting event, the “informed” agent may or may not be informed
of the irrelevant event(s). In conjunction with positive anticipation and visible decisions, the
Standoff task becomes similar to the knower guesser paradigm [7], another popular test of
social reasoning in animals.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. Obscuring source</title>
        <p>In the competitive feeding paradigm (real-life and Standoff), participants’ vision is obscured
using opaque doors that occlude baiting events. These doors may be replaced by any
existing object that has been established to be opaque (or not) during the agent’s training.
Numerous other methods may be devised for causing (and signaling) unawareness. Gaze, for one,
has been extensively studied in humans and animals. By instructing puppets (with directional
vision) to face away from the food during baitings, we can evoke a rudimentary replication of
experiments involving gaze-originated unawareness.</p>
      </sec>
      <sec id="sec-5-6">
        <title>4.6. Value alignment</title>
        <p>A core assumption of previously described experiments is that all agents value treats similarly,
yet a fundamental ToM skill involves empathizing with individuals with different preferences.
We provide two alternative sets of preferences inspired by Leslie and Polizzi 1998 [19]: A
negative-value agent prefers smaller treats to larger ones, and an avoidant agent prefers to
search boxes that contain no treats at all. Like anticipation valence, value alignment is signaled
by agents with alternate color or numeric identity schemes.</p>
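        <p>The three preference sets can be summarized as a choice rule over box contents. This is a hedged sketch, assuming full knowledge of the contents and hypothetical labels (<monospace>standard</monospace>, <monospace>negative_value</monospace>, <monospace>avoidant</monospace>); in the environment, a puppet would apply such a rule to its remembered treat locations instead.</p>
        <preformat>
```python
def preferred_box(contents, preference="standard"):
    """Return the index of the box an agent would pick, or None if none qualifies.

    contents: treat size per box, with 0 meaning the box is empty.
    """
    boxes = range(len(contents))
    baited = [i for i in boxes if contents[i] > 0]
    empty = [i for i in boxes if contents[i] == 0]
    if preference == "standard":
        # Prefers the larger treat.
        return max(baited, key=lambda i: contents[i]) if baited else None
    if preference == "negative_value":
        # Prefers the smaller treat to the larger one.
        return min(baited, key=lambda i: contents[i]) if baited else None
    if preference == "avoidant":
        # Prefers a box that holds no treat at all (first such box here).
        return empty[0] if empty else None
    raise ValueError("unknown preference: " + preference)


contents = [0, 3, 1, 0]
for kind in ("standard", "negative_value", "avoidant"):
    print(kind, preferred_box(contents, kind))
```
        </preformat>
        <p>A subject facing such a puppet has to predict a different destination for identical baiting histories, depending only on the puppet's signaled preference type.</p>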
      </sec>
      <sec id="sec-5-7">
        <title>4.7. Scenario complexity</title>
        <p>Just as we would like to investigate our subjects’ ability to compartmentalize their <italic>f</italic><sub>ToM</sub>
functions to multiple different embodied agents (or distinguish between multiple inferred
mental states), we might also test the complexity of <italic>f</italic><sub>ToM</sub> itself. Under what conditions, and to
what extent, is it able to represent and distinguish between multiple goal states? In the multiple
desires test [21], children are tested to study their comprehension of three different aspects of
multiple desires. We can imitate this test by releasing the subject only after the puppet reaches
its first goal, giving the puppet a chance to also reach a second goal before the subject.</p>
        <p>Memory robustness is a closely related, fundamental skill for successful attribution. As
the complexity of the environment increases, an agent needs more robust memory to
succeed. The environment’s scale or the number of potential treat locations can be trivially
increased to achieve this effect. We may also increase the amount of time between baiting and
releasing, as well as the number of relevant and irrelevant events, to stymie our agents’ efforts
to retain relevant information.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Future work</title>
      <p>Although the Standoff environment can be used to systematically investigate a wide variety
of skills, there are many aspects of social cognition that lie beyond its grasp. As mentioned
previously, this environment is one of a set called for in [5]. Much additional work remains to
be done in building models that solve our social reasoning tests.</p>
      <sec id="sec-6-1">
        <title>5.1. Independent variables not covered by Standoff</title>
        <p>Further environments, likely with different fundamental setups, will be required to replicate the
design of social reasoning tests from the comparative cognition and developmental psychology
literature.</p>
        <p>Several classes of social cognition tests are not easily represented in the Standoff
environment. One notable example is the goggles test (see the opaque visor test described in 2.2), which
demonstrates projection from first-person experience [11][22]. An environment capable of
replicating this task would need to support both first-person viewpoints and memory sustained
across repeated sessions to allow for testing one- and few-shot learning.</p>
        <p>With significant modification, we hope to eventually cover a diverse set of tests which
differentiate imitation and emulation. Just as the competitive feeding paradigm implementations
make use of multiple vision-obscuring sources, tests of emulation include several sources of
inefficient or unexpected conspecific behavior. These include irrationality or temporary inability
[23] [24], accidents [25], and even moral transgressions [26].</p>
        <p>Of particular note are tests involving deception beyond that which is allowed in ‘decision
viewing’ scenarios. Despite having the label of deception, these tests involve hiding and
communicating treats’ locations in both collaborative and competitive settings. The box-locking
task, for example, asks its participant to aid or thwart a puppet by misinforming them or by
physically preventing them from reaching their goal [27]. Other tasks involving deceptive
behaviors tend to require repeated sessions, including penny hiding [28] and, as mentioned in
Section 1, hiding belongings from onlooking competitors.</p>
        <p>Similarly, most of the inference that the Standoff environment tests for is deductive in
nature, although it is theoretically possible to test for abductive ToM reasoning. An accurate
model of another agent’s mental state should not only answer questions of what the agent will
do, but should also answer questions of how and why the agent displayed existing behavior. In
the Standoff environment, how and why are generally answered by visible attributes of the
environment, e.g. the opponent pursued the smaller goal because its body is colored blue, and
experience therefore dictates that it must have a negative-value nature. Abductive reasoning of
this kind will likely prove necessary for successful one- and few-shot learning in ToM
scenarios, a powerful but difficult skill to master.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. From generating baselines to solving the ToM riddle</title>
        <p>The overall difficulty of the various Standoff tasks is an important question whose answer lies
beyond the scope of this paper. Several environmental parameters are included for practical
ease of implementation. For example, allowing for allocentric perception and cardinal movement
actions might help with agents’ spatial memory, which is a complex skill in its own right.</p>
        <p>Many researchers have already made substantial headway towards artificial ToM, including
those with their own versions of the social cognition tests mentioned above. For example, Rabinowitz
et al. test their models on a gridworld implementation of the Sally-Anne test [29]. A wide
variety of models and strategies have been employed, including deep reinforcement learning,
Bayesian inference [30], and cognitive models [31]. A review of algorithms designed for ToM
reasoning can be found in Hernandez-Leal et al. [32].</p>
        <p>Competitive feeding subjects might lack proper understanding of their rivals’ mental states,
but we, as scientists, must empathize with their struggle. We, too, have a long journey ahead
of us as we attempt to overcome our own lack of understanding, not just about mental states,
but about how mental states are understood. By continuing along this path of tests, with
foundations in comparative literature, we hope to help uncover the mysteries that allow us to
understand.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Neurodiversity Inspired Science and Engineering (NISE)
NSF program grant DGE 19-22697 (K. Stassun, PI). We also extend thanks to our anonymous
reviewers for their helpful criticisms, comments, and suggestions.</p>
      <p>
[2] N. S. Clayton, J. M. Dally, N. J. Emery, Social cognition by food-caching corvids. the western
scrub-jay as a natural psychologist, Philosophical Transactions of the Royal Society B:
Biological Sciences 362 (2007) 507–522.
[3] N. J. Emery, N. S. Clayton, Effects of experience and social context on prospective caching
strategies by scrub jays, Nature 414 (2001) 443–446.
[4] E. Van der Vaart, R. Verbrugge, C. K. Hemelrijk, Corvid re-caching without ‘theory of
mind’: A model, PloS one 7 (2012) e32904.
[5] J. Michelson, D. Sanyal, J. Ainooson, Y. Yang, M. Kunda, Social cognition paradigms ex
machinas (2021).
[6] H. Wimmer, J. Perner, Beliefs about beliefs: Representation and constraining function
of wrong beliefs in young children’s understanding of deception, Cognition 13 (1983)
103–128.
[7] M. A. Udell, N. R. Dorey, C. D. Wynne, Can your dog read your mind? understanding the
causes of canine perspective taking, Learning &amp; Behavior 39 (2011) 289–302.
[8] T. Bugnyar, Knower–guesser differentiation in ravens: others’ viewpoints matter,
Proceedings of the Royal Society B: Biological Sciences 278 (2011) 634–640.
[9] B. Hare, J. Call, B. Agnetta, M. Tomasello, Chimpanzees know what conspecifics do and
do not see, Animal Behaviour 59 (2000) 771–785.
[10] D. J. Povinelli, J. Vonk, We don’t need a microscope to explore the chimpanzee’s mind,
Mind &amp; Language 19 (2004) 1–28.
[11] D. C. Penn, D. J. Povinelli, On the lack of evidence that non-human animals possess
anything remotely resembling a ‘theory of mind’, Philosophical Transactions of the Royal
Society B: Biological Sciences 362 (2007) 731–744.
[12] B. Hare, J. Call, M. Tomasello, Do chimpanzees know what conspecifics know?, Animal
behaviour 61 (2001) 139–151.
[13] C. M. Heyes, Theory of mind in nonhuman primates, Behavioral and brain sciences 21
(1998) 101–114.
[14] J. K. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. Santos, C.
Dieffendahl, C. Horsch, R. Perez-Vicente, et al., Pettingzoo: Gym for multi-agent reinforcement
learning, Advances in Neural Information Processing Systems 34 (2021).
[15] J. K. Terry, B. Black, A. Hari, Supersuit: Simple microwrappers for reinforcement learning
environments, arXiv preprint arXiv:2008.08932 (2020).
[16] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3:
Reliable reinforcement learning implementations, Journal of Machine Learning Research
22 (2021) 1–8. URL: http://jmlr.org/papers/v22/20-1364.html.
[17] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan,
I. Stoica, Rllib: Abstractions for distributed reinforcement learning, in: International
Conference on Machine Learning, PMLR, 2018, pp. 3053–3062.
[18] D. J. Povinelli, K. A. Parks, M. A. Novak, Role reversal by rhesus monkeys, but no evidence
of empathy, Animal Behaviour 44 (1992) 269–281.
[19] A. M. Leslie, P. Polizzi, Inhibitory processing in the false belief task: Two conjectures,
Developmental science 1 (1998) 247–253.
[20] B. M. Repacholi, A. Gopnik, Early reasoning about desires: evidence from 14- and
18-month-olds., Developmental psychology 33 (1997) 12.
[21] M. Bennett, L. Galpert, Children’s understanding of multiple desires, International Journal
of Behavioral Development 16 (1993) 15–33.
[22] K. Karg, M. Schmelz, J. Call, M. Tomasello, The goggles experiment: Can chimpanzees use
self-experience to infer what a competitor can see?, Animal Behaviour 105 (2015) 211–221.
[23] A. N. Meltzoff, Infant imitation after a 1-week delay: long-term memory for novel acts
and multiple stimuli., Developmental psychology 24 (1988) 470.
[24] G. Gergely, H. Bekkering, I. Király, Rational imitation in preverbal infants, Nature 415
(2002) 755–755.
[25] J. Call, M. Tomasello, Distinguishing intentional from accidental actions in orangutans
(pongo pygmaeus), chimpanzees (pan troglodytes) and human children (homo sapiens).,
Journal of Comparative Psychology 112 (1998) 192.
[26] M. Killen, K. L. Mulvey, C. Richardson, N. Jampol, A. Woodward, The accidental
transgressor: Morally-relevant theory of mind, Cognition 119 (2011) 197–215.
[27] B. Sodian, U. Frith, Deception and sabotage in autistic, retarded and normal children,
Journal of child psychology and psychiatry 33 (1992) 591–605.
[28] G. Gratch, Response alternation in children: A developmental study of orientations to
uncertainty, Vita humana (1964) 49–60.
[29] N. Rabinowitz, F. Perbet, F. Song, C. Zhang, S. A. Eslami, M. Botvinick, Machine theory of
mind, in: International conference on machine learning, PMLR, 2018, pp. 4218–4227.
[30] C. L. Baker, J. Jara-Ettinger, R. Saxe, J. B. Tenenbaum, Rational quantitative attribution of
beliefs, desires and percepts in human mentalizing, Nature Human Behaviour 1 (2017)
1–10.
[31] T. N. Nguyen, C. Gonzalez, Theory of mind from observation in cognitive models and
humans, Topics in Cognitive Science (2021).
[32] P. Hernandez-Leal, B. Kartal, M. E. Taylor, A survey and critique of multiagent deep
reinforcement learning, Autonomous Agents and Multi-Agent Systems 33 (2019) 750–797.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Premack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Woodruff</surname>
          </string-name>
          ,
          <article-title>Does the chimpanzee have a theory of mind?</article-title>
          ,
          <source>Behavioral and brain sciences</source>
          <volume>1</volume>
          (
          <year>1978</year>
          )
          <fpage>515</fpage>
          -
          <lpage>526</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>