CEUR-WS Vol-3169, paper 2 — https://ceur-ws.org/Vol-3169/paper2.pdf
Evaluating Object Permanence in Embodied Agents using the Animal-AI Environment

Konstantinos Voudouris 1,2, Niall Donnelly 3, Danaja Rutar 1, Ryan Burnell 1, John Burden 1, José Hernández-Orallo 1,4 and Lucy G. Cheke 1,2

1 Leverhulme Centre for the Future of Intelligence, Cambridge, UK
2 Department of Psychology, University of Cambridge, UK
3 The College of Engineering, Mathematics, and Physical Sciences, University of Exeter, UK
4 VRAIN, Universitat Politècnica de València, Spain


Abstract

Object permanence, the understanding and belief that objects continue to exist even when they are not directly observable, is important for any agent interacting with the world. Psychologists have been studying object permanence in animals for at least 50 years, and in humans for almost 50 more. In this paper, we apply methodologies from psychology and cognitive science to present a novel testbed for evaluating whether artificial agents have object permanence. Built in the Animal-AI environment, Object-Permanence In Animal-Ai: GEneralisable Test Suites (O-PIAAGETS) improves on other benchmarks for assessing object permanence in terms of both size and validity. We discuss the layout of O-PIAAGETS and how it can be used to robustly evaluate object permanence in embodied agents.

Keywords
Object Permanence, AI Evaluation, Embodied Agents, Animal-AI Environment, Developmental Psychology, Comparative Cognition



EBeM'22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria
kv301@cam.ac.uk (K. Voudouris); mail.niall.donnelly@gmail.com (N. Donnelly); dr571@cam.ac.uk (D. Rutar); rb967@cam.ac.uk (R. Burnell); jjb205@cam.ac.uk (J. Burden); jorallo@upv.es (J. Hernández-Orallo); lgc23@cam.ac.uk (L. G. Cheke)
ORCID: 0000-0001-8453-3557 (K. Voudouris)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Object Permanence (OP) is the understanding and belief that objects continue to exist even when they are not directly observable. In behavioural terms, an agent has OP when it behaves as though objects continue to exist when it cannot see them. Human adults use OP to reason about how objects behave and interact in the external world. Jean Piaget, credited as the first to empirically investigate this capability, observed how infants develop the tendency to search for objects that have become occluded [1]. Piaget's insights have been extended considerably by developmental and comparative psychologists, usually in the visual modality [2, 3, 4], although OP is an amodal phenomenon [5].

Humans and some animals appear to understand that objects continue to exist independently of them, with the same properties. However, when an object reappears, what makes us reidentify it as the same object as before? Object reidentification has been studied in visual cognition research with adults [6, 7, 8] and primates [9], and in developmental psychology with infants [10, 11, 12]. The relation between object reidentification and OP is manifest: when an object passes out of view, we believe that it continues to exist; when it passes back into view, we use knowledge about objects to determine whether this is the same object we saw previously. Here, we use OP to mean both classical OP and object reidentification.

OP has proven difficult to build into AI systems. Deep Reinforcement Learning systems perform significantly worse than human children when solving problems involving OP [13], and tracking objects under partial occlusion appears to be difficult for modern computer vision methods [14]. Robust OP is important for creating trustworthy embodied AI, such as self-driving cars, and robust object tracking under occlusion would have many applications in robotics. However, current methods for evaluating whether an agent has OP suffer from a lack of precision, reliability, and validity. Developmental and comparative psychologists have been investigating OP in biological agents for around a century, developing many experimental paradigms along the way. Until now, AI research has not applied these methods to AI evaluation [15]. In this paper, we outline a new test battery, built in the Animal-AI Environment [16], for evaluating whether embodied artificial agents have OP: the Object-Permanence in Animal-Ai: GEneralisable Test Suites (O-PIAAGETS). O-PIAAGETS is a novel attempt to use experiments designed for investigating whether biological agents have OP for AI evaluation. First, we examine why OP is a challenge for AI research. Second, we critically review existing OP testbeds. Third,
we outline the structure of the test battery and how it can be used to robustly investigate whether agents have OP. Finally, we discuss how O-PIAAGETS can be used for evaluation and how it improves on existing testbeds in the field.

2. Background and Motivations

2.1. The Logical Problem of OP

OP may appear to be a trivial capacity for an agent to have: the agent must simply understand that objects continue to exist when they are not directly observable. Indeed, Renée Baillargeon and colleagues [17] hypothesise that children are born with a Principle of Persistence, which states exactly this [18, 19]. Why, then, can't we endow AI systems with such a principle, bias, or heuristic? Can't we simply tell an agent that objects continue existing when they are occluded? Fields [20, 19] has discussed how the notion of a Principle of Persistence is untenable, due to the Frame Problem (FP).

The FP implies that endowing an agent, biological or artificial, with a Principle of Persistence is not trivial. It cannot be overcome with a representation as simple as objects continue to exist even when they aren't observable. In its raw form, the FP demonstrates that when logically describing the effects of particular actions on objects in a domain, we must also describe ad nauseam all the non-effects of those actions on those objects. As Fields [19] says, it amounts to having to describe everything that doesn't change in the universe as a result of turning off the fridge (p. 443). In a domain where objects have properties that can change over time, as in all real-world scenarios, the FP implies that we can't simply say that objects stay the same over time without describing which properties remain unchanged, and when [21].

When an agent can observe everything in a domain, and can re-update what has and has not changed at every timestep, the FP rarely raises any issues. However, when objects become occluded, it becomes important to track which properties of those objects do and do not change, and when, in order to identify other objects as identical or different. For example, imagine a lion watching a small antelope pass behind some bushes and then seeing a large antelope emerge at the other side. It is useful to know that antelope do not change size over such time periods, and therefore that the smaller antelope continues to exist because of the persistence of its size (and other) properties. It is also useful to know that the antelope does not change when the lion changes its perspective, or occludes the antelope through its own actions, an analogue of the Simultaneous Localisation and Mapping (SLAM) problem in robotics [22]. Overcoming the FP requires either sophisticated deductive techniques [21] or robust inductive and abductive learning heuristics and biases [20, 19, 8]. It is therefore not as simple as imputing a Principle of Persistence to build AI systems with OP.

2.2. Existing Evaluation Methods for OP in AI

AI researchers, particularly those working on computer vision, embodied agents, and robotics, are interested in building AI systems capable of robustly reasoning about visual scenes, in a similar way to how humans and animals do. Researchers have built several evaluation frameworks for assessing whether embodied artificial agents and computer vision systems have OP.

Lampinen et al. [23] built OP tasks in a 3D Unity environment. Here, the agent was fixed in place as it watched three boxes. Periodically, objects would leap out of the three boxes, simultaneously or sequentially, with or without a refractory time lag. The agent would then be turned away from the boxes, released, and asked to go to the box containing a particular object. If it chose the correct box, it was rewarded, similar to tasks used with human infants [24] and non-human primates [4]. Crosby et al. [16] developed a series of 90 OP tests as part of the Animal-AI Testbed and Olympics, inspired by and directly developed from paradigms in developmental and comparative psychology. Some work has been done comparing embodied deep reinforcement learning agents to humans on these tasks: children aged 6-10, with limited training, significantly outperformed Deep Reinforcement Learning systems on the OP tasks in the Animal-AI Testbed [13], indicating there is room to improve these systems before they reach human-level performance. Leibo et al. [25] developed Psychlab for probing psychophysical phenomena in Deep Reinforcement Learning systems using cognitive science methods and qualitatively comparing performance with human participants, but they did not investigate OP.

Having OP is applicable not only to embodied agents, but also to passive computer vision systems engaged in object tracking. The Localisation Annotations Compositional Actions and TEmporal Reasoning (LA-CATER) dataset [26] is prominent in computer vision research. LA-CATER contains 14,000 video scenes in which objects can move in three dimensions, contain, and carry each other. Several tasks in this dataset happen to behave similarly to OP experiments used in psychology. For example, one task involves an object being occluded by one of three identical 'cups'; once occluded, the cups are moved relative to each other. This bears resemblance to the cup tasks used in the Primate Cognition Test Battery [4] (see Figure 3) or in the Užgiris and Hunt [24] test battery for infants. Other benchmark datasets include ParallelDomain (PD) and KITTI [27]. PD is a synthetic dataset designed to test occlusions in driving scenarios.
It contains 210 photo-realistic driving scenarios in city environments, rendered from 3 camera angles, creating a dataset of 630 videos. KITTI [28] has 21 labelled videos of real-world city scenes, in which cars, pedestrians, and other objects pass behind each other and become partially or fully occluded; these labelled videos are a small fraction of the total KITTI dataset [27].

Piloto et al. [29] directly applied a measurement framework innovated in developmental psychology to probe physics knowledge in artificial systems, including OP. Violation of Expectation has been used by the neo-Piagetian school of developmental psychology [3] to investigate infants' knowledge about the world by determining when they are surprised to see something that violates their expectations. For example, infants at about 4.5 months tend to show surprise (by looking longer) if an object appears to change size whilst occluded [10, 11]. Piloto et al. procedurally generated 28 3-second videos that emulated a small subset of these studies, and used Kullback-Leibler divergence as the AI equivalent of looking time. They demonstrated the utility of this technique for probing physical knowledge in computer vision systems.

In both computer vision and embodied AI, several methods for detecting when agents have OP have been proposed. However, with the exception of the work of Piloto et al. [29] and Lampinen et al. [23] at DeepMind, and the Animal-AI Testbed and Olympics, little attention has been paid to systematically applying the methodologies of psychology to understand and evaluate OP in artificial agents.

2.3. Problems with Current Evaluation Frameworks for OP

Two main problems exist with the current methods for evaluating whether an AI has OP. The first is that most of these benchmarks and testbeds use independent-and-identically-distributed (i.i.d.) test data, meaning testing data is drawn from the same distribution as the training data; this especially applies to LA-CATER, PD, and KITTI. The second is a lack of internal validity: sufficient controls to eliminate alternative explanations for certain behaviours are often lacking.

The problem with i.i.d. testing data is that it is in principle impossible to distinguish between an agent that has OP and one using problem-irrelevant shortcuts to maximise reward, appearing as if it has OP. This means that even if we had an agent that genuinely had OP, our evaluation methods would limit how certain we could be of that. Geirhos et al. [30] argue that an effective measure against this is to test AIs on out-of-distribution (o.o.d.) test data, where training data and test data are drawn from different (but meaningfully related) distributions. This is related to the notion of transfer tasks in developmental and comparative psychology. The move from i.i.d. to o.o.d. testing is still not mainstream, but is gaining prominence [31, 32, 33]. LA-CATER and the procedurally generated test sets mentioned earlier were generated according to a series of rules, with training, validation, and test sets divided arbitrarily. The PD and KITTI datasets were generated and collected non-procedurally, but again, the distinction between training and test sets is often arbitrary [27].

Moving from i.i.d. to o.o.d. test data promotes robustness in AI systems. Developing a testbed for OP in which training and test data are kept distinct means that we can be more certain that AI systems that perform successfully have OP, rather than overfitting to the data distribution. This means we can evaluate whether an AI has an ability corresponding to OP, rather than a propensity for solving some distribution of tasks that require it¹ [35, 36].

O.o.d. testing gives researchers grounds to say they are testing for the presence of abilities. However, selecting a test distribution must be guided by some principle that tells us why the training and test distributions are meaningfully related. This takes us to the second problem for OP evaluation in AI: that testing lacks internal validity. Developmental and comparative psychologists have developed numerous experimental designs to test for the presence of cognitive abilities in biological agents, introducing numerous controls to eliminate alternative explanations. As a point of reference, take the classic A-not-B paradigm for testing OP. Participants are presented with an object of interest that is hidden for several trials at location A. In the AI context, this amounts to a training distribution around location A (with variance corresponding to minor differences between trials). If we only test the participant on finding the object of interest at A, true OP understanding is conflated as an explanation with alternatives such as memorising a spatial location or returning to a previously rewarding location, as infants under 9 months, and many animals, do [1, 37]. To eliminate (some of) these explanations, in the test condition, participants are faced with an object hidden at location B. The testing distribution now includes objects hidden at B, and the relation between the two distributions is meaningful in the context of OP, because an agent needs OP to solve the task. The logic is that one would only perform well on both training (A-only) and testing (B-only) if one had OP. Of course, there are further alternative explanations for correct search at locations A and B, such as simply searching where the experimenter's hand has just been [24]. So internal validity tends to increase with the diversity of training and test data, as the different tasks become mutually controlling.

Psychologically-inspired testbeds for evaluating OP in AI systems, such as Piloto et al. [29], Lampinen et al. [23], and Crosby et al. [16], remain small, and so internal validity remains relatively low. The confluence of low internal validity in some testbeds and the lack of o.o.d. testing means that even if an AI system genuinely has OP, our evaluation frameworks and metrics are not internally valid enough to show this. In this paper, we propose a novel large testbed for conducting o.o.d. testing with high internal validity.

¹ For example, an anonymous reviewer pointed out that DeepMind's FTW agent [34] arguably has object permanence, since it can successfully fight players who duck for cover in a 3D capture-the-flag game. While this is certainly evidence for OP in an artificial agent, it remains speculative for now, since FTW has not yet been tested on an internally valid, out-of-distribution test set like O-PIAAGETS, although O-PIAAGETS itself is not yet developed for testing such a multi-agent system.
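The A-not-B logic above can be made concrete as a train/test split. The following is a minimal sketch, not the O-PIAAGETS task format: the task generator, field names, and locations are all hypothetical. Training tasks draw hiding locations from a narrow distribution around location A, while test tasks hide the object at location B, so the two distributions are meaningfully related (same paradigm) but non-overlapping.

```python
import random

def make_task(hide_x, jitter=0.1, rng=random):
    """A toy A-not-B trial: the goal is hidden near hide_x (arena x in [0, 1])."""
    return {"goal_x": hide_x + rng.uniform(-jitter, jitter)}

def a_not_b_split(n_train=100, n_test=20, loc_a=0.2, loc_b=0.8, seed=0):
    """I.i.d. training around location A; o.o.d. testing at location B."""
    rng = random.Random(seed)
    train = [make_task(loc_a, rng=rng) for _ in range(n_train)]
    test = [make_task(loc_b, rng=rng) for _ in range(n_test)]
    return train, test

train, test = a_not_b_split()
# Training and test hiding locations do not overlap, so an agent that has
# merely memorised "search near A" cannot succeed at test time.
assert max(t["goal_x"] for t in train) < min(t["goal_x"] for t in test)
```

The point of the sketch is the relation between the two distributions: they share the paradigm (an object hidden along one dimension), which is exactly what licenses the inference that success on both requires OP rather than location memorisation.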
3. Introducing O-PIAAGETS

In the previous section, we established three things:

1. OP poses a challenging logical problem. It is not trivial for an agent to have OP.
2. Computer vision and embodied agent results suggest that trained computer architectures solve tasks involving OP at a level significantly lower than that of humans.
3. Current benchmarks and testbeds for evaluating whether AI systems possess OP have limitations such that, even if an AI system had OP, we might not be able to tell with reasonable certainty.

Our novel testbed, Object-Permanence in Animal-Ai: GEneralisable Test Suites (O-PIAAGETS), overcomes the limitations of other testbeds by applying an out-of-distribution testing framework to a large, internally valid set of tasks adapted from comparative and developmental psychology and visual cognition research.

O-PIAAGETS uses the Animal-AI Environment to generate individual tasks for training and testing, based on theoretical and empirical findings in the psychology literature. The testbed has an internal structure in which certain tasks are designed to test certain aspects of OP understanding. There is also a tailored training curriculum to ensure out-of-distribution testing and more direct comparison between biological and artificial machines. This work complements and extends the work of Piloto et al. [29], Lampinen et al. [23], Crosby et al. [16], and Voudouris et al. [13].

3.1. The Animal-AI Environment

The Animal-AI Environment [16] is a 3D world with Euclidean geometry and Newtonian physics, built in Unity [38]. The environment contains several objects, a single agent, and a finite number of actions the agent can perform (move and rotate in the x-z plane). The agent is situated in a square arena. The arena can be populated with appetitive stimuli (green and yellow spheres), aversive stimuli (red spheres and red lava zones), pink ramps, and transparent and opaque² blocks and tunnels (see Figure 1). These objects can be any size, constrained only by the dimensions of the arena and the fact that two objects cannot occupy the same location (apart from lava zones). The lights can also be switched on or off for preset periods of time, removing all visual information (see Figure 7 for an example).

Figure 1: The Animal-AI Environment. A bird's eye view of the arena is given top centre. The various objects that can populate it are shown and described in the text.

Points are gained and lost through contact with rewards of differing size and significance, and punishments of differing severity. Obtaining a yellow sphere increases points. Obtaining a green sphere also increases points and is episode-ending. Obtaining a red sphere decreases points and is episode-ending, as is touching a red lava zone. All spheres can be stationary or in motion through all three dimensions. Points start at 0 and decrease linearly with each timestep over an episode, creating time pressure and therefore motivation for fast and decisive action.

² Of any RGB colour combination.

3.2. Structure of O-PIAAGETS

O-PIAAGETS adapts some tasks from the open-source Animal-AI Testbed, but mostly includes new ones. It currently contains 5000 tasks, divided into four suites, although it continues to expand as new features are released for the Animal-AI Environment. Three suites test different aspects of OP, and one suite contains controls for non-OP-based explanations. The three suites were motivated a priori by Brian Scholl's [7] exposition of OP research, which reviews work on OP from across psychology, neuroscience, and philosophy and argues that OP appears to be underpinned by three key cognitive strategies. Humans appear to reason about objects under occlusion as (a) existing on continuous spatiotemporal trajectories, (b) maintaining certain properties, such as size, but not necessarily others, such as colour, and (c) existing as unified cohesive wholes. O-PIAAGETS therefore contains a Spatiotemporal Continuity suite, a Persistence Through Property Change suite, and a Cohesion suite. Each suite is subdivided, based on the psychology and AI research, into sub-suites testing its different aspects. Those sub-suites are subdivided into experimental paradigms from the psychology literature; to maintain high internal validity, each sub-suite has at least 3 experimental paradigms. These are further divided into tasks, which are specific instantiations of an experimental paradigm as used in specific experiments. These tasks are composed of instances, which are procedurally generated variations of the global structure of the task, such as right and left versions or versions with goals of different sizes or in different positions. Finally, these instances are composed of variants, which are procedurally generated variations of the local structure of instances, with changes to the colours of walls and the starting orientation of the agent.

In every test below, the objective is simple: maximise reward. This involves obtaining yellow and green rewards while avoiding red rewards and 'lava', as quickly as possible.

Figure 2: An example detour task. To get to the reward, the agent must navigate around the wall and up the ramp. This means that the goal will go out of view through the movement of the agent.
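The suite → sub-suite → paradigm → task → instance → variant hierarchy described in Section 3.2 can be sketched as a simple data model. This is an illustrative sketch only: the class and field names are hypothetical and do not correspond to the actual O-PIAAGETS file format.

```python
from dataclasses import dataclass, field

@dataclass
class Variant:
    """Procedural variation of an instance's local structure."""
    wall_colour: str
    agent_rotation_deg: float

@dataclass
class Instance:
    """Procedural variation of a task's global structure."""
    mirrored: bool
    goal_size: float
    variants: list = field(default_factory=list)

@dataclass
class Task:
    """A specific instantiation of an experimental paradigm."""
    name: str
    instances: list = field(default_factory=list)

@dataclass
class Paradigm:
    """An experimental paradigm from the psychology literature."""
    name: str
    tasks: list = field(default_factory=list)

@dataclass
class SubSuite:
    """Holds at least three paradigms, to maintain internal validity."""
    name: str
    paradigms: list = field(default_factory=list)

@dataclass
class Suite:
    """Top level: Spatiotemporal Continuity, Persistence Through Property
    Change, Cohesion, or the suite of non-OP controls."""
    name: str
    sub_suites: list = field(default_factory=list)

# One branch of the hierarchy, sketched down to a single variant.
suite = Suite("Spatiotemporal Continuity", sub_suites=[
    SubSuite("egocentric OP", paradigms=[
        Paradigm("detour", tasks=[
            Task("detour-around-wall", instances=[
                Instance(mirrored=False, goal_size=1.0, variants=[
                    Variant(wall_colour="grey", agent_rotation_deg=90.0),
                ]),
            ]),
        ]),
    ]),
])
```

The design point the hierarchy captures is that global variation (instances) and local variation (variants) are generated at different levels, which is what lets training and test distributions be separated systematically rather than arbitrarily.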

3.2.1. Spatiotemporal Continuity
The Spatiotemporal Continuity suite examines how par-          a tunnel and come out of the other side (Burke, 1952).
ticipants reason about objects as persisting in the same       However, if the second object appears later than expected
spatiotemporal region, given initial starting velocities       or on a different trajectory, we do not identify it as the
and other interacting objects. This suite is divided into      same object [6] (see Figure 4). The Tunnel Effect tasks
two sub-suites: egocentric OP and allocentric OP.              enable us to probe where OP ‘breaks’ in the agent in
   Egocentric OP pertains to reasoning about objects per-      question, and how it compares to human performance.
sisting when they pass out of view through the actions of      In the Tunnel Effect tasks here and below, the agent
the agent. This allows us to evaluate how well an agent        is frozen until they have observed the whole scene, so
can learn about the identity and location of objects in        they don’t miss the important occlusion events we are
a region while also moving around that region, a vari-         probing, eliminating a potential explanation for why an
ant of the SLAM problem in robotics. An example of             agent failed on these tasks.
an egocentric OP task is a detour task where a goal is            In line with developments of the Animal-AI Environ-
observable but inaccessible behind an obstacle. The way        ment, we will introduce allocentric OP tasks involving
to obtain it is to detour around the obstacle such that        containment in stationary and moving containers, as
the goal is temporarily left out of sight. The logic here is   done in the LA-CATER and Lampinen et al. [23] testbeds
that one would only execute the detouring behaviour if         discussed earlier.
one believed that the goal would still exist when one has
finished detouring (see Figure 2).
                                                               3.2.2. Persistence Through Property Change
Allocentric OP pertains to reasoning about objects that pass out of view not because of the actions of the agent, but because they become occluded by another object. The Cup Task in Figure 3 is an example [4]. A goal is hidden inside a 'cup' for some time. To succeed, the agent must search in the correct 'cup'.

The Tunnel Effect paradigm is a second example. An object passes behind an occluder, and another emerges some time later. If the second object appears as a human would expect it to, given the first object's trajectory, we perceive it as though the first object has gone through the tunnel.

Figure 3: An example allocentric task inspired by the Primate Cognition Test Battery [4]. Red arrows indicate goals; the pale arrow indicates the agent.

The second suite of tests extends the Tunnel Effect tasks, investigating which properties of an object must change under occlusion for the pre-occlusion and post-occlusion objects to be classified as different. Scholl [7] reports that the Tunnel Effect is not disrupted by colour or shape changes, only by size changes and by the spatiotemporal changes in the previous sub-suite [39, 9, 6, 40]. Wilcox and Baillargeon [10], however, present evidence that the Tunnel Effect is disrupted by colour, shape, and texture changes. O-PIAAGETS permits more control over the timing and nature of changes, so it can be used for empirical study with humans to investigate these inconsistent results, as well as to analyse under what conditions OP breaks in AI agents.

Figure 4: A Tunnel Effect task. Humans would perceive the object in 1A as the same as the object in 1B, but the object in 2A as different to the object in 2B, because of the impossible trajectory.

Currently, this suite contains only one sub-suite, testing the Tunnel Effect with apparent size change under occlusion. However, in line with developments in the Animal-AI Environment, we are building sub-suites for apparent shape, colour, and pattern change. An example of a task in which size appears to change is provided in Figure 5. The post-occlusion object is smaller than the pre-occlusion object, so the agent must search for two distinct objects, not just the visible one.

Figure 5: A Tunnel Effect task manipulating the property of size.
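The same/different judgment that these tasks probe can be sketched as a simple rule combining spatiotemporal and size consistency. The sketch below is purely illustrative, not part of O-PIAAGETS: the `Sighting` representation, the tolerance values, and the function name are all our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sighting:
    """One observation of an object before or after occlusion (assumed representation)."""
    x: float         # position along the direction of motion
    t: float         # time of the observation, in seconds
    velocity: float  # observed speed, in units per second
    size: float      # apparent size

def same_object(pre: Sighting, post: Sighting,
                size_tolerance: float = 0.1,
                position_tolerance: float = 0.5) -> bool:
    """Judge whether a post-occlusion sighting is the same object as a
    pre-occlusion one, mirroring the pattern Scholl [7] reports: the
    percept survives colour/shape change but is disrupted by size change
    or by a spatiotemporally impossible reappearance."""
    # Spatiotemporal consistency: extrapolate the pre-occlusion trajectory
    # and check the object reappears roughly where and when it should.
    expected_x = pre.x + pre.velocity * (post.t - pre.t)
    spatiotemporal_ok = abs(post.x - expected_x) <= position_tolerance
    # Size constancy: apparent size must not change appreciably.
    size_ok = abs(post.size - pre.size) <= size_tolerance * pre.size
    return spatiotemporal_ok and size_ok

pre = Sighting(x=0.0, t=0.0, velocity=1.0, size=2.0)
consistent = same_object(pre, Sighting(x=2.0, t=2.0, velocity=1.0, size=2.0))  # True
shrunk = same_object(pre, Sighting(x=2.0, t=2.0, velocity=1.0, size=1.0))      # False
```

A colour change would leave both checks untouched, matching Scholl's finding; an agent whose policy implicitly computes something like this rule would search for two distinct objects in the size-change task of Figure 5.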
3.2.3. Object Cohesion

Scholl [7] argues that OP is not disrupted in human adults when the contours are partially or completely removed from a visual object representation, so long as size does not appear to change. Humans, and many animals, assume that an object is of constant size [41], even when contour information is partially occluded or completely removed and replaced with point lights [42, 43]. Currently, this suite contains only one sub-suite, examining size constancy under partial occlusion. An example is the aperture task in Figure 6, devised for O-PIAAGETS based on discussion in Scholl [7]. Agents watch a large green goal roll behind a wall with a small hole in it. The agent is then released and given the choice to turn left and seek out the large goal behind the wall, or turn right and seek out the entirely visible smaller goal. The smaller goal is larger than the hole in the wall, so agents that compare the number of green pixels visible at one time, without understanding that size remains constant under (brief) occlusion, will make the wrong choice.

Figure 6: The Aperture Task. A and B are before and after partial occlusion. Parts 1 and 3 are variants of the same instance, with differing wall colours. Part 2 is a different instance of the aperture task, the mirror image of Part 1.

3.3. Increasing Internal Validity

3.3.1. Control Suite

The fourth suite is a set of control tests that serve to determine whether agents can solve tests that do not measure OP. There are two sub-suites here. The first is an introduction to the environment, introducing basic controls and the objects present in the environment. These tasks allow an agent, human or artificial, to learn which objects increase reward, which decrease it, and which are inert. Agents that fail some or all of the tasks in the above three suites might not be failing because they lack OP, but because they do not, for example, navigate towards green rewards or away from red lava, or understand the utility of ramps for movement in the up-down plane. The second sub-suite contains further control tests for the OP tasks in the previous three suites. These are tests that do not require OP to be solved, but introduce the kinds of landscapes and choices an agent might have to deal with. This means we can determine whether poor performance on the OP tasks was a result of a lack of OP, or of a lack of understanding of the landscapes those tests took place in. Since every task in the test battery will require other abilities distinct from OP, these controls allow developers to check whether errors are a result of a lack of OP or a lack of some other ability. These tasks can be used either in training or for further testing. An example would be Figure 2 but without a grey wall and with a pink ramp the length of the blue platform. This increases internal validity: if agents perform well on the control task, but not on the equivalent OP task, then we have reason to believe that they lack OP. If they perform well on both, we have reason to believe that they possess OP. If they perform poorly on both, then there is some issue with understanding the environment or how to interact with it. If they perform well on the OP tasks but not the controls, then we have counter-intuitive evidence that OP can be decoupled from other abilities required to solve tasks in the environment.

3.3.2. Paradigms, Instances, Variants

Within the test suites themselves, two measures have been taken to increase internal validity. First, each task has several instances and variants. We have procedurally generated many versions of the same task that are mirror images of each other (left/right versions), have rewards and goals in different positions, or use different kinds of occluders. This counterbalanced design allows us to detect when agents are solving tasks through problem-irrelevant shortcuts. For example, in the aperture task in Figure 6.1, an agent with a bias towards turning left might appear to succeed, but would not succeed at the instance that is a mirror image of this task, as in 6.2. These instances have many variants, changing the colour (often randomly) and the initial orientation of the agent, as seen in 6.3. This allows us to control for policies such as 'search behind the grey obstacle', which may be successful in some tasks but do not indicate OP.

Second, the inclusion of several experimental paradigms in each sub-suite means they are mutually controlling. The philosophy of science tells us that no single experiment would be able to diagnose the presence or absence of OP [44, 45], because there are always alternative explanations that could be appealed to. Using several distinct experimental paradigms means that they can control for each other and help eliminate these alternative explanations. The cup task in Figure 3 could be solved by a policy of navigating to where the reward was last seen [46], which is not necessarily the same as understanding that the object continues to exist even though the agent cannot see it. An adaptation of Chiandetti and Vallortigara's [47] paradigm controls for this (see Figure 7). Here, the agent watches a reward roll away from it, across lava. Then the lights go out, removing visual information for a short period. When the lights go back on, the goal is not visible. However, there is only one place it can be. Going to where the reward was last seen would end in failure, by touching lava, and the position of the goal before lights-out provides no cue as to whether the agent should go right or left. The use of several experimental paradigms in each sub-suite reduces the likelihood of confounds that we have not foreseen.
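The first measure described in Section 3.3.2, procedurally generated counterbalanced instances and variants, can be sketched as a small generator. Everything below is an illustrative assumption of ours: the dictionary representation, the feature names, and the colour palette are not O-PIAAGETS's actual configuration format.

```python
import itertools

# Hypothetical minimal description of one aperture-task instance.
BASE_INSTANCE = {
    "paradigm": "aperture",
    "goal_side": "left",   # side of the partially occluded large goal
}

WALL_COLOURS = ["grey", "blue", "brown"]   # assumed palette
AGENT_ORIENTATIONS = [0, 90, 180, 270]     # initial headings in degrees (assumed)

def mirror(instance):
    """Left/right counterbalancing: swap the side of the occluded goal."""
    flipped = dict(instance)
    flipped["goal_side"] = "right" if instance["goal_side"] == "left" else "left"
    return flipped

def variants(instance):
    """Cross one instance with problem-irrelevant features (wall colour,
    initial agent orientation), so that shortcut policies such as
    'search behind the grey obstacle' fail to generalise."""
    for colour, orientation in itertools.product(WALL_COLOURS, AGENT_ORIENTATIONS):
        v = dict(instance)
        v["wall_colour"] = colour
        v["agent_orientation"] = orientation
        yield v

instances = [BASE_INSTANCE, mirror(BASE_INSTANCE)]         # counterbalanced pair
suite = [v for inst in instances for v in variants(inst)]  # 2 x 3 x 4 = 24 variants
```

Because left and right versions, colours, and orientations are fully crossed, success rates that differ systematically across these problem-irrelevant features flag a shortcut policy rather than OP.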
Figure 7: A task inspired by Chiandetti and Vallortigara's [47] study with day-old chicks.

4. Evaluating OP using O-PIAAGETS

4.1. Out-of-Distribution Testing

O-PIAAGETS facilitates out-of-distribution testing by providing a tailored training set using the control suite, and a separate test set using the three test suites. The control suite contains tasks where the positions and orientations of objects are specified, and tasks where those positions are randomly generated, providing in principle a very large amount of training data that is on a different distribution to the test data.

4.2. Measurement Layouts

Each variant in O-PIAAGETS is tagged with its position in the test battery (i.e., which suite, sub-suite, experimental paradigm, etc., it is a member of) as well as with features such as goal sizes, the abilities an agent might require in addition to OP to solve it, and the other variants it controls for. This yields an incredibly rich dataset for evaluating agents beyond merely aggregating their score or success across the test suites. For example, developers can explore how relevant and irrelevant features of the tests, such as goal size, occluder colour, or right/left variants, correlate with performance [48], and use this to evaluate whether an agent has OP or is using other policies to solve OP tasks. For example, assuming any agent interacting with O-PIAAGETS will make errors, including humans [13], it is important to evaluate how those errors are distributed. By hypothesis, an agent with OP will produce random error, uncorrelated with experimental paradigms, goal sizes, or the colours of occluders.

5. Future Directions and Conclusions

Using O-PIAAGETS, developers can robustly evaluate whether artificial embodied agents have OP using the methodologies of cognitive science. It improves on other benchmarks and testbeds in the field in terms of its size, its internal validity, and its ability to detect the presence of robust and generalisable OP in artificial systems. O-PIAAGETS is going through the final stages of development for general release of Version 1.0, including around 5000 tasks using the current Animal-AI Version 3.0.1. After validation with human participants and the development of baseline agents to characterise state-of-the-art performance on O-PIAAGETS, it will be expanded to include containment tasks, point lights, and shape, colour, and pattern changes. In its final form, O-PIAAGETS will provide a comprehensive and robust evaluation framework for assessing OP in artificial agents.

Acknowledgments

We thank the anonymous reviewers for their comments. This work was funded by the Future of Life Institute (FLI) under grant RFP2-152, the EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), and an ESRC DTP scholarship to KV (ES/P000738/1).

References

[1] J. Piaget, The Origins of Intelligence In The Child, Routledge & Kegan Paul, Ltd., 1923.
[2] R. Baillargeon, E. S. Spelke, S. Wasserman, Object permanence in five-month-old infants, Cognition 20 (1985) 191–208. doi:10.1016/0010-0277(85)90008-3.
[3] R. Baillargeon, J. Li, Y. Gertner, D. Wu, How Do Infants Reason about Physical Events?, in: U. Goswami (Ed.), The Wiley-Blackwell Handbook of Childhood Cognitive Development, Wiley, 2010, pp. 11–48.
[4] E. Herrmann, J. Call, M. V. Hernàndez-Lloreda, B. Hare, M. Tomasello, Humans Have Evolved Specialized Skills of Social Cognition: The Cultural Intelligence Hypothesis, Science 317 (2007) 1360–1366. doi:10.1126/science.1146282.
[5] J. G. Bremner, A. M. Slater, S. P. Johnson, Perception of Object Persistence: The Origins of Object Permanence in Infancy, Child Development Perspectives 9 (2015) 7–13. doi:10.1111/cdep.12098.
[6] J. I. Flombaum, B. J. Scholl, A temporal same-object advantage in the tunnel effect: Facilitated change detection for persisting objects, Journal of Experimental Psychology: Human Perception and Performance 32 (2006) 840–853. doi:10.1037/0096-1523.32.4.840.
[7] B. J. Scholl, Object Persistence in Philosophy and Psychology, Mind & Language 22 (2007) 563–591. doi:10.1111/j.1468-0017.2007.00321.x.
[8] J. I. Flombaum, B. J. Scholl, L. R. Santos, Spatiotemporal priority as a fundamental principle of object persistence, in: The Origins of Object Knowledge, 2009, pp. 135–164.
[9] J. I. Flombaum, S. M. Kundey, L. R. Santos, B. J. Scholl, Dynamic Object Individuation in Rhesus Macaques: A Study of the Tunnel Effect, Psychological Science 15 (2004) 795–800. doi:10.1111/j.0956-7976.2004.00758.x.
[10] T. Wilcox, R. Baillargeon, Object individuation in infancy: The use of featural information in reasoning about occlusion events, Cognitive Psychology 37 (1998) 97–155.
[11] T. Wilcox, Object individuation: infants' use of shape, size, pattern, and color, Cognition 72 (1999) 125–166. doi:10.1016/S0010-0277(99)00035-9.
[12] T. Wilcox, C. Chapa, Priming infants to attend to color and pattern information in an individuation task, Cognition 90 (2004) 265–302. doi:10.1016/S0010-0277(03)00147-1.
[13] K. Voudouris, M. Crosby, B. Beyret, J. Hernández-Orallo, M. Shanahan, M. Halina, L. Cheke, Direct Human-AI Comparison in the Animal-AI Environment, PsyArXiv, 2021. doi:10.31234/osf.io/me3xy.
[14] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, D. Tran, Detect-and-Track: Efficient Pose Estimation in Videos, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, 2018, pp. 350–359. doi:10.1109/CVPR.2018.00044.
[15] D. Gunning, Machine Common Sense Concept Paper, 2018.
[16] M. Crosby, B. Beyret, M. Shanahan, J. Hernández-Orallo, L. Cheke, M. Halina, The Animal-AI testbed and competition, in: NeurIPS 2019 Competition and Demonstration Track, PMLR, 2020, pp. 164–176.
[17] R. Baillargeon, Innate Ideas Revisited: For a Principle of Persistence in Infants' Physical Reasoning, Perspectives on Psychological Science 3 (2008) 2–13. doi:10.1111/j.1745-6916.2008.00056.x.
[18] Z. Pylyshyn, Perception, representation, and the world: The FINST that binds, in: D. Dedrick, L. Trick (Eds.), Computation, Cognition, and Pylyshyn, MIT Press, 2009, pp. 3–48.
[19] C. Fields, How humans solve the frame problem, Journal of Experimental & Theoretical Artificial Intelligence 25 (2013) 441–456. doi:10.1080/0952813X.2012.741624.
[20] C. A. Fields, The Principle of Persistence, Leibniz's Law, and the Computational Task of Object Re-Identification, Human Development 56 (2013) 147–166.
[21] M. Shanahan, Solving the Frame Problem: A Mathematical Investigation of the Common Sense Law of Inertia, MIT Press, 1997.
[22] R. Muñoz-Salinas, M. J. Marín-Jimenez, R. Medina-Carnicer, SPM-SLAM: Simultaneous localization and mapping with squared planar markers, Pattern Recognition 86 (2019) 156–171. doi:10.1016/j.patcog.2018.09.003.
[23] A. Lampinen, S. Chan, A. Banino, F. Hill, Towards mental time travel: a hierarchical memory for reinforcement learning agents, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 28182–28195.
[24] I. C. Uzgiris, J. M. Hunt, Assessment in Infancy: Ordinal Scales of Psychological Development, University of Illinois Press, Champaign, IL, US, 1975.
[25] J. Z. Leibo, C. d. M. d'Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, M. M. Botvinick, Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents, arXiv:1801.08116, 2018.
[26] A. Shamsian, O. Kleinfeld, A. Globerson, G. Chechik, Learning Object Permanence from Video, arXiv:2003.10469, 2020.
[27] P. Tokmakov, A. Jabri, J. Li, A. Gaidon, Object Permanence Emerges in a Random Walk along Memory, arXiv:2204.01784, 2022.
[28] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361. doi:10.1109/CVPR.2012.6248074.
[29] L. Piloto, A. Weinstein, D. TB, A. Ahuja, M. Mirza, G. Wayne, D. Amos, C.-c. Hung, M. Botvinick, Probing Physics Knowledge Using Tools from Developmental Psychology, arXiv:1804.01128, 2018.
[30] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. A. Wichmann, Shortcut learning in deep neural networks, Nature Machine Intelligence 2 (2020) 665–673. doi:10.1038/s42256-020-00257-z.
[31] A. Agrawal, D. Batra, D. Parikh, A. Kembhavi, Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, 2018, pp. 4971–4980. doi:10.1109/CVPR.2018.00522.
[32] M. Crosby, Building Thinking Machines by Solving Animal Cognition Tasks, Minds and Machines 30 (2020) 589–615. doi:10.1007/s11023-020-09535-6.
[33] D. Teney, E. Abbasnejad, K. Kafle, R. Shrestha, C. Kanan, A. van den Hengel, On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 407–417.
[34] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in 3D multiplayer games with population-based reinforcement learning, Science 364 (2019) 859–865. doi:10.1126/science.aau6249.
[35] J. Hernández-Orallo, Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement, Artificial Intelligence Review 48 (2017) 397–447. doi:10.1007/s10462-016-9505-7.
[36] J. Hernández-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence, Cambridge University Press, 2017.
[37] E. Triana, R. Pasnak, Object permanence in cats and dogs, Animal Learning & Behavior 9 (1981) 135–139. doi:10.3758/BF03212035.
[38] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar, D. Lange, Unity: A General Platform for Intelligent Agents, arXiv:1809.02627, 2020.
[39] L. Burke, On the Tunnel Effect, Quarterly Journal of Experimental Psychology 4 (1952) 121–138. doi:10.1080/17470215208416611.
[40] A. Michotte, G. Thines, G. Crabbé, Les complements amodaux des structures perceptives (Amodal completion of perceptual structures), Studia Psychologica, Publications Universitaires de Louvain, 1964.
[41] C. Fields, Trajectory Recognition as the Basis for Object Individuation: A Functional Model of Object File Instantiation and Object-Token Encoding, Frontiers in Psychology 2 (2011). URL: https://www.frontiersin.org/article/10.3389/fpsyg.2011.00049.
[42] G. Johansson, Configurations in Event Perception, Almqvist and Wiksell, Uppsala, Sweden, 1950; G. Johansson, Visual perception of biological motion and a model for its analysis, Perception and Psychophysics 14 (1973) 201–211.
[43] G. Johansson, Rigidity, Stability, and Motion in Perceptual Space, Nordisk Psykologi 10 (1958) 191–202. doi:10.1080/00291463.1958.10780387.
[44] C. Buckner, Understanding associative and cognitive explanations in comparative psychology, in: The Routledge Handbook of Philosophy of Animal Minds, Routledge, 2017, pp. 409–419.
[45] M. Dacey, Evidence in Default: Rejecting default models of animal minds, The British Journal for the Philosophy of Science (2021). doi:10.1086/714799.
[46] I. M. Pepperberg, M. R. Willner, L. B. Gravitz, Development of Piagetian Object Permanence in a Grey Parrot (Psittacus erithacus), Journal of Comparative Psychology 111 (1997) 22.
[47] C. Chiandetti, G. Vallortigara, Intuitive physical reasoning about occluded objects by inexperienced chicks, Proceedings of the Royal Society B: Biological Sciences 278 (2011) 2621–2627. doi:10.1098/rspb.2010.2381.
[48] R. Burnell, J. Burden, D. Rutar, K. Voudouris, L. Cheke, J. Hernandez-Orallo, Not a Number: Identifying Instance Features for Capability-Oriented Evaluation, forthcoming. URL: https://ryanburnell.com/wp-content/uploads/Burnell-et-al-2022-Not-a-Number.pdf.