=Paper=
{{Paper
|id=Vol-3169/paper2
|storemode=property
|title=Evaluating Object Permanence in Embodied Agents using the Animal-AI Environment
|pdfUrl=https://ceur-ws.org/Vol-3169/paper2.pdf
|volume=Vol-3169
|authors=Konstantinos Voudouris,Niall Donnelly,Danaja Rutar,Ryan Burnell,John Burden,José Hernández-Orallo,Lucy Cheke
|dblpUrl=https://dblp.org/rec/conf/ijcai/VoudourisDRBBHC22
}}
==Evaluating Object Permanence in Embodied Agents using the Animal-AI Environment==
Konstantinos Voudouris (1,2), Niall Donnelly (3), Danaja Rutar (1), Ryan Burnell (1), John Burden (1), José Hernández-Orallo (1,4) and Lucy G. Cheke (1,2)

(1) Leverhulme Centre for the Future of Intelligence, Cambridge, UK
(2) Department of Psychology, University of Cambridge, UK
(3) The College of Engineering, Mathematics, and Physical Sciences, University of Exeter, UK
(4) VRAIN, Universitat Politècnica de València, Spain

EBeM'22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria

Abstract: Object permanence, the understanding and belief that objects continue to exist even when they are not directly observable, is important for any agent interacting with the world. Psychologists have been studying object permanence in animals for at least 50 years, and in humans for almost 50 more. In this paper, we apply methodologies from psychology and cognitive science to present a novel testbed for evaluating whether artificial agents have object permanence. Built in the Animal-AI Environment, Object-Permanence In Animal-Ai: GEneralisable Test Suites (O-PIAAGETS) improves on other benchmarks for assessing object permanence in terms of both size and validity. We discuss the layout of O-PIAAGETS and how it can be used to robustly evaluate OP in embodied agents.

Keywords: Object Permanence, AI Evaluation, Embodied Agents, Animal-AI Environment, Developmental Psychology, Comparative Cognition

===1. Introduction===

Object Permanence (OP) is the understanding and belief that objects continue to exist even when they are not directly observable. In behavioural terms, an agent has OP when they behave as though objects continue to exist when they cannot see them. Human adults use OP to reason about how objects behave and interact in the external world. Credited as the first to empirically investigate this capability, Jean Piaget observed how infants develop the tendency to search for objects that became occluded [1]. Piaget's insights have been extended considerably by developmental and comparative psychologists, usually in the visual modality [2, 3, 4], although OP is an amodal phenomenon [5].

Humans and some animals appear to understand that objects continue to exist independently of them, with the same properties. However, when an object reappears, what makes us reidentify it as the same object as before? Object reidentification has been studied in visual cognition research with adults [6, 7, 8] and primates [9], and in developmental psychology with infants [10, 11, 12]. The relation between object reidentification and OP is manifest: when an object passes out of view, we believe that it continues to exist. When it passes back into view, we use knowledge about objects to determine whether this is the same object we saw previously. Here, we use OP to mean both classical OP and object reidentification.

OP has proven difficult to build into AI systems. Deep Reinforcement Learning systems perform significantly worse than human children when solving problems involving OP [13]. Tracking objects under partial occlusion appears to be difficult for modern computer vision methods [14]. The need for AI agents with robust OP is important for creating trustworthy embodied AI such as self-driving cars. Furthermore, robust object tracking under occlusion would have many applications in the field of robotics. However, the methods for evaluating whether an agent has OP suffer from a lack of precision, reliability, and validity. Developmental and comparative psychologists have been investigating OP in biological agents for around a century, developing many experimental paradigms along the way. Until now, AI research has not applied these methods to AI evaluation [15]. In this paper, we outline a new test battery, built in the Animal-AI Environment [16], for evaluating whether embodied artificial agents have OP: the Object-Permanence in Animal-Ai: GEneralisable Test Suites (O-PIAAGETS). O-PIAAGETS is a novel attempt to use experiments designed for investigating whether biological agents have OP for AI evaluation. First, we examine why OP is a challenge for AI research. Second, we critically review existing OP testbeds. Third, we outline the structure of the test battery and how it can be used to robustly investigate whether agents have OP. Finally, we discuss how O-PIAAGETS can be used for evaluation and how it improves on existing testbeds in the field.

===2. Background and Motivations===

====2.1. The Logical Problem of OP====

OP may appear to be a trivial capacity for an agent to have. The agent must simply understand that objects continue to exist when they are not directly observable. Indeed, Renée Baillargeon and colleagues [17] hypothesise that children are born with a Principle of Persistence, which states exactly this [18, 19]. Why, then, can't we endow AI systems with such a principle, bias, or heuristic? Can't we simply tell an agent that objects continue existing when they are occluded? Fields [20, 19] has discussed how the notion of a Principle of Persistence is untenable, due to the Frame Problem (FP).

The FP implies that endowing an agent, biological or artificial, with a principle of persistence is not trivial. It cannot be overcome with a representation as simple as "objects continue to exist even when they aren't observable". In its raw form, the FP demonstrates that when logically describing the effects of particular actions on objects in a domain, we must also describe ad nauseam all the non-effects of those actions on those objects. As Fields [19] says, it amounts to having to describe "everything that doesn't change in the universe as a result of turning off the fridge" (p. 443). In a domain where objects have certain properties that can change over time, as in all real-world scenarios, the FP implies that we can't simply say that the objects stay the same over time, without describing which properties remain unchanged and when [21].

When an agent can observe everything in a domain, and re-update what has and has not changed at every timestep, the FP rarely raises any issues. However, when objects become occluded, it becomes important to track which properties of those objects do and do not change, and when, in order to identify other objects as identical or different. For example, imagine a lion watching a small antelope pass behind some bushes and then seeing a large antelope emerge at the other side. It becomes useful to know that antelope don't change size over such time periods, and therefore that the smaller antelope continues to exist because of the persistence of its size (and other) properties. It also becomes useful to know that the antelope doesn't change when the lion changes its perspective, or occludes the antelope through its own actions, an analogue of the Simultaneous Localisation and Mapping (SLAM) problem in robotics [22]. Overcoming the FP either requires sophisticated deductive techniques [21], or robust inductive and abductive learning heuristics and biases [20, 19, 8]. It is therefore not as simple as imputing a Principle of Persistence to build AI systems with OP.

====2.2. Existing Evaluation Methods for OP in AI====

AI researchers, particularly those working on computer vision, embodied agents, and robotics, are interested in building AI systems capable of robustly reasoning about visual scenes, in a similar way to how humans and animals do. Researchers have built several evaluation frameworks for assessing whether embodied artificial agents and computer vision systems have OP.

Lampinen et al. [23] built OP tasks in a 3D Unity environment. Here, the agent was fixed in place as it watched three boxes. Periodically, objects would leap out of the three boxes, simultaneously or sequentially, with or without a refractory time lag. The agent would then be turned away from the boxes, released, and asked to go to the box with a particular object. If it chose the correct box, it was rewarded, similar to tasks used with human infants [24] and non-human primates [4]. Crosby et al. [16] developed a series of 90 OP tests as part of the Animal-AI Testbed and Olympics, inspired by and directly developed from developmental and comparative psychology. Some work has been done comparing embodied deep reinforcement learning agents to humans on these tasks. Children aged 6-10, with limited training, significantly outperformed the Deep Reinforcement Learning systems on the OP tasks in the Animal-AI Testbed [13], indicating there is room to improve these systems until they reach human-level performance. Leibo et al. [25] developed Psychlab for probing psychophysical phenomena in Deep Reinforcement Learning systems using cognitive science methods and qualitatively comparing performance with human participants, but they did not investigate OP.

Having OP is not only applicable to embodied agents, but also to passive computer vision systems engaged in object tracking. The Localisation Annotations Compositional Actions and TEmporal Reasoning (LA-CATER) dataset [26] is prominent in computer vision research. LA-CATER contains 14,000 video scenes in which objects can move in three dimensions, contain, and carry each other. Several tasks in this dataset happen to behave similarly to OP experiments used in psychology. For example, one task involves an object being occluded by one of three identical 'cups'; once occluded, the cups are moved relative to each other. This bears resemblance to the cup-tasks used in the Primate Cognition Test Battery [4] (see Figure 3) or in the Užgiris and Hunt [24] test battery for infants. Other benchmark datasets include ParallelDomain (PD) and KITTI [27]. PD is a synthetic dataset designed to test occlusions in driving scenarios. It contains 210 photo-realistic driving scenarios in city environments, from 3 camera angles, creating a dataset of 630 videos. KITTI [28] has 21 labelled videos of real-world city scenes, in which cars, pedestrians, and other objects pass behind each other and become partially or fully occluded, a small fraction of the total KITTI dataset [27].

Piloto et al. [29] directly applied a measurement framework innovated in developmental psychology to probe physics knowledge in artificial systems, including OP. Violation of Expectation has been used by the neo-Piagetian school of developmental psychology [3], investigating infants' knowledge about the world by determining when they are surprised to see something that violates their expectations. For example, infants at about 4.5 months tend to show surprise (by looking longer) if an object appears to change size whilst occluded [10, 11]. Piloto et al. procedurally generated 28 3-second videos that emulated a small subset of these studies, and used Kullback-Leibler divergence as the AI equivalent of looking time. They demonstrated the utility of this technique for probing physical knowledge in computer vision systems.
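The surprise measure of Piloto et al. can be illustrated with a short sketch. The code below is our own minimal illustration, not their implementation: it assumes some frame-prediction model already exists, and scores each frame by the KL divergence between the histogram of predicted values and the histogram of observed values, the AI analogue of looking time.

<syntaxhighlight lang="python">
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two discrete distributions given as non-negative weights."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def surprise_per_frame(predicted_frames, observed_frames, bins=16):
    """Violation-of-Expectation proxy: one surprise score per video frame.

    predicted_frames / observed_frames are sequences of arrays of pixel (or
    latent) values in [0, 1], produced by some frame-prediction model that is
    assumed to exist. Higher KL = more 'surprise', the analogue of longer
    looking time in infants.
    """
    scores = []
    for pred, obs in zip(predicted_frames, observed_frames):
        p_hist, _ = np.histogram(np.ravel(pred), bins=bins, range=(0.0, 1.0))
        q_hist, _ = np.histogram(np.ravel(obs), bins=bins, range=(0.0, 1.0))
        scores.append(kl_divergence(q_hist, p_hist))  # surprise of obs given pred
    return scores

# A physically possible occlusion video should give uniformly low surprise;
# an impossible one (e.g. the object shrinks behind the occluder) should
# produce a spike around the moment the object re-emerges.
</syntaxhighlight>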
In both computer vision and embodied AI, several methods for detecting when agents have OP have been proposed. However, with the exception of the work of Piloto et al. [29] and Lampinen et al. [23] at DeepMind, and the Animal-AI Testbed and Olympics, little attention has been paid to systematically applying the methodologies of psychology to try to understand and evaluate OP in artificial agents.

====2.3. Problems with Current Evaluation Frameworks for OP====

Two main problems exist with the methods for evaluating whether AI has OP. The first problem is that most of these benchmarks and testbeds use independent-and-identically-distributed (i.i.d.) test data, meaning testing data is drawn from the same distribution as training data. This especially applies to LA-CATER, PD, and KITTI. The second problem is a lack of internal validity. Sufficient controls to eliminate alternative explanations for certain behaviours are often lacking.

The problem with i.i.d. testing data is that it is in principle impossible to distinguish between an agent that has OP and one using problem-irrelevant shortcuts to maximise reward, appearing as if it has OP. This means that even if we had an agent that genuinely had OP, our evaluation methods limit how certain we can be of that. Geirhos et al. [30] argue that an effective measure against this is to test AIs on out-of-distribution (o.o.d.) test data, where training data and test data are drawn from different (but meaningfully related) distributions. This is related to the notion of transfer tasks in developmental and comparative psychology. The move from i.i.d. to o.o.d. testing is still not mainstream, but it is gaining prominence [31, 32, 33]. LA-CATER and the procedurally generated test sets mentioned earlier were generated according to a series of rules, with training, validation, and test sets divided arbitrarily. The PD and KITTI datasets were generated and collected non-procedurally, but again, the distinction between training and test sets is often arbitrary [27].

Moving from i.i.d. to o.o.d. test data promotes robustness in AI systems. Developing a testbed for OP in which training and test data are kept distinct means that we can be more certain that AI systems have OP if they perform successfully, rather than overfitting to the data distribution. This means we can evaluate whether an AI has an ability corresponding to OP, rather than a propensity for solving some distribution of tasks that require it [35, 36]. (For example, an anonymous reviewer pointed out that DeepMind's FTW agent [34] arguably has object permanence, since it can successfully fight players who duck for cover in a 3D capture-the-flag game. While this is certainly evidence for OP in an artificial agent, it remains speculative for now, since FTW has not yet been tested on an internally valid, out-of-distribution test set like O-PIAAGETS, although O-PIAAGETS itself is not yet developed for testing such a multi-agent system.)

O.o.d. testing gives researchers grounds to say they are testing for the presence of abilities. However, selecting a test distribution must be guided by some principle that tells us why the training and test distributions are meaningfully related. This takes us to the second problem for OP evaluation in AI: that testing lacks internal validity. Developmental and comparative psychologists have developed numerous experimental designs to test for the presence of cognitive abilities in biological agents, introducing numerous controls to eliminate alternative explanations. As a point of reference, take the classic A-not-B paradigm for testing OP. Participants are presented with an object of interest that is hidden over several trials at location A. In the AI context, this amounts to a training distribution around location A (with variance corresponding to minor differences between trials). If we only test the participant's ability to find the object of interest at A, true OP understanding is conflated with alternative explanations, such as memorising a spatial location or returning to a previously rewarding location, as infants under 9 months, and many animals, do [1, 37]. To eliminate (some of) these explanations, in the test condition participants are faced with an object hidden at location B. The testing distribution now includes objects hidden at B, and the relation between the two is meaningful in the context of OP, because an agent needs OP to solve the task. The logic is that one would only perform well on both training (A-only) and testing (B-only) if one had OP. Of course, there are further alternative explanations for correct search at locations A and B, such as simply searching where the experimenter's hand has just been [24]. So internal validity tends to increase the more diversity in training and test data there is, as they become mutually controlling.

Psychologically-inspired testbeds for evaluating OP in AI systems, such as Piloto et al. [29], Lampinen et al. [23], and Crosby et al. [16], remain small, and so internal validity remains relatively low. The confluence of low internal validity in some testbeds and the lack of o.o.d. testing means that even if an AI system genuinely has OP, our evaluation frameworks and metrics are not internally valid enough to show this. In this paper, we propose a novel, large testbed for conducting o.o.d. testing with high internal validity.
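To make the contrast between i.i.d. and o.o.d. testing concrete in this A-not-B setting, here is a minimal sketch (entirely illustrative; the task fields are invented, not O-PIAAGETS parameters): training tasks hide the goal around location A with small trial-to-trial jitter, while o.o.d. test tasks hide it at location B, so a memorised "return to A" policy no longer explains success.

<syntaxhighlight lang="python">
import random

def make_task(hide_x, hide_z, jitter=0.5, rng=random):
    """A hypothetical hiding-task configuration with a jittered goal position."""
    return {
        "goal_position": (hide_x + rng.uniform(-jitter, jitter),
                          hide_z + rng.uniform(-jitter, jitter)),
        "occluder": rng.choice(["cup", "wall", "tunnel"]),
    }

random.seed(0)
LOCATION_A, LOCATION_B = (10.0, 10.0), (30.0, 10.0)

# i.i.d. regime: training and test tasks are both drawn around location A.
train_tasks = [make_task(*LOCATION_A) for _ in range(1000)]
iid_test = [make_task(*LOCATION_A) for _ in range(100)]

# o.o.d. regime: test tasks hide the goal at location B instead. Success now
# requires tracking the hidden object rather than returning to a remembered,
# previously rewarded location.
ood_test = [make_task(*LOCATION_B) for _ in range(100)]
</syntaxhighlight>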
===3. Introducing O-PIAAGETS===

In the previous section, we established three things:

1. OP poses a challenging logical problem. It is not trivial for an agent to have OP.

2. Computer vision and embodied agent results suggest that trained computer architectures solve tasks involving OP at a level significantly lower than that of humans.

3. Current benchmarks and testbeds for evaluating whether AI systems possess OP have limitations such that even if an AI system had OP, we might not be able to tell with reasonable certainty.

Our novel testbed, Object-Permanence in Animal-Ai: GEneralisable Test Suites (O-PIAAGETS), overcomes the limitations of other testbeds by applying an out-of-distribution testing framework to a large, internally valid set of tasks adapted from comparative and developmental psychology and visual cognition research.

O-PIAAGETS uses the Animal-AI Environment to generate individual tasks for training and testing, based on theoretical and empirical findings in the psychology literature. The testbed has an internal structure in which certain tasks are designed to test certain aspects of OP understanding. There is also a tailored training curriculum to ensure out-of-distribution testing, and more direct comparison between biological and artificial machines. This work complements and extends the work of Piloto et al. [29], Lampinen et al. [23], Crosby et al. [16], and Voudouris et al. [13].

====3.1. The Animal-AI Environment====

Figure 1: The Animal-AI Environment. A bird's eye view of the arena is given top centre. The various objects that can populate it are shown and described in the text.

The Animal-AI Environment [16] is a 3D world with Euclidean geometry and Newtonian physics, built in Unity [38]. The environment contains several objects, a single agent, and a finite number of actions the agent can perform (move and rotate in the x-z plane). The agent is situated in a square arena. The arena can be populated with appetitive stimuli (green and yellow spheres), aversive stimuli (red spheres and red lava zones), pink ramps, and transparent and opaque blocks and tunnels (opaque objects can be of any RGB colour combination; see Figure 1). These objects can be any size, constrained only by the dimensions of the arena and the fact that two objects can't occupy the same location (apart from lava zones). The lights can also be switched on or off for preset periods of time, removing all visual information (see Figure 7 for an example).

Points are gained and lost through contact with rewards of differing size and significance, and punishments of differing severity. Obtaining a yellow sphere increases points. Obtaining a green sphere also increases points and is episode-ending. Obtaining a red sphere decreases points and is episode-ending, as does touching red lava zones. All spheres can be stationary or in motion through all three dimensions. Points start at 0 and decrease linearly with each timestep over an episode, creating time pressure and therefore motivation for fast and decisive action.
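The scoring rules above can be summarised in a few lines of bookkeeping. The sketch below is only our reading of the description in this subsection: the per-timestep decrement and reward magnitudes are placeholders, not the values used by the Animal-AI Environment.

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class EpisodeScore:
    points: float = 0.0          # points start at 0
    done: bool = False
    time_penalty: float = 0.01   # placeholder linear decrement per timestep

    def step(self, contact=None, contact_size=1.0):
        """Apply one timestep of the scoring rules described in Section 3.1."""
        if self.done:
            return self.points
        self.points -= self.time_penalty          # time pressure each step
        if contact == "yellow":                   # appetitive, non-terminal
            self.points += contact_size
        elif contact == "green":                  # appetitive, episode-ending
            self.points += contact_size
            self.done = True
        elif contact in ("red", "lava"):          # aversive, episode-ending
            self.points -= contact_size
            self.done = True
        return self.points

# Usage: an agent that dawdles pays the time penalty every step, so a fast,
# decisive route to a green sphere yields the highest score.
score = EpisodeScore()
score.step()                      # no contact, small time penalty
score.step(contact="yellow")      # pick up a yellow sphere
score.step(contact="green")       # end the episode on a green sphere
</syntaxhighlight>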
====3.2. Structure of O-PIAAGETS====

O-PIAAGETS adapts some tasks from the open-source Animal-AI Testbed, but mostly includes new ones. It currently contains 5000 tasks, divided into four suites, although it continues to expand as new features are released for the Animal-AI Environment. There are three suites which test different aspects of OP and one suite which contains controls for non-OP based explanations. The three suites were motivated a priori by Brian Scholl's [7] exposition of OP research. Here, Scholl reviews work on OP from across research in psychology, neuroscience, and philosophy, arguing that OP appears to be underpinned by three key cognitive strategies. Humans appear to reason about objects under occlusion as (a) existing on continuous spatiotemporal trajectories, (b) maintaining certain properties, such as size, but not necessarily others, such as colour, and (c) existing as unified cohesive wholes. O-PIAAGETS therefore contains a Spatiotemporal Continuity suite, a Persistence Through Property Change suite, and a Cohesion suite. Each suite is subdivided based on the psychology and AI research into sub-suites testing different aspects of the suites. Those sub-suites are subdivided into experimental paradigms from the psychology literature. To maintain high internal validity, each sub-suite has at least 3 experimental paradigms. These are further divided into tasks, which are specific instantiations of an experimental paradigm as used in specific experiments. These tasks are composed of instances, which are procedurally generated variations of the global structure of the task, such as right and left versions or versions with goals of different sizes or in different positions. Finally, these instances are composed of variants, which are procedurally generated variations of the local structure of instances, with changes to the colours of walls and the starting orientation of the agent.
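The suite, sub-suite, paradigm, task, instance, variant hierarchy can be pictured as nested records. The sketch below is purely illustrative (the field names and value choices are ours, not the O-PIAAGETS file format): a single task definition is expanded into counterbalanced instances (mirror images, different goal sizes) and cosmetic variants (wall colours, agent starting orientation).

<syntaxhighlight lang="python">
import itertools
import random
from dataclasses import dataclass, field

@dataclass
class Variant:
    wall_colour: str
    agent_start_deg: int

@dataclass
class Instance:
    mirrored: bool                  # left/right counterbalancing
    goal_size: float
    variants: list = field(default_factory=list)

@dataclass
class Task:
    suite: str                      # e.g. "Spatiotemporal Continuity"
    sub_suite: str                  # e.g. "Allocentric OP"
    paradigm: str                   # e.g. "Tunnel Effect"
    name: str
    instances: list = field(default_factory=list)

def generate(task, goal_sizes, n_variants=4, rng=random):
    """Procedurally expand one task into counterbalanced instances and variants."""
    for mirrored, goal_size in itertools.product([False, True], goal_sizes):
        inst = Instance(mirrored=mirrored, goal_size=goal_size)
        for _ in range(n_variants):
            inst.variants.append(Variant(
                wall_colour=rng.choice(["grey", "blue", "red", "random"]),
                agent_start_deg=rng.choice([0, 90, 180, 270]),
            ))
        task.instances.append(inst)
    return task

# Hypothetical example: a size-change Tunnel Effect task with two goal sizes.
tunnel = generate(Task("Persistence Through Property Change",
                       "Size Change", "Tunnel Effect", "size-shrink"),
                  goal_sizes=[1.0, 2.0])
</syntaxhighlight>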
In every test below, the objective is simple: maximise reward. This involves obtaining yellow and green rewards while avoiding red rewards and 'lava', as quickly as possible.

=====3.2.1. Spatiotemporal Continuity=====

Figure 2: An example detour task. To get to the reward, the agent must navigate around the wall and up the ramp. This means that the goal will go out of view through the movement of the agent.

The Spatiotemporal Continuity suite examines how participants reason about objects as persisting in the same spatiotemporal region, given initial starting velocities and other interacting objects. This suite is divided into two sub-suites: egocentric OP and allocentric OP.

Egocentric OP pertains to reasoning about objects persisting when they pass out of view through the actions of the agent. This allows us to evaluate how well an agent can learn about the identity and location of objects in a region while also moving around that region, a variant of the SLAM problem in robotics. An example of an egocentric OP task is a detour task, where a goal is observable but inaccessible behind an obstacle. The way to obtain it is to detour around the obstacle such that the goal is temporarily left out of sight. The logic here is that one would only execute the detouring behaviour if one believed that the goal would still exist when one has finished detouring (see Figure 2).

Allocentric OP pertains to reasoning about objects that pass out of view not because of the actions of the agent, but because they become occluded by another object. The Cup Task in Figure 3 is an example [4]. A goal is hidden inside a 'cup' for some time. To succeed, the agent would need to search in the correct 'cup'.

The Tunnel Effect paradigm is a second example. An object passes behind an occluder, and another emerges some time later. If the second object appears as a human would expect it to, given the first object's trajectory, we perceive it as though the first object has gone through a tunnel and come out of the other side [39]. However, if the second object appears later than expected or on a different trajectory, we do not identify it as the same object [6] (see Figure 4). The Tunnel Effect tasks enable us to probe where OP 'breaks' in the agent in question, and how it compares to human performance.

In the Tunnel Effect tasks here and below, the agent is frozen until it has observed the whole scene, so it does not miss the important occlusion events we are probing, eliminating a potential explanation for why an agent failed on these tasks.

In line with developments of the Animal-AI Environment, we will introduce allocentric OP tasks involving containment in stationary and moving containers, as done in the LA-CATER and Lampinen et al. [23] testbeds discussed earlier.

Figure 3: An example allocentric task inspired by the Primate Cognition Test Battery [4]. Red arrows indicate goals, the pale arrow indicates the agent.

Figure 4: A Tunnel Effect task. Humans would perceive the object in 1A as the same as the object in 1B, but the object in 2A as different to the object in 2B, because of the impossible trajectory.

=====3.2.2. Persistence Through Property Change=====

The second suite of tests extends the Tunnel Effect tasks, investigating which properties of an object must change under occlusion for the post-occlusion and pre-occlusion objects to be classified as different. Scholl [7] reports that the Tunnel Effect is not disrupted by colour or shape change, only by size changes and the spatiotemporal changes in the previous sub-suite [39, 9, 6, 40]. Wilcox and Baillargeon [10] present evidence that the Tunnel Effect is disrupted by colour, shape, and texture changes. O-PIAAGETS permits more control over the timing and nature of changes, so it can be used for empirical study with humans to investigate these inconsistent results, as well as to analyse under what conditions OP breaks in AI agents.

Currently, this suite only contains one sub-suite, testing the Tunnel Effect with apparent size change under occlusion. However, in line with developments in the Animal-AI Environment, we are building sub-suites for apparent shape, colour, and pattern change. An example of a task in which size appears to change is provided in Figure 5. The post-occlusion object is smaller than the pre-occlusion object, so the agent must search for two distinct objects, not just the visible one.

Figure 5: A Tunnel Effect task manipulating the property of size.

=====3.2.3. Object Cohesion=====

Scholl [7] argues that OP is not disrupted in human adults when the contours are partially or completely removed from a visual object representation, so long as size does not appear to change. Humans, and many animals, assume that an object is of constant size [41], even when contour information is partially occluded or completely removed and replaced with point lights [42, 43]. Currently, this suite contains only one sub-suite, examining size constancy under partial occlusion. An example of this is the aperture task in Figure 6, innovated for O-PIAAGETS based on discussion in Scholl [7]. The agent watches a large green goal roll behind a wall with a small hole in it. The agent is then released and given the choice to turn left and seek out the large goal behind the wall, or turn right and seek out the entirely visible smaller goal. The smaller goal is larger than the hole in the wall, so agents that compare the number of green pixels visible at one time, without understanding that size remains constant under (brief) occlusion, will make the wrong choice.

Figure 6: The Aperture Task. A and B are before and after partial occlusion. Parts 1 and 3 are variants of the same instance, with differing wall colours. Part 2 is a different instance of the aperture task, the mirror image of Part 1.

====3.3. Increasing Internal Validity====

=====3.3.1. Control Suite=====

The fourth suite is a set of control tests that serve to determine whether agents can solve tests that do not measure OP. There are two sub-suites here. The first is an introduction to the environment, introducing basic controls and the objects present in the environment. These tasks allow an agent, human or artificial, to learn which objects increase reward, which objects decrease reward, and which objects are inert. Agents that fail some or all of the tasks in the above three suites might not be failing because they lack OP, but because they do not, for example, navigate towards green rewards or away from red lava, or understand the utility of ramps for movement in the up-down plane. The second sub-suite contains further control tests for the OP tasks in the previous three suites. These are tests that do not require OP to be solved, but introduce the kinds of landscapes and choices an agent might have to make. This means we can determine whether poor performance on the OP tasks was a result of a lack of OP, or a lack of understanding of the landscapes those tests took place in. Since every task in the test battery will require other abilities distinct from OP, these controls allow developers to check whether errors are a result of a lack of OP or a lack of some other ability. These tasks can either be used in training or for further testing. An example would be Figure 2 but without a grey wall and with a pink ramp the length of the blue platform. This increases internal validity, because if agents perform well on the control task, but not well on the equivalent OP task, then we have reason to believe that they lack OP. If they perform well on both, we have reason to believe that they possess OP. If they perform poorly on both, then there is some issue with understanding the environment or how to interact with it. If they perform well on the OP tasks but not the controls, then we have counter-intuitive evidence that OP can be decoupled from other abilities required to solve tasks in the environment.
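The diagnostic logic of pairing each OP task with a matched control can be written down directly. The following is a minimal sketch of that logic (our own paraphrase; the 0.75 pass threshold is arbitrary), mapping the four outcomes described above to interpretations.

<syntaxhighlight lang="python">
def diagnose(op_score, control_score, threshold=0.75):
    """Interpret paired success rates (in [0, 1]) on an OP task and its matched control."""
    op_ok = op_score >= threshold
    control_ok = control_score >= threshold
    if control_ok and not op_ok:
        return "passes controls but fails OP tasks: evidence against OP"
    if control_ok and op_ok:
        return "passes OP tasks and controls: evidence for OP"
    if not control_ok and not op_ok:
        return "fails both: problem with the environment, not necessarily with OP"
    return "passes OP tasks but fails controls: counter-intuitive decoupling"

# Example: strong on the control detour, weak on the occluded detour.
print(diagnose(op_score=0.42, control_score=0.91))
</syntaxhighlight>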
=====3.3.2. Paradigms, Instances, Variants=====

Within the test suites themselves, two measures have been taken to increase internal validity. First, each task has several instances and variants. We have procedurally generated many versions of the same task that are mirror images of each other (left/right versions), have rewards and goals in different positions, or use different kinds of occluders. This counterbalanced design allows us to detect when agents are solving tasks through problem-irrelevant shortcuts. For example, in the aperture task in Figure 6.1, an agent with a bias towards turning left might appear to succeed, but would not succeed at the instance which is a mirror image of this task, as in Figure 6.2. These instances have many variants, changing the colour of walls (often randomly) and the initial orientation of the agent, as seen in Figure 6.3. This allows us to control for policies such as "search behind the grey obstacle", which may be successful in some tasks but do not indicate OP.

Second, the inclusion of several experimental paradigms in each sub-suite means they are mutually controlling. The philosophy of science tells us that no single experiment would be able to diagnose the presence or absence of OP [44, 45], because there are always alternative explanations that could be appealed to. Using several distinct experimental paradigms means that they can control for each other and help eliminate these alternative explanations. The cup task in Figure 3 could be solved by a policy of navigating to where the reward was last seen [46], which is not necessarily the same as understanding that the object continues to exist even though the agent can't see it. An adaptation of Chiandetti and Vallortigara's [47] paradigm controls for this (see Figure 7). Here, the agent watches a reward roll away from them across lava. Then the lights go out, removing visual information for a short period. When the lights go back on, the goal is not visible. However, there is only one place it can be. Going to where the reward was last seen would end in failure, by touching lava, and the position of the goal before the lights went out provides no cue as to whether the agent should go right or left. The use of several experimental paradigms in each sub-suite has the effect of reducing the likelihood of confounds that we have not foreseen.

Figure 7: A task inspired by Chiandetti and Vallortigara's [47] study with day-old chicks.

===4. Evaluating OP using O-PIAAGETS===

====4.1. Out-of-Distribution Testing====

O-PIAAGETS facilitates out-of-distribution testing by providing a tailored training set using the control suite, and a separate test set using the three test suites. The control suite contains tasks where the positions and orientations of objects are specified and tasks where those positions are randomly generated, providing in principle a very large amount of training data that is on a different distribution to the test data.

====4.2. Measurement Layouts====

Each variant in O-PIAAGETS is tagged with its position in the test battery (i.e., which suite, sub-suite, experimental paradigm, etc., it is a member of) as well as features such as goal sizes, the abilities an agent might require in addition to OP to solve it, and the other variants it controls for. This leads to an incredibly rich dataset for evaluating agents beyond merely aggregating their score or success across the test suites. For example, developers can explore how relevant and irrelevant features of the tests, such as goal size, occluder colour, or right/left variants, correlate with performance [48], and use this to evaluate whether an agent has OP or is using other policies to solve OP tasks. For example, assuming any agent interacting with O-PIAAGETS will make errors, including humans [13], it is important to evaluate how those errors are distributed. By hypothesis, an agent with OP will produce random error, uncorrelated with experimental paradigms, goal sizes, or the colours of occluders.
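As an illustration of how these tags could be analysed, the sketch below (our own; the column names are hypothetical) groups results by a feature that should be irrelevant under genuine OP, such as left/right mirroring or occluder colour. Roughly uniform success rates across such groups are consistent with OP, while large gaps point to a shortcut policy such as a turning bias.

<syntaxhighlight lang="python">
from collections import defaultdict

def success_rate_by_feature(results, feature):
    """results: iterable of dicts with a boolean 'success' and tagged features,
    e.g. {'success': True, 'mirrored': False, 'occluder_colour': 'grey'}."""
    totals, successes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[feature]] += 1
        successes[r[feature]] += int(r["success"])
    return {value: successes[value] / totals[value] for value in totals}

# If an agent genuinely has OP, errors should be roughly uniform across
# problem-irrelevant features; a large left/right gap suggests a turning bias.
results = [
    {"success": True,  "mirrored": False, "occluder_colour": "grey"},
    {"success": False, "mirrored": True,  "occluder_colour": "blue"},
    {"success": True,  "mirrored": False, "occluder_colour": "blue"},
]
print(success_rate_by_feature(results, "mirrored"))
print(success_rate_by_feature(results, "occluder_colour"))
</syntaxhighlight>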
===5. Future Directions and Conclusions===

Using O-PIAAGETS, developers can robustly evaluate whether artificial embodied agents have OP using the methodologies of cognitive science. It improves on other benchmarks and testbeds in the field in terms of its size, its internal validity, and its ability to detect the presence of robust and generalisable OP in artificial systems. O-PIAAGETS is going through the final stages of development for general release of Version 1.0, including around 5000 tasks using the current Animal-AI Version 3.0.1. After validation with human participants and the development of baseline agents to characterise state-of-the-art performance on O-PIAAGETS, it will be expanded to include containment tasks, point lights, and shape, colour, and pattern changes. In its final form, O-PIAAGETS will provide a comprehensive and robust evaluation framework for assessing OP in artificial agents.

===Acknowledgments===

We thank the anonymous reviewers for their comments. This work was funded by the Future of Life Institute (FLI) under grant RFP2-152, the EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), and an ESRC DTP scholarship to KV (ES/P000738/1).

===References===

[1] J. Piaget, The Origins of Intelligence In The Child, Routledge & Kegan Paul, Ltd., 1923.
[2] R. Baillargeon, E. S. Spelke, S. Wasserman, Object permanence in five-month-old infants, Cognition 20 (1985) 191–208. doi:10.1016/0010-0277(85)90008-3.
[3] R. Baillargeon, J. Li, Y. Gertner, D. Wu, How Do Infants Reason about Physical Events?, in: U. Goswami (Ed.), The Wiley-Blackwell Handbook of Childhood Cognitive Development, 2010, pp. 11–48.
[4] E. Herrmann, J. Call, M. V. Hernàndez-Lloreda, B. Hare, M. Tomasello, Humans Have Evolved Specialized Skills of Social Cognition: The Cultural Intelligence Hypothesis, Science 317 (2007) 1360–1366. doi:10.1126/science.1146282.
[5] J. G. Bremner, A. M. Slater, S. P. Johnson, Perception of Object Persistence: The Origins of Object Permanence in Infancy, Child Development Perspectives 9 (2015) 7–13. doi:10.1111/cdep.12098.
[6] J. I. Flombaum, B. J. Scholl, A temporal same-object advantage in the tunnel effect: Facilitated change detection for persisting objects, Journal of Experimental Psychology: Human Perception and Performance 32 (2006) 840–853. doi:10.1037/0096-1523.32.4.840.
[7] B. J. Scholl, Object Persistence in Philosophy and Psychology, Mind & Language 22 (2007) 563–591. doi:10.1111/j.1468-0017.2007.00321.x.
[8] J. I. Flombaum, B. J. Scholl, L. R. Santos, Spatiotemporal priority as a fundamental principle of object persistence, The Origins of Object Knowledge (2009) 135–164.
[9] J. I. Flombaum, S. M. Kundey, L. R. Santos, B. J. Scholl, Dynamic Object Individuation in Rhesus Macaques: A Study of the Tunnel Effect, Psychological Science 15 (2004) 795–800. doi:10.1111/j.0956-7976.2004.00758.x.
[10] T. Wilcox, R. Baillargeon, Object individuation in infancy: The use of featural information in reasoning about occlusion events, Cognitive Psychology 37 (1998) 97–155.
[11] T. Wilcox, Object individuation: infants' use of shape, size, pattern, and color, Cognition 72 (1999) 125–166. doi:10.1016/S0010-0277(99)00035-9.
[12] T. Wilcox, C. Chapa, Priming infants to attend to color and pattern information in an individuation task, Cognition 90 (2004) 265–302. doi:10.1016/S0010-0277(03)00147-1.
[13] K. Voudouris, M. Crosby, B. Beyret, J. Hernández-Orallo, M. Shanahan, M. Halina, L. Cheke, Direct Human-AI Comparison in the Animal-AI Environment, Technical Report, PsyArXiv, 2021. doi:10.31234/osf.io/me3xy.
[14] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, D. Tran, Detect-and-Track: Efficient Pose Estimation in Videos, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, 2018, pp. 350–359. doi:10.1109/CVPR.2018.00044.
[15] D. Gunning, Machine Common Sense Concept Paper, 2018.
[16] M. Crosby, B. Beyret, M. Shanahan, J. Hernández-Orallo, L. Cheke, M. Halina, The Animal-AI Testbed and Competition, in: NeurIPS 2019 Competition and Demonstration Track, PMLR, 2020, pp. 164–176.
[17] R. Baillargeon, Innate Ideas Revisited: For a Principle of Persistence in Infants' Physical Reasoning, Perspectives on Psychological Science 3 (2008) 2–13. doi:10.1111/j.1745-6916.2008.00056.x.
[18] Z. Pylyshyn, Perception, representation, and the world: The FINST that binds, in: D. Dedrick, L. Trick (Eds.), Computation, Cognition, and Pylyshyn, MIT Press, 2009, pp. 3–48.
[19] C. Fields, How humans solve the frame problem, Journal of Experimental & Theoretical Artificial Intelligence 25 (2013) 441–456. doi:10.1080/0952813X.2012.741624.
[20] C. A. Fields, The Principle of Persistence, Leibniz's Law, and the Computational Task of Object Re-Identification, Human Development 56 (2013) 147–166.
[21] M. Shanahan, Solving the Frame Problem: A Mathematical Investigation of the Common Sense Law of Inertia, MIT Press, 1997.
[22] R. Muñoz-Salinas, M. J. Marín-Jimenez, R. Medina-Carnicer, SPM-SLAM: Simultaneous localization and mapping with squared planar markers, Pattern Recognition 86 (2019) 156–171. doi:10.1016/j.patcog.2018.09.003.
[23] A. Lampinen, S. Chan, A. Banino, F. Hill, Towards mental time travel: a hierarchical memory for reinforcement learning agents, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 28182–28195.
[24] I. C. Uzgiris, J. M. Hunt, Assessment in Infancy: Ordinal Scales of Psychological Development, University of Illinois Press, Champaign, IL, US, 1975.
[25] J. Z. Leibo, C. d. M. d'Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, M. M. Botvinick, Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents, arXiv:1801.08116, 2018.
[26] A. Shamsian, O. Kleinfeld, A. Globerson, G. Chechik, Learning Object Permanence from Video, arXiv:2003.10469, 2020.
[27] P. Tokmakov, A. Jabri, J. Li, A. Gaidon, Object Permanence Emerges in a Random Walk along Memory, arXiv:2204.01784, 2022.
[28] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361. doi:10.1109/CVPR.2012.6248074.
[29] L. Piloto, A. Weinstein, D. TB, A. Ahuja, M. Mirza, G. Wayne, D. Amos, C.-c. Hung, M. Botvinick, Probing Physics Knowledge Using Tools from Developmental Psychology, arXiv:1804.01128, 2018.
[30] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. A. Wichmann, Shortcut learning in deep neural networks, Nature Machine Intelligence 2 (2020) 665–673. doi:10.1038/s42256-020-00257-z.
[31] A. Agrawal, D. Batra, D. Parikh, A. Kembhavi, Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, 2018, pp. 4971–4980. doi:10.1109/CVPR.2018.00522.
[32] M. Crosby, Building Thinking Machines by Solving Animal Cognition Tasks, Minds and Machines 30 (2020) 589–615. doi:10.1007/s11023-020-09535-6.
[33] D. Teney, E. Abbasnejad, K. Kafle, R. Shrestha, C. Kanan, A. van den Hengel, On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 407–417.
[34] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in 3D multiplayer games with population-based reinforcement learning, Science 364 (2019) 859–865. doi:10.1126/science.aau6249.
[35] J. Hernández-Orallo, Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement, Artificial Intelligence Review 48 (2017) 397–447. doi:10.1007/s10462-016-9505-7.
[36] J. Hernández-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence, Cambridge University Press, 2017.
[37] E. Triana, R. Pasnak, Object permanence in cats and dogs, Animal Learning & Behavior 9 (1981) 135–139. doi:10.3758/BF03212035.
[38] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar, D. Lange, Unity: A General Platform for Intelligent Agents, arXiv:1809.02627, 2020.
[39] L. Burke, On the Tunnel Effect, Quarterly Journal of Experimental Psychology 4 (1952) 121–138. doi:10.1080/17470215208416611.
[40] A. Michotte, G. Thines, G. Crabbé, Les compléments amodaux des structures perceptives (Amodal completion of perceptual structures), Studia Psychologica, Publications Universitaires de Louvain, 1964.
[41] C. Fields, Trajectory Recognition as the Basis for Object Individuation: A Functional Model of Object File Instantiation and Object-Token Encoding, Frontiers in Psychology 2 (2011). doi:10.3389/fpsyg.2011.00049.
[42] G. Johansson, Configurations in Event Perception, Almqvist & Wiksell, Uppsala, Sweden, 1950; G. Johansson, Visual perception of biological motion and a model for its analysis, Perception & Psychophysics 14 (1973) 201–211.
[43] G. Johansson, Rigidity, Stability, and Motion in Perceptual Space, Nordisk Psykologi 10 (1958) 191–202. doi:10.1080/00291463.1958.10780387.
[44] C. Buckner, Understanding associative and cognitive explanations in comparative psychology, in: The Routledge Handbook of Philosophy of Animal Minds, Routledge, 2017, pp. 409–419.
[45] M. Dacey, Evidence in Default: Rejecting default models of animal minds, The British Journal for the Philosophy of Science (2021). doi:10.1086/714799.
[46] I. M. Pepperberg, M. R. Willner, L. B. Gravitz, Development of Piagetian Object Permanence in a Grey Parrot (Psittacus erithacus), Journal of Comparative Psychology 111 (1997) 22.
[47] C. Chiandetti, G. Vallortigara, Intuitive physical reasoning about occluded objects by inexperienced chicks, Proceedings of the Royal Society B: Biological Sciences 278 (2011) 2621–2627. doi:10.1098/rspb.2010.2381.
[48] R. Burnell, J. Burden, D. Rutar, K. Voudouris, L. Cheke, J. Hernandez-Orallo, Not a Number: Identifying Instance Features for Capability-Oriented Evaluation, forthcoming. URL: https://ryanburnell.com/wp-content/uploads/Burnell-et-al-2022-Not-a-Number.pdf.