PhysWM: Physical World Models for Robot Learning

Marc Otto1,*,†, Octavio Arriaga2,*,†, Chandandeep Singh1,†, Jichen Guo2,† and Frank Kirchner1,2
1 Robotics Innovation Center, DFKI GmbH, Robert-Hooke-Straße 1, 28359 Bremen, Germany
2 Robotics Research Group, University of Bremen, Bibliothekstraße 1, 28359 Bremen, Germany

Abstract
Within the last decade, machine learning methods have shown remarkable results in pattern recognition tasks and behavior learning. However, when applied to real-world robotics tasks, these approaches have limitations, such as sample inefficiency and limited generalization to out-of-distribution samples. Despite the availability of precise physics in simulation engines, model-based reinforcement learning (RL) resorts to learning an approximation of these dynamics. Optimal control approaches, on the other hand, often assume a static, complete model of the world and address the simulation-reality gap by adding low-level controllers. To handle these issues, we propose a hybrid simulator consisting of differentiable physics and rendering modules, which employ symbolic representations and reduce the model complexity of neural policies while retaining gradient computation for model and behavior optimization. Moreover, this reduced parametric representation enables the use of Bayesian inference to estimate the uncertainty over physical parameters. This uncertainty quantification allows us to generate a curriculum of exploration behaviors for continuously improving the world model.

Keywords
differentiable physics, neural networks, uncertainty quantification, robot learning

1. Introduction

Despite remarkable success in pattern recognition and behavior learning tasks, developing intelligent robots remains a challenge for artificial intelligence algorithms [1]. Robots require an adaptable environment representation to plan movements under contact while interacting with objects with unknown properties such as mass, friction, or shape.
Moreover, performing tasks alongside humans requires autonomous systems that can explain their behavior and accurately quantify their uncertainty. The current machine learning paradigm addresses these issues by acquiring a large dataset of possible circumstances and testing generalization on an unseen fraction of the collected samples. However, this formulation has certain problems within the robotics domain [2]. The space of all possible robot experiences is too large, and datasets for certain robotic tasks require millions of samples. Moreover, optimizing millions of parameters to train a model may require power-intensive GPUs. As pointed out by [3, 4], deep networks and black-box AI in general ignore known physical equations and often remain uninterpretable.

NeSy'23: 17th International Workshop on Neural-Symbolic Learning and Reasoning, Certosa di Pontignano, Siena, Italy
* Corresponding author. † These authors contributed equally.
marc.otto@dfki.de (M. Otto); arriagac@uni-bremen.de (O. Arriaga); chandandeep.singh@dfki.de (C. Singh); jichen@uni-bremen.de (J. Guo); frank.kirchner@dfki.de (F. Kirchner)
ORCID: 0000-0002-5800-0578 (M. Otto); 0000-0002-8099-2534 (O. Arriaga); 0000-0003-4100-1002 (C. Singh); 0000-0002-0247-1987 (J. Guo); 0000-0002-1713-9784 (F. Kirchner)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

[Figure 1: PhysWM world model framework. Reality provides observations; the model update stage (single-percept prior, system identification, Bayesian inference) maintains the simulator; behavior generation (model exploration, trajectory optimization, reinforcement learning) produces goal-oriented behavior.]
Thus, we propose an architecture that leverages recent advances in inverse rendering [5, 6], differentiable physics [7, 8], probabilistic programming [9, 10, 11], and curriculum learning [12] to equip robots with an adaptable world model. We hypothesize that our framework can be used to generate efficient exploration behaviors in order to estimate the scene parameters.

2. Approach

We propose to use a hybrid differentiable physics simulation, a combination of a physics engine and neural networks similar to the one presented in [13]. Extended with parameter uncertainty, the simulation becomes a probabilistic graphical model. The model parameters are updated whenever observations have been gathered by the execution of behaviors. The iterative model update and behavior optimization are inspired by the Estimation-Exploration Algorithm [14] and by maximizing model disagreement [15]. Figure 1 presents our approach to obtaining and updating a world model using exploration behaviors. Given a prior for all model parameters and an observation (an image of the scene), one can use Bayesian inference with a differentiable renderer to obtain the posterior over those parameters. These scene parameters include the object's shape, pose, color, and material properties, as well as the scene's lighting. The identified quantities are used by the hybrid simulation to model interactions with a robot manipulator. Using our simulation, RL or trajectory optimization can generate behaviors that explore the environment further, for example, by lifting an object to validate its mass or pushing an object to obtain a friction model. Exploration can also capture an image from another perspective in order to validate an object's dimensions. The model is continuously updated using Bayesian inference until it is accurate enough to generate the goal-oriented behavior specified by a user.
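The estimation-exploration loop described above can be sketched in a few lines. The following is a minimal, illustrative NumPy version with a single unknown parameter (an object's mass), a particle approximation of the belief, and a handful of hypothetical probing forces; the toy dynamics and all names are our own assumptions, not the framework's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mass = 2.0                      # unknown ground-truth parameter (toy "reality")

def simulate(mass, force):
    """Toy forward model: acceleration of a lifted object under a given force."""
    return force / mass - 9.81

def observe(force):
    """'Reality' returns a noisy acceleration measurement."""
    return simulate(true_mass, force) + rng.normal(0.0, 0.05)

# Prior over the mass parameter, represented by samples (a particle approximation).
particles = rng.uniform(0.5, 5.0, size=5000)

for step in range(5):
    # Exploration: pick the probing force whose predicted outcome is most uncertain.
    candidate_forces = np.array([10.0, 20.0, 30.0])
    outcome_var = [simulate(particles, f).var() for f in candidate_forces]
    force = candidate_forces[int(np.argmax(outcome_var))]

    # Model update: Bayesian reweighting of the particles against the observation.
    y = observe(force)
    weights = np.exp(-0.5 * ((simulate(particles, force) - y) / 0.05) ** 2)
    weights /= weights.sum()
    particles = rng.choice(particles, size=particles.size, p=weights)

print(particles.mean())  # posterior mean approaches the true mass of 2.0
```

In the full framework, the forward model is the hybrid differentiable simulation, the probing actions come from RL or trajectory optimization, and the belief update uses MCMC rather than particle reweighting.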
Tasks such as poking, pushing, pick-and-place, stacking, or billiards are used to evaluate approaches for building world models [16, 17, 18]. For instance, poking is used in [17] to learn an intuitive physical world model. We re-use these tasks as benchmarks and aim at pouring water and learning curling behaviors for unknown objects to test our framework's generalization.

3. Adaptable world model representation

Differentiable physics and rendering. Differentiable physics engines can be used by learning methods and optimal control, providing gradients for the optimization criteria. An overview of differentiable simulators and applications is given in Appendix A.1. The approach proposed in this paper uses a combination of a differentiable physics simulation and a differentiable renderer to create a model of the world. Our differentiable simulator is hybrid: it is augmented with neural networks to make it more data-efficient and generalizable than purely data-driven models, thereby allowing an efficient reduction of the sim-to-real gap [13]. The proposed rendering engine is built on JAX [19], which enables it to render images on CPU, GPU, and TPU. Additionally, it maintains compatibility with modern optimization libraries [20], deep learning frameworks [21], probabilistic programming languages [9], and posterior sampling libraries [22].

Probabilistic graphical model. The world model can be represented as a probabilistic graphical model [23] in a hybrid differentiable physics simulation. It consists of multiple nodes corresponding to the simulation parameters associated with an object in the environment. Links between objects represent possible causal relationships. Optimizing probabilistic programs often resorts to sampling algorithms such as Markov chain Monte Carlo (MCMC), which are known to be computationally expensive. To counter the computational costs of MCMC, probabilistic programming languages have been built on hardware-accelerated kernels [24].
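To make the graphical-model view concrete, a scene can be held as a graph whose nodes carry prior distributions over simulation parameters and whose links carry relations between objects. The sketch below is a minimal NumPy illustration; the object names, parameter choices, and prior families are invented for the example and are not prescribed by the framework.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal scene graph: each object node holds prior distributions over its
# simulation parameters; links could carry causal relations (e.g. "rests on").
scene = {
    "box": {
        "mass":     lambda: rng.lognormal(mean=0.0, sigma=0.5),   # kg
        "friction": lambda: rng.beta(2.0, 2.0),                   # coefficient in (0, 1)
        "size":     lambda: rng.uniform(0.05, 0.3, size=3),       # edge lengths in m
    },
    "ground": {
        "friction": lambda: rng.beta(2.0, 2.0),
    },
}
links = [("box", "rests_on", "ground")]

def sample_scene(scene):
    """Draw one concrete simulation configuration from the priors."""
    return {obj: {k: d() for k, d in params.items()} for obj, params in scene.items()}

config = sample_scene(scene)
print(config["box"]["mass"])  # one concrete mass value for the simulator
```

Each sampled configuration is one concrete simulation the engine can run; inference replaces these priors with posteriors once observations arrive.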
Nevertheless, world models can be learned, as in [25, 26, 18], and used as environment representations for optimal control. Furthermore, the world model is not static but evolves with changes in the environment, and it can be explicitly updated using causal interventions [27].

4. Simulation parameter estimation

Optimization of parameter distributions. Uncertainty and noise are two major concerns in robotics [28]. Using active inference, agents improve the predictions made by their internal world model and behave in a way that prevents the occurrence of ambiguity [29]. To account for uncertainty, simulations are often enhanced by dynamics randomization or model ensembles, making robot behaviors learned in simulation more robust for transfer to the real robot [30, 31, 32, 33]. Expert knowledge is required to set up the randomization mean and variance, an effort that has been reduced via adaptive domain randomization [34]. Robot-specific choices of relevant parameters [35] and the computational effort of these sampling-based methods can be overcome when the uncertainty of parameters is part of the simulation and is propagated to behavior outcomes. Thus, we propose to update model parameter distributions instead of single values. Given our prior distributions and observations, we compute the posterior of the simulation parameters using MCMC, as exemplified in Figure 3 in Appendix A.2.

Exploration behaviors. A white-box model of robot dynamics can be obtained by system identification with robot movements explicitly optimized to obtain suitable data, called excitation trajectories. Our hybrid approach of model-based and data-driven simulation components can profit from a similar exploration strategy. Ideally, one would compute the expected entropy reduction of each possible action for probing the environment. As this does not scale well to large state spaces, local optimizations of the expected information gain have been proposed [36].
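The MCMC update of a parameter distribution described above can be sketched with a random-walk Metropolis sampler, the same family as the RMH scheme used in Appendix A.2. The example below is a toy NumPy sketch with an invented one-parameter friction model (stopping distance of a pushed object) and a single noisy observation; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy observation: stopping distance of a pushed object, d = v0^2 / (2 * mu * g),
# measured with noise; mu (sliding friction) is the simulation parameter to infer.
v0, g, true_mu, noise = 1.5, 9.81, 0.4, 0.01
d_obs = v0**2 / (2 * true_mu * g) + rng.normal(0.0, noise)

def log_post(mu):
    """Log posterior: uniform prior on (0.05, 1) times a Gaussian likelihood."""
    if not 0.05 < mu < 1.0:
        return -np.inf
    d_pred = v0**2 / (2 * mu * g)
    return -0.5 * ((d_pred - d_obs) / noise) ** 2

# Random-walk Metropolis: propose a perturbed mu, accept with the usual ratio.
mu, lp = 0.5, log_post(0.5)
samples = []
for _ in range(20000):
    prop = mu + rng.normal(0.0, 0.05)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        mu, lp = prop, lp_prop
    samples.append(mu)
posterior = np.array(samples[5000:])    # drop burn-in
print(posterior.mean(), posterior.std())
```

The resulting samples approximate the posterior over the friction coefficient; the posterior standard deviation is exactly the quantity that the exploration behaviors below try to reduce.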
In [15], the idea of maximizing model disagreement is applied as an intrinsic motivation to explore the candidate models' areas of uncertainty. As we model simulation parameters as distributions, we can define the loss function to explicitly reward uncertainty in the outcome. We expect this approach to explore the environment more effectively than the sampling-based disagreement measure, since the latter relies on a fixed number of candidate models to only approximate parameter uncertainty.

5. Curriculum Learning with Complexity Levels

Complexity levels. The parameter set of a world model can become arbitrarily large as the model is fine-tuned to represent reality in more detail. As shown by [37], a multi-fidelity simulation for RL, in which the same environment is modeled by simulations of different complexity, can reduce the training time spent in the more complex environments, including real trials. Simulations of lower and higher fidelity share parameters, so one complexity level profits from model improvements on another. We define an iterative approach to the minimum required complexity [38] that describes a world model, a policy, and a reward model for a given task (see Table 2 in Appendix A.3). We expect the search in the more abstract simulation to be faster, making a global search feasible.

Automatizing the curriculum. For a highly autonomous system improving its world model and behavior, the ability to switch between complexity levels is needed. In curriculum sets [39], the agent focuses on improving the modules for which it is making the most progress, while in active domain randomization [33], the parameter distribution is adjusted automatically to select an intermediate level of difficulty. This drives a curriculum as the learning agent improves on the given environment setting: once the task is solved for the current simulation values, settings that were previously too difficult become feasible at intermediate difficulty.
Extending this principle to complexity levels, we can temporarily exclude reward model components as well as physical aspects to focus learning on a part of the policy and parameters, as exemplified in Appendix A.4. We hypothesize that our learning framework will therefore allow sample-efficient model updates with MCMC and policy updates with RL.

6. Conclusion and Outlook

This work presents our approach to overcoming the sample inefficiency of data-driven methods for estimating simulation parameters. To do so, behaviors for exploring the remaining world model uncertainties are generated, and the uncertainty is quantified explicitly or via candidate model disagreement. For efficiently generating behaviors, deep RL and optimal control can use the gradients provided by the differentiable simulation directly. Creating an adaptable world model of appropriate complexity is addressed by automatizing a curriculum in which the model complexity is adapted based on experience gathered in the given scenario.

Acknowledgments

This work has been performed in the PhysWM project funded by the German Aerospace Center (DLR) with federal funds (grant numbers 50RA2126A and 50RA2126B) from the German Federal Ministry of Economic Affairs and Climate Action (BMWK). We would like to thank Dr.-Ing. Alexander Fabisch and Dr. rer. nat. Shivesh Kumar as well as the reviewers from the NeSy workshop for their insightful comments on our paper.

References

[1] G.-Z. Yang, J. Bellingham, P. E. Dupont, P. Fischer, L. Floridi, R. Full, N. Jacobstein, V. Kumar, M. McNutt, R. Merrifield, et al., The grand challenges of Science Robotics, Science Robotics 3 (2018) eaar7650.
[2] J. Kober, J. A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey, The International Journal of Robotics Research 32 (2013) 1238–1274.
[3] R. Yu, P. Perdikaris, A.
Karpatne, Physics-guided AI for large-scale spatiotemporal data, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 4088–4089.
[4] M. Lutter, J. Peters, Combining physics and deep learning to learn continuous-time dynamics models, arXiv preprint arXiv:2110.01894 (2021).
[5] S. Laine, J. Hellsten, T. Karras, Y. Seol, J. Lehtinen, T. Aila, Modular primitives for high-performance differentiable rendering, ACM Transactions on Graphics 39 (2020).
[6] W. Jakob, S. Speierer, N. Roussel, D. Vicini, Dr.Jit: A just-in-time compiler for differentiable rendering, ACM Transactions on Graphics (Proceedings of SIGGRAPH) 41 (2022). doi:10.1145/3528223.3530099.
[7] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, O. Bachem, Brax - a differentiable physics engine for large scale rigid body simulation, 2021. URL: http://github.com/google/brax.
[8] E. Todorov, T. Erez, Y. Tassa, MuJoCo: A physics engine for model-based control, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2012, pp. 5026–5033.
[9] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, R. A. Saurous, TensorFlow Distributions, arXiv preprint arXiv:1711.10604 (2017).
[10] M. F. Cusumano-Towner, F. A. Saad, A. K. Lew, V. K. Mansinghka, Gen: A general-purpose probabilistic programming system with programmable inference, 2019, pp. 221–236. doi:10.1145/3314221.3314642.
[11] D. Phan, N. Pradhan, M. Jankowiak, Composable effects for flexible and accelerated probabilistic programming in NumPyro, arXiv preprint arXiv:1912.11554 (2019).
[12] R. Portelas, C. Colas, L. Weng, K. Hofmann, P.-Y.
Oudeyer, Automatic curriculum learning for deep RL: A short survey, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Yokohama, Japan, 2020, pp. 4819–4825. doi:10.24963/ijcai.2020/671.
[13] E. Heiden, D. Millard, E. Coumans, Y. Sheng, G. S. Sukhatme, NeuralSim: Augmenting differentiable simulators with neural networks, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021, pp. 9474–9481.
[14] J. Bongard, H. Lipson, Nonlinear system identification using coevolution of models and tests, IEEE Transactions on Evolutionary Computation 9 (2005) 361–384. doi:10.1109/TEVC.2005.850293.
[15] D. Pathak, D. Gandhi, A. Gupta, Self-supervised exploration via disagreement, in: Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 5062–5071. URL: https://proceedings.mlr.press/v97/pathak19a.html.
[16] O. Ahmed, F. Träuble, A. Goyal, A. Neitz, Y. Bengio, B. Schölkopf, M. Wüthrich, S. Bauer, CausalWorld: A robotic manipulation benchmark for causal structure and transfer learning, arXiv preprint arXiv:2010.04296 (2020).
[17] P. Agrawal, A. Nair, P. Abbeel, J. Malik, S. Levine, Learning to poke by poking: Experiential learning of intuitive physics, 2017. arXiv:1606.07419 [cs].
[18] O. Biza, T. Kipf, D. Klee, R. Platt, J.-W. van de Meent, L. L. Wong, Factored world models for zero-shot generalization in robotic manipulation, arXiv preprint arXiv:2202.05333 (2022).
[19] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, et al., JAX: composable transformations of Python+NumPy programs, 2018.
[20] I. Babuschkin, K. Baumli, A. Bell, S. Bhupatiraju, J. Bruce, P. Buchlovsky, D.
Budden, Cai, et al., The DeepMind JAX Ecosystem, 2020. URL: http://github.com/deepmind.
[21] P. Kidger, C. Garcia, Equinox: neural networks in JAX via callable PyTrees and filtered transformations, arXiv preprint arXiv:2111.00254 (2021).
[22] J. Lao, R. Louf, BlackJAX: A sampling library for JAX, 2020. URL: http://github.com/blackjax-devs/blackjax.
[23] N. Gothoskar, M. Cusumano-Towner, B. Zinberg, M. Ghavamizadeh, F. Pollok, A. Garrett, J. B. Tenenbaum, D. Gutfreund, V. K. Mansinghka, 3DP3: 3D scene perception via probabilistic programming, 2021. doi:10.48550/arXiv.2111.00312. arXiv:2111.00312 [cs].
[24] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, R. A. Saurous, TensorFlow Distributions, 2017. doi:10.48550/arXiv.1711.10604. arXiv:1711.10604 [cs, stat].
[25] P. Wu, A. Escontrela, D. Hafner, P. Abbeel, K. Goldberg, DayDreamer: World models for physical robot learning, in: Conference on Robot Learning, PMLR, 2023, pp. 2226–2240.
[26] L. Zhang, G. Yang, B. C. Stadie, World model as a graph: Learning latent landmarks for planning, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 12611–12620. URL: https://proceedings.mlr.press/v139/zhang21x.html.
[27] J. Pearl, Causal inference, Causality: objectives and assessment (2010) 39–58.
[28] A. Fabisch, C. Petzoldt, M. Otto, F. Kirchner, A survey of behavior learning applications in robotics - state of the art and perspectives, arXiv preprint arXiv:1906.01868 (2019).
[29] T. Parr, G. Pezzulo, K. J. Friston, Active Inference: The Free Energy Principle in Mind, Brain, and Behavior, MIT Press, 2022. doi:10.7551/mitpress/12441.001.0001.
[30] J. Hwangbo, J. Lee, A. Dosovitskiy, D.
Bellicoso, V. Tsounis, V. Koltun, M. Hutter, Learning agile and dynamic motor skills for legged robots, Science Robotics 4 (2019) eaau5872. doi:10.1126/scirobotics.aau5872.
[31] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, L. Zhang, Solving Rubik's Cube with a robot hand, arXiv:1910.07113 [cs, stat] (2019).
[32] Y. Wu, W. Yan, T. Kurutach, L. Pinto, P. Abbeel, Learning to manipulate deformable objects without demonstrations, in: Robotics: Science and Systems XVI, 2020. doi:10.15607/RSS.2020.XVI.065.
[33] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, L. Paull, Active domain randomization, in: Proceedings of the Conference on Robot Learning, PMLR, 2020, pp. 1162–1176. URL: https://proceedings.mlr.press/v100/mehta20a.html.
[34] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, D. Fox, Closing the sim-to-real loop: Adapting simulation randomization with real world experience, in: 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8973–8979. doi:10.1109/ICRA.2019.8793789.
[35] Z. Xie, X. Da, M. van de Panne, B. Babich, A. Garg, Dynamics randomization revisited: A case study for quadrupedal locomotion, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 4955–4961. doi:10.1109/ICRA48506.2021.9560837.
[36] A. T. Taylor, T. A. Berrueta, T. D. Murphey, Active learning in robotics: A review of control principles, Mechatronics 77 (2021) 102576.
doi:10.1016/j.mechatronics.2021.102576.
[37] M. Cutler, T. J. Walsh, J. P. How, Real-world reinforcement learning via multifidelity simulators, IEEE Transactions on Robotics 31 (2015) 655–671. doi:10.1109/TRO.2015.2419431.
[38] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Texts in Computer Science, Springer International Publishing, Cham, 2019. doi:10.1007/978-3-030-11298-1.
[39] C. Colas, P. Fournier, M. Chetouani, O. Sigaud, P.-Y. Oudeyer, CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning, in: Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 1331–1340. URL: https://proceedings.mlr.press/v97/colas19a.html.
[40] J. Liang, M. C. Lin, Differentiable physics simulation, in: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
[41] P. M. Wensing, S. Kim, J.-J. E. Slotine, Linear matrix inequalities for physically consistent inertial parameter identification: A statistical perspective on the mass distribution, IEEE Robotics and Automation Letters 3 (2017) 60–67.
[42] T. Lee, P. M. Wensing, F. C. Park, Geometric robot dynamic identification: A convex programming approach, IEEE Transactions on Robotics 36 (2019) 348–365.
[43] F. Meier, A. Wang, G. Sutanto, Y. Lin, P. Shah, Differentiable and learnable robot models, arXiv preprint arXiv:2202.11217 (2022).
[44] E. Heiden, D. Millard, H. Zhang, G. S. Sukhatme, Interactive differentiable simulation, arXiv preprint arXiv:1905.10706 (2019).
[45] A. Patel, S. L. Shield, S. Kazi, A. M. Johnson, L. T. Biegler, Contact-implicit trajectory optimization using orthogonal collocation, IEEE Robotics and Automation Letters 4 (2019) 2242–2249.
[46] A.
O. Önol, P. Long, T. Padır, A comparative analysis of contact models in trajectory optimization for manipulation, in: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018, pp. 1–9.
[47] K. Werling, D. Omens, J. Lee, I. Exarchos, C. K. Liu, Fast and feature-complete differentiable physics for articulated rigid bodies with contact, arXiv preprint arXiv:2103.16021 (2021).
[48] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, et al., Theano: A Python framework for fast computation of mathematical expressions, arXiv e-prints (2016) arXiv:1605.
[49] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, 2017.
[50] D. Maclaurin, Modeling, inference and optimization with composable differentiable procedures, 2016.
[51] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, Journal of Machine Learning Research 18 (2018) 1–43.
[52] F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, J. Z. Kolter, End-to-end differentiable physics for learning and control, Advances in Neural Information Processing Systems 31 (2018).
[53] J. Degrave, M. Hermans, J. Dambre, et al., A differentiable physics engine for deep learning in robotics, Frontiers in Neurorobotics (2019) 6.
[54] Y. Hu, L. Anderson, T.-M. Li, Q. Sun, N. Carr, J. Ragan-Kelley, F. Durand, DiffTaichi: Differentiable programming for physical simulation, 2020. doi:10.48550/arXiv.1910.00935. arXiv:1910.00935 [physics, stat].
[55] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, O. Bachem, Brax - a differentiable physics engine for large scale rigid body simulation, 2021. arXiv:2106.13281 [cs].
[56] M. Geilinger, D. Hahn, J. Zehnder, M. Bächer, B.
Thomaszewski, S. Coros, ADD: Analytically differentiable dynamics for multi-body systems with frictional contact, ACM Transactions on Graphics (TOG) 39 (2020) 1–15.
[57] J. Xu, T. Chen, L. Zlokapa, M. Foshey, W. Matusik, S. Sueda, P. Agrawal, An end-to-end differentiable framework for contact-aware robot design, arXiv preprint arXiv:2107.07501 (2021).
[58] J. Liang, M. Lin, V. Koltun, Differentiable cloth simulation for inverse problems, Advances in Neural Information Processing Systems 32 (2019).
[59] Y.-L. Qiao, J. Liang, V. Koltun, M. C. Lin, Scalable differentiable physics for learning and control, arXiv preprint arXiv:2007.02168 (2020).
[60] C. Schenck, D. Fox, SPNets: Differentiable fluid dynamics for deep neural networks, in: Conference on Robot Learning, PMLR, 2018, pp. 317–335.
[61] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[62] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, A. M. Dollar, Yale-CMU-Berkeley dataset for robotic manipulation research, The International Journal of Robotics Research 36 (2017) 261–268.
[63] K. Mülling, J. Kober, O. Kroemer, J. Peters, Learning to select and generalize striking movements in robot table tennis, International Journal of Robotics Research 32 (2013) 263–279.

A. Appendix

A.1. Differentiable physics simulation

Differentiable physics simulation is a computational tool that utilizes gradient-based techniques for learning and control of physical systems [40]. During the past few years, it has been successfully applied in many areas, such as system identification [41, 42], design optimization [43, 44], and motion optimization [45, 46], as shown in Figure 2, which also emphasizes the wide range of potential applications.
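As a toy illustration of what such engines provide, consider a one-parameter "simulator" with a closed-form solution: an object pushed at speed v0 slides to rest under Coulomb friction. A differentiable engine exposes the gradient of the simulation outcome with respect to its parameters, which makes gradient-based system identification a few lines of code. The model and all names below are our own simplifications, not any of the cited engines.

```python
import numpy as np

# Toy "differentiable simulator": stopping distance d(mu) = v0^2 / (2 * mu * g).
v0, g = 1.5, 9.81

def sim(mu):
    return v0**2 / (2.0 * mu * g)

def grad_sim(mu):
    """Analytic gradient dd/dmu, as a differentiable engine would provide."""
    return -v0**2 / (2.0 * mu**2 * g)

# Gradient-based system identification: fit mu to an observed stopping distance.
d_obs = sim(0.4)                          # pretend 0.4 is the unknown true friction
mu = 0.8                                  # initial guess
for _ in range(200):
    residual = sim(mu) - d_obs
    mu -= 0.5 * residual * grad_sim(mu)   # gradient step on 0.5 * residual**2

print(mu)  # converges towards the true coefficient 0.4
```

In a real engine the gradient comes from automatic differentiation through the time-stepped dynamics rather than from a closed form, but the optimization loop is the same.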
As mentioned in [47], leveraging recent developments in automatic differentiation techniques and libraries [48, 49, 19, 50, 51], various differentiable physics engines have been proposed to address control and parameter estimation for rigid bodies [47, 52, 53, 54, 55, 13, 56, 57], as listed and categorized in Table 1, and for non-rigid bodies such as cloth [58, 59] and fluids [60].

[Figure 2: Applications of differentiable physics, spanning graphics (pose estimation, object classification), body dynamics, contact dynamics, design optimization (robot shape, robot kinematics, gearbox optimization), system identification (robot inertias, object inertias, motor dynamics, object friction), and motion optimization (trajectory optimization, deep reinforcement learning, dynamic movement primitives).]

Table 1: Differentiable engines for articulated rigid bodies, extended from [47]

Engine                 | Contacts     | State   | Collisions | Gradients | Language     | URDF
MuJoCo [8]             | custom       | reduced | complete   | finite    | C++          | support
Degrave et al. [53]    | impulsive    | maximal | primitives | auto      | Theano       | no
DiffTaichi [54]        | impulsive    | maximal | primitives | auto      | Taichi       | no
TinyDiffSim [13]       | iterated LCP | reduced | primitives | auto      | C++          | support
De A.B.-P. et al. [52] | direct LCP   | maximal | primitives | symbolic  | PyTorch      | no
Geilinger et al. [56]  | custom       | reduced | primitives | symbolic  | not released | unknown
Nimble [47]            | direct LCP   | reduced | complete   | symbolic  | DART         | support
DiffREDMax [57]        | custom       | reduced | primitives | symbolic  | C++          | support
Brax v1 [55]           | impulsive    | maximal | primitives | auto      | Python (JAX) | support
Brax v2 [7]            | custom       | reduced | primitives | auto      | Python (JAX) | support

A.2. Inverse rendering with MCMC

The initial step of our framework is to obtain a graph-based scene representation using priors for the scene parameters (see Figure 1) as well as a single image to obtain a posterior distribution.
Here, a scene with a single object 𝒪 (see Figure 3a) is characterized by a transformation 𝒯 consisting of translation 𝑡, rotation 𝜃, and scale 𝑠; a stochastic material ℳ consisting of ambient 𝛼, diffuse 𝛽, and color 𝜅; and a shape 𝒫 taking the values sphere, cylinder, and box. Furthermore, it is rendered in an environment 𝒢 (flat ground with lights and a specific camera perspective) matching the current observation 𝒥𝑟. We define a likelihood function based on RGB data considering the pixel-wise disparity as well as image features obtained from VGG16 [61]. The posterior (see Figure 3b) of all parameters is computed using the Rosenbluth-Metropolis-Hastings (RMH) algorithm. We can use this posterior for sampling, as shown in Figure 3c, or as a proto-program to generate similar scenes by excluding parts of the scene graph. The depicted example is representative of our preliminary results, which indicate that the method can also be applied to out-of-distribution samples to find suitable approximations for rendered objects from the YCB dataset [62].

[Figure 3: Inverse rendering with MCMC. The scene is represented as a graph (a) in which an object 𝒪 has properties with parameters (top row) for which a prior is defined. MCMC is applied to compute the posterior for all properties, including the translation and the color, as shown in (b). In (c), the target image, highlighted by a dashed frame in the center, and 8 samples from the posterior are depicted.]

A.3. Scenario complexity levels

Table 2: Scenario complexity levels

level | goal           | reward model components       | simulation parameters   | policy
1     | reach object   | object-gripper-distance (OGD) | kinematics (K)          | 2 waypoints (2WP)
2     | pick object    | OGD + grasp stability (GS)    | K + collisions (C)      | 3WP / sequential DMP
3     | poke object    | OGD + object travel distance  | K + C + object dynamics | DMP with end-velocity
4     | object curling | OGD + GS + final location     | full dynamics           | neural network

(1) To reach the object with a robot arm movement defined by start and end point, a kinematics simulation is sufficient. (2) For picking an object, the gripper's relative pose and collisions with the object are required, allowing a sequential movement representation to solve the task. (3) To poke the object with a desired effect intensity, such as a resulting object displacement, the simulation needs to include the object's inertia and its friction with the ground; a DMP with end-velocity is a suitable behavior representation [63]. (4) A task such as the sport of curling requires a stable grasp and an accurate release to reach the target location. It can be learned with a neural network policy once the dynamics parameters are sufficiently optimized, using the policy parameters learned for the previous level as initialization.

A.4. Incomplete physical modelling

From our simulation, physical aspects can be reduced, which means, e.g., in the case of collisions, that some objects' collisions are excluded from the computation. When physical aspects such as friction cannot be excluded entirely from the manipulation scenario, we make use of the parameter uncertainty and provide feedback to learning and optimization algorithms considering sample-specific, optimal values. For example, a curling behavior may be directed at the correct target but, due to an inadequate friction model, receive a low reward. By temporarily setting the friction coefficients to the values of the parameter distribution for which the reward is maximized, the agent can focus on improving the direction of the curling behavior first. Conversely, in the estimation of model parameters, this allows MCMC to fit a subset of parameters while the excluded parameters can take any sample-specific value.
Once reducing the uncertainty of the simulation parameters no longer significantly influences the task learning progress, the set of physical aspects modeled in our simulation is extended by adding the aspect with the highest gradient w.r.t. the reward.
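The sample-specific, optimistic evaluation described above can be sketched as follows. The toy release model, the reward decomposition into a direction term and a stopping-distance term, and all numbers are invented for illustration; only the principle (scoring a behavior under the parameter sample most favorable to it) is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(4)

# Posterior samples of an uncertain friction coefficient between stone and ground.
mu_samples = rng.normal(0.05, 0.01, size=256).clip(0.005, None)

def reward(direction, release_speed, mu):
    """Toy curling reward: direction error plus stopping-distance error."""
    stop_dist = release_speed**2 / (2.0 * mu * 9.81)
    return -abs(direction - 0.3) - abs(stop_dist - 10.0)

def optimistic_reward(direction, release_speed, mu_samples):
    """Score a behavior under the friction sample most favorable to it, so an
    inadequate friction model does not mask progress on the direction."""
    return max(reward(direction, release_speed, mu) for mu in mu_samples)

# A behavior with the correct direction scores well despite friction uncertainty...
print(optimistic_reward(0.3, 3.0, mu_samples))
# ...whereas scoring it under one wrong point estimate penalizes it heavily.
print(reward(0.3, 3.0, 0.03))
```

Under the optimistic score, the agent can first converge on the direction component; as the friction posterior narrows, the optimistic and point-estimate rewards coincide.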