Developing safe and explainable autonomous agents: from simulation to the real world

Federico Bianchi, Alberto Castellini, Alessandro Farinelli, Luca Marzari, Daniele Meli (daniele.meli@univr.it), Francesco Trotti, Celeste Veronese

University of Verona, strada Le Grazie 15, 37135 Verona, Italy

ISSN 1613-0073

Keywords: Safe Reinforcement Learning, Formal verification of neural networks, Neurosymbolic AI, Planning under uncertainty

Responsible artificial intelligence is the next challenge of research to foster the deployment of autonomous systems in the real world. In this paper, we focus on safe and explainable design and deployment of autonomous agents, e.g., robots. In particular, we present our recent contributions to: i) safe and explainable planning, leveraging on safe Reinforcement Learning (RL) and neurosymbolic planning; ii) effective deployment of RL policies via model-based control; iii) formal verification of the safety of deep RL policies; and iv) explainable anomaly detection of complex real systems.

Introduction

Artificial Intelligence (AI) and robotics are pervading everyday activities, from industrial automation [1] to environmental monitoring [2]. As increasingly sophisticated autonomous cognitive systems interact with humans in complex scenarios, the development of responsible AI solutions [3] becomes a fundamental design requirement, as also prescribed by the latest international regulations (see footnote 1). Responsible AI involves several aspects, including safety, transparency and trustability [4]. Safety concerns providing guarantees about the behavior of AI systems, e.g., autonomous robotic systems, in terms of performance and potential harm to the surrounding environment or humans. Transparency and trustability relate to how humans interacting with the AI system perceive it, e.g., the explainability of the system's behaviour and its compliance with human expectations from a moral or rational perspective [5].

In this paper, we summarize our main contributions in the field of responsible AI. We focus on autonomous agents, e.g., robots, and present our approach to responsible autonomy at different developmental stages. We first describe our solutions for safe and explainable planning in autonomous agents, via safe Reinforcement Learning (RL) and neurosymbolic approaches. We also analyze the problem of safe and compliant transfer of a planned policy to a physical robotic system, combining RL with model-based control. We then investigate how to provide formal guarantees of safety for black-box policies, e.g., from deep RL, via formal verification. Finally, we present solutions for efficient and explainable anomaly detection in autonomous systems.

Footnote 1: European AI Act.

Safe and explainable planning

We assume the autonomous agent and the environment are represented as a Markov Decision Process (MDP) 𝑀 = ⟨𝑆, 𝐴, 𝑇, 𝑅⟩, whose elements are, respectively, the state space, the action space, the transition map, and the reward map. In this setting, we present two approaches to safe and explainable planning. The first approach is based on Safe Policy Improvement (SPI) [6] and Monte Carlo Tree Search (MCTS) [7], which performs simulations in a model of the real environment to estimate the optimal policy online. The second approach combines MCTS with symbolic and logical reasoning, to guide the exploration of the RL agent towards better pathways.

Safe Policy Improvement with MCTS

Safe RL [9] investigates how to learn policies that maximize the performance of the agent while respecting safety constraints during learning. One popular approach is Safe Policy Improvement with Baseline Bootstrapping (SPIBB) [10]. SPIBB starts from a baseline policy 𝜋0 (e.g., a sub-optimal expert-designed policy). The algorithm then collects a batch dataset of trajectories (i.e., sequences of state-action pairs) and computes an improved policy 𝜋𝐼, falling back on the baseline policy for state-action pairs that occur rarely in the dataset. However, it does not scale to large state and action spaces.
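
The bootstrapping idea can be illustrated with a minimal sketch of one greedy improvement step, in the spirit of the Π_b variant of SPIBB described in [10]. The function name, the array layout and the count threshold `n_min` are assumptions made for illustration; the action values `q` are assumed to have been estimated beforehand from the batch dataset.

```python
import numpy as np


def spibb_greedy_step(q, pi_b, counts, n_min):
    """One baseline-bootstrapped greedy improvement step (illustrative sketch).

    q, pi_b and counts are arrays of shape (n_states, n_actions):
    estimated action values, baseline policy probabilities, and the
    number of times each state-action pair occurs in the batch.
    """
    q = np.asarray(q, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    counts = np.asarray(counts)
    pi = np.zeros_like(pi_b)
    for s in range(q.shape[0]):
        rare = counts[s] < n_min
        # Rarely observed pairs keep the baseline probability.
        pi[s, rare] = pi_b[s, rare]
        # The remaining probability mass goes to the best well-observed action.
        free_mass = pi_b[s, ~rare].sum()
        trusted = np.flatnonzero(~rare)
        if trusted.size > 0:
            best = trusted[np.argmax(q[s, trusted])]
            pi[s, best] += free_mass
    return pi
```

When every action of a state is rarely observed, the returned row coincides with the baseline, so the step never moves probability mass onto pairs the dataset says little about.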

To improve scalability, we recently introduced Monte Carlo Tree Search Safe Policy Improvement with Baseline Bootstrapping (MCTS-SPIBB) [8]. The algorithm exploits MCTS to estimate the improved policy 𝜋𝐼 online, hence it can scale to large domains while keeping the asymptotic convergence guarantees of SPIBB [8]. In [8] we compared MCTS-SPIBB with several state-of-the-art SPI algorithms on benchmark domains (see Figure 1.a). Furthermore, we showed that on very large state spaces, such as the standard SysAdmin benchmark (see footnote 2) with up to 35 machines, MCTS-SPIBB is the only SPI algorithm capable of computing improved policies (see Figure 1.b).

Planning with logics in MCTS

MCTS may require a large number of online simulations when the state and action spaces are large. This becomes even more critical in Partially Observable MDPs (POMDPs), where part of the state is uncertain, hence a particle filter must be used to sample and estimate the actual state of the system, starting from a probability distribution called the belief. Recent online solvers for POMDPs, e.g., Partially Observable Monte Carlo Planning (POMCP) [11] and Determinized Sparse Partially Observable Trees (DESPOT) [12], require the definition of task-specific policy heuristics in order to efficiently bias the exploration towards the most promising policies. Moreover, it is essential to guarantee that only safe policies are explored.
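
As a rough illustration of how a particle-based belief can be updated, the sketch below uses rejection sampling against a generative model, in the style of particle filters used by online POMDP solvers such as POMCP [11]. The names `update_belief` and `toy_simulate`, the two-valued hidden state and the 0.9 sensor accuracy are hypothetical and chosen only to keep the example small.

```python
import random


def update_belief(particles, action, observation, simulate, n_particles, rng,
                  max_tries=100_000):
    """Rejection-sampling belief update over a set of state particles."""
    new_particles = []
    tries = 0
    while len(new_particles) < n_particles and tries < max_tries:
        tries += 1
        state = rng.choice(particles)
        next_state, simulated_obs = simulate(state, action, rng)
        # Keep only successor states consistent with the real observation.
        if simulated_obs == observation:
            new_particles.append(next_state)
    return new_particles


def toy_simulate(state, action, rng):
    """Hypothetical generative model: a noisy check of a hidden rock value."""
    other = "bad" if state == "good" else "good"
    obs = state if rng.random() < 0.9 else other
    return state, obs  # checking the rock does not change its value
```

Starting from a uniform belief over "good" and "bad", observing "good" once shifts most particles to "good", which is the behaviour expected from a Bayesian update with a 90% accurate sensor.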

To this aim, in [13] we proposed an approach based on maximum satisfiability modulo theory [14] to probabilistically verify the adherence of the policy computed by POMCP to a set of user-defined specifications, expressed in a fragment of first-order logic. In this way, we can shield undesired actions in MCTS simulations, and increase the explainability of the generated policy thanks to the logic formalism. However, defining the logical policy specifications may be tedious and error-prone in realistic complex domains. For this reason, in [15,16] we proposed an approach based on inductive logic programming [17] to learn logical policy heuristics from trajectories (belief-action pairs) of POMDP executions collected offline. Specifically, given a set of task-related concepts 𝐹 provided by the user to describe the belief space, offline trajectories are converted to a logical formalism, where logical predicates encode the concepts in 𝐹. As an example, consider the paradigmatic POMDP rocksample scenario depicted in Figure 2a, where a robotic agent must collect valuable rocks (green dots) while avoiding worthless ones (red dots) in a grid world. The state of the POMDP includes the positions of the agent and the rocks, and the probability (belief) that each rock is valuable. The state can be translated to a logical representation in terms of the following concepts in 𝐹: dist(R,D), the Manhattan distance D between the agent and each rock R, and guess(R,P), the probability P that rock R is valuable. Defining semantic concepts about the domain is easier than directly defining policy specifications, since it simply involves a re-interpretation of the state formalization.

Footnote 2: SysAdmin: https://jair.org/index.php/jair/article/view/10341/24723
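
The translation from a rocksample belief to logical facts can be sketched as follows. Only the predicate names dist(R,D) and guess(R,P) come from the text; the function name, the integer rock identifiers and the choice of expressing probabilities as integer percentages rounded to a fixed step (a common workaround, since ASP terms are typically integers or symbols) are assumptions for illustration.

```python
def belief_to_facts(agent_pos, rocks, belief, step=10):
    """Convert a rocksample belief into ASP-style facts (illustrative sketch).

    agent_pos: (x, y) grid cell of the agent.
    rocks: dict mapping an integer rock id to its (x, y) grid cell.
    belief: dict mapping a rock id to the probability that it is valuable.
    step: granularity, in percentage points, used to discretize probabilities.
    """
    facts = []
    for rock_id in sorted(rocks):
        rx, ry = rocks[rock_id]
        distance = abs(agent_pos[0] - rx) + abs(agent_pos[1] - ry)
        percent = int(round(belief[rock_id] * 100 / step)) * step
        facts.append(f"dist({rock_id},{distance}).")
        facts.append(f"guess({rock_id},{percent}).")
    return facts
```

For instance, an agent at (0, 0) with a single rock at (2, 3) believed valuable with probability 0.87 yields the facts `dist(1,5).` and `guess(1,90).`.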

We preliminarily learn policy specifications from trajectories collected from a rocksample agent operating in a 12×12 grid with 4 rocks. We adopt the logical formalism of Answer Set Programming (ASP) [18], which represents the state of the art for planning in first-order logic [19]. Our approach requires relatively few training trajectories (fewer than 800 in rocksample) to learn interpretable and transparent policy specifications. Moreover, the learned heuristics allow POMCP to use significantly fewer online simulations per execution step (Figure 2b), achieving performance comparable to that of expert-designed specifications (pref). Finally, the heuristics generalize to unseen problem instances, e.g., enhancing scalability to larger grid sizes (Figure 2c), which require a longer planning horizon and are typically challenging for MCTS-based solvers. In [20], we also showed that this approach can be used to derive policy explanations of black-box model-free RL agents, in the context of autonomous driving.

Safe deployment in the real world

The policy computed by an RL-based planner, e.g., POMCP for POMDPs, cannot always be effectively and safely deployed on a real robotic system. Indeed, MCTS-based planners perform online simulations based on a model of the environment, but the chosen policy must be adapted to the inevitable unmodeled inaccuracies and non-linearities of the physical plant. To overcome this problem, in [21] we implemented the two-layer architecture depicted in Figure 3, combining a high-level controller based on POMCP with a low-level model-based controller. The low-level controller is designed using the inverse dynamics technique [22,23], which linearizes the system via feedback. In particular, the feedback law is expressed in terms of an auxiliary control signal 𝑣: the low-level controller takes 𝑣 as its reference value and computes the command 𝑢 from it. The high-level controller is formalized as a POMDP that exploits the linearized closed-loop model to select the best local action 𝑢 for the agents. In particular, POMCP provides the sub-optimal reference values for the low-level controller, optimizing user-defined objectives encoded in the reward function. Note that the two layers run at different control-loop sample rates: the low level must be fast, since it provides the commands to the agent, while the high level can be slower, since it only generates the reference values for the low level.
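
To make the two-layer idea concrete, the following sketch applies feedback linearization to a generic scalar system ẋ = f(x) + g(x)u, which is a textbook form and not the drone model of [21]. The functions `f` and `g`, the proportional rule standing in for the high-level planner, and all numeric parameters are assumptions chosen for illustration.

```python
import math


def f(x):
    """Hypothetical nonlinear drift term."""
    return -math.sin(x)


def g(x):
    """Hypothetical input gain, always >= 1 so it can be inverted."""
    return 2.0 + math.cos(x)


def low_level_command(x, v):
    """Inverse-dynamics law: choose u so that the closed loop is x' = v."""
    return (v - f(x)) / g(x)


def simulate(x0, target, dt=0.01, ratio=10, steps=2000, gain=1.0):
    """Run a fast low-level loop with a high level updated every `ratio` steps."""
    x, v = x0, 0.0
    for k in range(steps):
        if k % ratio == 0:
            # Slow layer: a simple proportional rule stands in for the planner.
            v = gain * (target - x)
        # Fast layer: cancel the nonlinearity and track the reference v.
        u = low_level_command(x, v)
        x += dt * (f(x) + g(x) * u)
    return x
```

Because the low-level law cancels the nonlinear terms, the slow layer only has to reason about the linear model ẋ = v, which is what allows it to run at a lower rate.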

The two-layer approach is tested in a scenario where an aerial drone has to reach a target area while avoiding no-fly zones and minimizing fuel consumption or attitude error. Accordingly, the reward function is composed of four contributions: an attractive potential component to reach the target, a repulsive component to avoid the no-fly zones, the fuel consumption, and the heading error. The last two components are weighted to prioritize between the two objectives. Figure 4 shows the trajectory followed by the drone when optimizing only the fuel consumption (black line), both fuel and attitude (red dashed line), and only the attitude error (green dotted line). The black line follows the shortest path to minimize fuel. The red line also follows the shortest path, but near the target position the attitude error component grows, so the drone aligns with the desired attitude (black arrow). The green line follows the optimal path to minimize only the attitude error.
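
A reward with the four contributions listed above could be structured as in the sketch below. The functional forms (negative Euclidean distance as attractive term, a linear penalty inside a safety margin as repulsive term), the weights `w_fuel` and `w_heading`, and the circular no-fly zones are assumptions for illustration and are not taken from [21].

```python
import math


def reward(pos, heading, fuel_used, target, desired_heading, no_fly_zones,
           w_fuel=1.0, w_heading=1.0, safety_margin=1.0):
    """Four-term reward sketch: attraction, repulsion, fuel, heading error.

    no_fly_zones is a list of ((x, y), radius) circles.
    """
    # Attractive potential: closer to the target is better.
    attractive = -math.dist(pos, target)
    # Repulsive potential: penalize being within the margin of any zone.
    repulsive = 0.0
    for center, radius in no_fly_zones:
        gap = math.dist(pos, center) - radius  # negative inside the zone
        repulsive -= max(0.0, safety_margin - gap)
    # Heading error wrapped to [0, pi].
    diff = heading - desired_heading
    heading_error = abs(math.atan2(math.sin(diff), math.cos(diff)))
    return attractive + repulsive - w_fuel * fuel_used - w_heading * heading_error
```

Setting `w_heading=0` or `w_fuel=0` mimics the single-objective cases discussed in the text, while intermediate weights trade the two objectives off against each other.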

Formal verification of deep RL

Trained RL policies, especially model-free RL policies encoded in a Deep Neural Network (DNN), are not guaranteed to meet the safety standards required in the real world. For instance, DNNs are vulnerable to so-called adversarial inputs, i.e., minimal input variations that fool the system into outputting an undesired value (or action) [24]. Consequently, in recent years, Formal Verification (FV) of DNNs (aka DNN-Verification) has been developed to provide formal guarantees on the behavior of these systems [25]. In particular, given a predefined safety property, the goal of DNN-Verification is to determine whether at least one input configuration exists that violates the property. However, given the non-convex and non-linear nature of DNNs, verifying safety properties in the worst case has been shown to be an NP-complete problem [26]. Moreover, the standard binary response of DNN-Verification (safe vs. unsafe) does not provide sufficient information to compare the safety of different DNNs. To overcome these limitations, in [27] we proposed a novel quantitative formulation of the DNN-Verification problem, which allows enumerating all unsafe regions for a given domain of interest and thus ranking the models by the portion of unsafe regions they may have. However, we showed that this problem turns out to be #P-hard. Hence, in [28] we proposed 𝜖-ProVe. Exploiting a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits [29], the algorithm provides a tight lower estimate of the (un)safe areas, with provable probabilistic guarantees.
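
The statistical tool behind tolerance limits [29] can be illustrated with the classical one-sided result: the maximum of n independent samples exceeds at least a fraction R of the population with probability 1 − Rⁿ. The sketch below computes the smallest n for a requested coverage and confidence. It only illustrates this textbook formula and is not an implementation of 𝜖-ProVe; the function name is an assumption.

```python
import math


def wilks_sample_size(coverage, confidence):
    """Smallest n such that 1 - coverage**n >= confidence (one-sided limit)."""
    if not (0.0 < coverage < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("coverage and confidence must lie strictly in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(coverage))
```

For example, `wilks_sample_size(0.95, 0.95)` returns 59, the familiar "95/95" sample size, which shows why such sampling-based bounds can be obtained with a modest number of network evaluations.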

We validated DNN-Verification in realistic robotic safety-critical scenarios. In particular, in [30], we showed that DNN-Verification can be used to rank different successful DNN models according to their level of safety, verifying collision avoidance in robotic mapless navigation. We then applied a similar pipeline in a more safety-critical domain, namely autonomous colonoscopy navigation for colorectal detection with deep RL [31] (Figure 5). In particular, we trained an agent to navigate the endoscope in patient-specific colon models based on endoscopic images, using Constrained RL (CRL) to impose a safety cost whenever the agent touches the colon walls at the training stage. Nevertheless, due to the Lagrangian relaxation implemented by CRL to perform constrained optimization, safety may not be guaranteed. Hence, we adopted a model selection strategy that harnesses FV to evaluate the safety of a vast pool of trained policies and select the one that meets all the specified behavioral preferences. Table 1 reports the results of our study over 300 trained models, among which we found 3 completely safe models that provably meet the safety requirements.
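
The model selection step can be summarized by a small filter over verification outcomes, following the convention of Table 1 that SAT means a violation of the property was found. The data layout, the model identifiers and the property names below are hypothetical.

```python
def select_safe_models(results):
    """Return the models for which no property violation was found.

    results maps a model id to a dict {property_name: is_sat}, where
    is_sat is True when the verifier found an input violating the property.
    """
    return [model for model, props in results.items()
            if not any(props.values())]


# Hypothetical verification outcomes for three trained policies.
example_results = {
    "policy_a": {"up": True, "down": False, "left": False, "right": False},
    "policy_b": {"up": False, "down": False, "left": False, "right": False},
    "policy_c": {"up": False, "down": True, "left": True, "right": False},
}
```

With these example outcomes only `policy_b` is selected, since it is the only model with no SAT result on any of the four directional properties.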

Finally, to address the necessity of running the FV process only after training due to its computational complexity, in [32] we proposed an unconstrained DRL framework that leverages a novel sample-based method to approximate local violations of input-output conditions to foster the learning of safer behaviors inside the training loop. However, such conditions are typically hard-coded and require task-level knowledge, making their application intractable in challenging safety-critical tasks. To this end, in [33], we introduced the Collection and Refinement of Online Properties (CROP) framework to collect and refine safety properties during training. The combination of CROP with approximate violation inside the training loop allowed us to obtain a more robust approach with respect to other existing Safe DRL methodologies in the context of autonomous navigation, promoting safer behaviors while maintaining similar or better returns.
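
A sample-based approximation of property violations can be sketched as a Monte Carlo estimate of the fraction of inputs, within a box-shaped region, whose output is unsafe. This is a generic illustration of the idea rather than the method of [32]; the function names, the toy policy and the unsafe-output test are assumptions.

```python
import random


def approximate_violation(policy, low, high, is_unsafe, n_samples, rng):
    """Estimate the fraction of inputs in a box that yield an unsafe output."""
    violations = 0
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in zip(low, high)]
        if is_unsafe(policy(x)):
            violations += 1
    return violations / n_samples


def toy_policy(x):
    """Hypothetical stand-in for a trained network with a scalar output."""
    return x[0] - x[1]


def toy_is_unsafe(output):
    """Hypothetical safety condition: positive outputs are unsafe."""
    return output > 0.0
```

An estimate of this kind is cheap enough to be recomputed during training, for instance as a penalty term, whereas exact verification is usually reserved for the trained model.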

Explainable and data-efficient anomaly detection

Autonomous systems operating in the real world are required to work reliably over long periods of time (Long Term Autonomy, LTA) under changing and unpredictable environmental conditions. In this context, anomaly detection is crucial to promptly identify situations that diverge from the desired behaviour. Specifically, unsupervised anomaly detection aims to identify anomalies related to the global behavior of the system [34,35,36], monitoring multivariate time series generated from sensors and actuators and relying only on knowledge of the nominal (i.e., anomaly-free) behavior.

We recently proposed two contributions in this area, namely an online approach for detecting anomalous behaviors of robotic systems involved in complex LTA scenarios (HHAD) [37], and an adversarial data augmentation and retraining approach (HHAD-AUG) [38]. In HHAD [37], we use Hidden Markov Models (HMMs) to represent the nominal behavior of a robot. We then evaluate online the dissimilarity between the probability distribution of multivariate sensor time series in a sliding window and the emission probability of the related HMM hidden states. We adopt the Hellinger distance [39] as a distance measure since it is bounded (thus it lends itself to simpler interpretation and thresholding) and it is less noisy, hence more informative and discriminative.
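
The boundedness of the Hellinger distance can be seen from its standard closed forms, sketched below for discrete distributions and for univariate Gaussians. HHAD works with multivariate sensor data, so the univariate case is a simplification for illustration, and the function names are assumptions.

```python
import math


def hellinger_discrete(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coeff.
    return math.sqrt(max(0.0, 1.0 - bc))


def hellinger_gaussian(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between two univariate Gaussian distributions."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    bc = (math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
          * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum)))
    return math.sqrt(max(0.0, 1.0 - bc))
```

Both functions return 0 for identical distributions and approach 1 as the distributions stop overlapping, so a single threshold in [0, 1] can be used regardless of the scale of the monitored signals.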

In HHAD-AUG [38], we address the usual lack (or paucity) of anomalous examples and the noise that characterizes time series of real systems. We propose a data augmentation method based on perturbed (adversarial) time series [40], having the advantage of not requiring any prior knowledge about the application domain and data conformation. We generate adversarial examples only for nominal points, optimizing a loss function based on the Hellinger distance between the observed and the expected data distributions.

We evaluate our data augmentation and re-training approach on several public datasets, plus one collected from our aquatic drones developed in the EU H2020 project INTCATCH [41]. Results show that (i) the adversarial generation algorithms can generate meaningful adversarial examples for HHAD, which can be used to significantly improve its performance; (ii) our data augmentation method yields higher performance than state-of-the-art augmentation methods; [...] generated considering standard log-likelihood; (v) the low computational complexity and high parallelizability of the proposed method allow for fast data augmentation and retraining of HHAD. Figure 6 shows the results on the INTCATCH dataset [41].

Finally, we have recently addressed the problem of explainable anomaly detection, in order to provide useful information about the source of the anomaly for easier repair. To this aim, in [42] we showed that causal discovery based on Conditional Mutual Information (CMI) between time series can achieve higher performance than standard deep learning anomaly detectors on a benchmark robotic dataset of the Pepper service robot (see footnote 3). Our methodology evaluates the variation of CMI between time series, thus providing a useful hint to the root cause of the anomaly. Moreover, it builds a nominal model of the real physical relations between variables of the system, thus resulting in higher robustness and more accurate anomaly detection compared to DNN methods (95% vs 90% F1-score and 100% precision).

Conclusion and future works

Our methodologies aim at increasing transparency and safety at different development levels, from planning to execution and verification. Our current research direction includes the online integration of symbolic learning and formal verification approaches into RL, focusing on the current scalability issues.

Figure 1: Safe Policy Improvement: a) Comparison of performance among SPI algorithms; b) Scalability comparison between MCTS-SPIBB and SPIBB [8].

Figure 2: a) Rocksample setup; b) Results of [16] with few simulations and c) on larger grids.

Figure 4: Drone paths. The black and blue arrows are, respectively, the desired yaw angle and the drone's initial yaw angle.

Figure 6: Average F1-score for the original detector HHAD and augmented detectors [38]: H-AUG (ours, based on Hellinger distance), L-AUG (ours, based on log-likelihood), R-AUG (random-based baseline), D-AUG (drift-based baseline), G-AUG (Gaussian-based baseline), and S-AUG (SMOTE-based baseline) on different training set sizes in the INTCATCH dataset. Averages are computed over 30 datasets, for each dataset size.

Table 1: Results of model selection. SAT indicates property violation. The Θ's denote the safety property not to touch the colon wall in any cardinal direction.

Method   Θ↑ (SAT)   Θ↓ (SAT)   Θ← (SAT)   Θ→ (SAT)   Safe models (FV selection)
PPO      300        246        80         167        0
L-PPO    221        198        53         161        3

Footnote 3: https://sites.google.com/diag.uniroma1.it/robsec-data

References

[1] Z. Jan, F. Ahamed, W. Mayer, N. Patel, G. Grossmann, M. Stumptner, A. Kuusk. Artificial intelligence for industry 4.0: Systematic review of applications, challenges, and opportunities. Expert Systems with Applications 216, 119456, 2023.
[2] M. Zuccotto, A. Castellini, D. La Torre, L. Mola, A. Farinelli. Reinforcement learning applications in environmental sustainability: a review. Artificial Intelligence Review 57, 88, 2024.
[3] B. Shneiderman. Responsible AI: Bridging from ethics to practice. Communications of the ACM 64, 2021.
[4] P. Mikalef, K. Conboy, J. E. Lundström, A. Popovič. Thinking responsibly about responsible AI and 'the dark side' of AI. 2022.
[5] I. Gabriel. Artificial intelligence, values, and alignment. Minds and Machines 30, 2020.
[6] P. Scholl, F. Dietrich, C. Otte, S. Udluft. Safe policy improvement approaches and their limitations. International Conference on Agents and Artificial Intelligence, Springer, 2022.
[7] M. Świechowski, K. Godlewski, B. Sawicki, J. Mańdziuk. Monte Carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review 56, 2023.
[8] A. Castellini, F. Bianchi, E. Zorzi, T. D. Simão, A. Farinelli, M. T. J. Spaan. Scalable safe policy improvement via Monte Carlo tree search. Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR, 2023.
[9] R. Sutton, A. Barto. Reinforcement learning: An introduction, 2nd ed. MIT Press, 2018.
[10] R. Laroche, P. Trichelair, R. Tachet des Combes. Safe policy improvement with baseline bootstrapping. Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR, 2019.
[11] D. Silver, J. Veness. Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems 23, 2010.
[12] N. Ye, A. Somani, D. Hsu, W. S. Lee. DESPOT: Online POMDP planning with regularization. Journal of Artificial Intelligence Research 58, 2017.
[13] G. Mazzi, A. Castellini, A. Farinelli. Risk-aware shielding of Partially Observable Monte Carlo Planning policies. Artificial Intelligence 324, 2023.
[14] C. Barrett, R. Sebastiani, S. A. Seshia, C. Tinelli. Satisfiability modulo theories. 2021.
[15] G. Mazzi, D. Meli, A. Castellini, A. Farinelli. Learning logic specifications for soft policy guidance in POMCP. Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '23, IFAAMAS, 2023.
[16] D. Meli, A. Castellini, A. Farinelli. Learning logic specifications for policy guidance in POMDPs: an inductive logic programming approach. Journal of Artificial Intelligence Research 79, 2024.
[17] S. Muggleton, L. De Raedt. Inductive logic programming: Theory and methods. The Journal of Logic Programming 19, 1994.
[18] S. C. Tran, E. Pontelli, M. Balduccini, T. Schaub. Answer set planning: a survey. Theory and Practice of Logic Programming 23, 2023.
[19] D. Meli, H. Nakawala, P. Fiorini. Logic programming for deliberative robotic task planning. Artificial Intelligence Review 56, 2023.
[20] C. Veronese, D. Meli, F. Bistaffa, M. Rodríguez-Soto, A. Farinelli, J. A. Rodríguez-Aguilar. Inductive logic programming for transparent alignment with multiple moral values. CEUR Workshop Proceedings, 2023.
[21] F. Trotti, A. Farinelli, R. Muradore. An online path planner based on POMDP for UAVs. 2023 European Control Conference (ECC), IEEE, 2023.
[22] A. Isidori. Nonlinear control systems II. Springer, 2013.
[23] H. Khalil. Nonlinear Systems. Prentice Hall, 2002.
[24] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
[25] C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, M. J. Kochenderfer. Algorithms for verifying deep neural networks. Foundations and Trends® in Optimization 4, 2021.
[26] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. International Conference on Computer Aided Verification, Springer, 2017.
[27] L. Marzari, D. Corsi, F. Cicalese, A. Farinelli. The #DNN-Verification Problem: Counting Unsafe Inputs for Deep Neural Networks. International Joint Conference on Artificial Intelligence (IJCAI), 2023.
[28] L. Marzari, D. Corsi, E. Marchesini, A. Farinelli, F. Cicalese. Enumerating safe regions in deep neural networks with provable probabilistic guarantees. Proceedings of the AAAI Conference on Artificial Intelligence 38, 2024.
[29] S. S. Wilks. Statistical prediction with special reference to the problem of tolerance limits. The Annals of Mathematical Statistics 13, 1942.
[30] G. Amir, D. Corsi, R. Yerushalmi, L. Marzari, D. Harel, A. Farinelli, G. Katz. Verifying learning-based robotic navigation systems. 29th International Conference TACAS, Springer, 2023.
[31] D. Corsi, L. Marzari, A. Pore, A. Farinelli, A. Casals, P. Fiorini, D. Dall'Alba. Constrained reinforcement learning and formal verification for safe colonoscopy navigation. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2023.
[32] E. Marchesini, L. Marzari, A. Farinelli, C. Amato. Safe deep reinforcement learning by verifying task-level properties. AAMAS '23, International Foundation for Autonomous Agents and Multiagent Systems, 2023.
[33] L. Marzari, E. Marchesini, A. Farinelli. Online safety property collection and refinement for safe deep reinforcement learning in mapless navigation. IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023.
[34] A. Castellini, M. Bicego, F. Masillo, M. Zuccotto, A. Farinelli. Time series segmentation for state-model generation of autonomous aquatic drones: A systematic framework. Engineering Applications of Artificial Intelligence 90, 2020.
[35] A. Castellini, M. Bicego, D. Bloisi, J. Blum, F. Masillo, S. Peignier, A. Farinelli. Subspace clustering for situation assessment in aquatic drones: a sensitivity analysis for state-model improvement. Cybernetics and Systems 50, 2019.
[36] A. Castellini, F. Masillo, M. Bicego, D. Bloisi, J. Blum, A. Farinelli. Subspace clustering for situation assessment in aquatic drones. Proc. 33th ACM/SIGAPP Symposium on Applied Computing (SAC), 2019.
[37] D. Azzalini, A. Castellini, M. Luperto, A. Farinelli, F. Amigoni. HMMs for anomaly detection in autonomous robots. Proc. AAMAS, 2020.
[38] A. Castellini, F. Masillo, D. Azzalini, F. Amigoni, A. Farinelli. Adversarial data augmentation for HMM-based anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 2023.
[39] E. Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik 136, 1909.
[40] F. Karim, S. Majumdar, H. Darabi. Adversarial attacks on time series. IEEE Trans Pattern Anal Mach Intell 43, 2020.
[41] A. Castellini, D. Bloisi, J. Blum, F. Masillo, A. Farinelli. Multivariate sensor signals collected by aquatic drones involved in water monitoring: A complete dataset. Data Brief 30, 105436, 2020.
[42] D. Meli. Explainable online unsupervised anomaly detection for cyber-physical systems via causal discovery from time series. arXiv:2404.09871, 2024.