
International Workshop on Emerging Ethical Aspects of AI @ AIxIA

1613-0073

Inductive Logic Programming For Transparent Alignment With Multiple Moral Values

Celeste Veronese

celeste.veronese@univr.it 0 2

Daniele Meli

daniele.meli@univr.it 0 2

Filippo Bistaffa

filippo.bistaffa@iiia.csic.es 1 2

Manel Rodríguez-Soto

manel.rodriguez@iiia.csic.es 1 2

Alessandro Farinelli

alessandro.farinelli@univr.it 0 2

Juan A. Rodríguez-Aguilar

1 2

0 Department of Computer Science, University of Verona, Verona, 37134, Italy
1 IIIA-CSIC, Campus UAB, 08913 Bellaterra, Spain

Keywords: Inductive Logic Programming, Answer Set Programming, Explainable AI, Ethical Decision Making

2023

Reinforcement learning (RL) is a key paradigm for developing intelligent agents that operate in complex environments and interact with humans. However, researchers need to explain and interpret the decisions of these systems, especially when it comes to ensuring their alignment with societal value systems. This paper takes a first step in an ongoing research direction by applying an inductive logic programming methodology to explain the policy learned by an RL algorithm in the domain of autonomous driving, thus increasing the transparency of the ethical behaviour of agents.

CEUR ceur-ws.org

1. Introduction

As artificial agents become more intelligent and integrated into our society, ensuring that they align with human values is crucial to prevent potential ethical risks in critical areas [1]. Reinforcement Learning (RL) [2] is an effective paradigm for developing intelligent agents that learn to interact with humans in complex environments. However, ensuring that RL agents pursue their own objectives while remaining aligned with human values is still a challenging and under-explored problem, known as the multi-valued RL problem. [3] showed that one way to optimally solve this problem is to embed value signals into a single reward function, which is then maximized in a single-objective RL problem. However, the embedding is highly computationally demanding. In this setting, the aim of this work is to improve the transparency of the multi-valued RL agent's behaviour, allowing a better-informed evaluation of the agent's ethical alignment. Following [4, 5], we use the framework of Inductive Logic Programming (ILP) [6] to generate concise and interpretable explanations of an RL agent's behaviour, starting from traces (state-action pairs) of its execution. In this way, we obtain logical rules that describe the rationale behind the optimal policy. We can then express the learned rules in logic programming paradigms for planning [7, 8], e.g., Answer Set Programming (ASP) [9], and apply them to guide the agent in place of RL. We preliminarily validate our methodology for multi-valued autonomous car driving in simulation, showing that the ASP planner accomplishes the task successfully and ethically, which supports the quality of the learned explanations.

†These authors contributed equally.

2. Background

We now review the fundamentals of ASP and ILP, both of which are required by our methodology.

2.1. Answer Set Programming

In ASP [9], a domain is represented as a collection of logical statements that describe relationships between entities (either actions or environmental features, for planning domains), which are represented as variables and predicates (atoms). A variable is ground once it is assigned a value, and an atom is ground when all of its variables are ground. The logical statements (axioms) considered in this work are causal rules $h \mathrel{\text{:-}} b_1, \ldots, b_n$, which define the body of the rule (i.e. the logical conjunction of terms $\bigwedge_{i=1}^{n} b_i$) as a precondition for the head $h$. In the planning domain, they express preconditions or effects of actions. Given an ASP task description, the solving process involves computing answer sets, that is, the minimal sets of ground atoms satisfying the axioms. The ASP solver starts from an initial grounding of body atoms and deduces the ground heads of rules, i.e., feasible sequences of actions.
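The deduction of ground heads from ground body atoms can be illustrated with a minimal Python sketch. It assumes a negation-free program with already ground rules, in which case the single answer set is the least set of atoms closed under the rules; a real ASP solver also handles grounding, negation as failure and multiple answer sets. The atoms `path_clear` and `ped_far` are hypothetical and only serve the illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# A ground rule is (head, body): the head holds once every body atom holds.
Rule = Tuple[str, FrozenSet[str]]


def deduce(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Return the least set of ground atoms that contains the facts
    and is closed under the (negation-free) rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Fire the rule if its whole body is already derived.
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


# Hypothetical ground rules, for illustration only.
rules = [
    ("path_clear(down)", frozenset({"goal_pos(down,2)", "ped_far(down)"})),
    ("move_fast(down)", frozenset({"path_clear(down)"})),
]
facts = {"goal_pos(down,2)", "ped_far(down)"}
answer_set = deduce(facts, rules)
```

With both facts present, the action atom `move_fast(down)` is derived through the intermediate atom; if `ped_far(down)` is missing, no rule fires and only the initial fact remains.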

2.2. Inductive Logic Programming

An ILP [6] task is defined as a tuple $\mathcal{T} = \langle B, S_M, E \rangle$, consisting of background knowledge $B$ expressed in a logic formalism $\mathcal{F}$, a search space $S_M$ consisting of possible axioms, and a set of examples $E$, both expressed in the syntax of $\mathcal{F}$. The goal is to find a hypothesis $H \subseteq S_M$ covering $E$. In ILASP [10], an implementation of ILP under the ASP semantics, examples are Context-Dependent Partial Interpretations (CDPIs), that is, tuples $\langle e, C \rangle$, where $C$ is a set of atoms called context (for our scope, environmental features) and $e = \langle e^{inc}, e^{exc} \rangle$ is made up of an included set $e^{inc}$ and an excluded set $e^{exc}$, both containing ground atoms (actions, for our scope). The goal of ILASP is to find $H$ such that:

$\forall \langle e, C \rangle \in E : \; B \cup H \cup C \models e^{inc} \;\wedge\; B \cup H \cup C \not\models e^{exc}$

Therefore, given the context, ILASP finds axioms which support the execution of the actions in $e^{inc}$ and do not support the ones in $e^{exc}$. Considering that $H$ could fail to cover all CDPIs, ILASP also returns the number of uncovered CDPIs as a confidence measure.
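The coverage condition and the count of uncovered CDPIs can be sketched in Python as follows. This is a simplified illustration, not the ILASP implementation: the combined effect of background knowledge and hypothesis is abstracted as a hypothetical function from a context to the set of action atoms it supports, and the toy hypothesis and examples are invented for the illustration.

```python
from typing import Callable, FrozenSet, List, NamedTuple, Set


class CDPI(NamedTuple):
    included: FrozenSet[str]  # actions that must be supported
    excluded: FrozenSet[str]  # actions that must not be supported
    context: FrozenSet[str]   # environmental features


# Stand-in for "background knowledge + hypothesis": context -> supported actions.
Hypothesis = Callable[[FrozenSet[str]], Set[str]]


def covers(hypothesis: Hypothesis, example: CDPI) -> bool:
    """An example is covered if every included action is derived
    and no excluded action is derived in its context."""
    derived = hypothesis(example.context)
    return example.included <= derived and not (example.excluded & derived)


def count_uncovered(hypothesis: Hypothesis, examples: List[CDPI]) -> int:
    """Number of examples the hypothesis fails to cover."""
    return sum(not covers(hypothesis, ex) for ex in examples)


def toy_hypothesis(context: FrozenSet[str]) -> Set[str]:
    # Hypothetical rule: move fast downwards only if the pedestrian is 3 cells away.
    return {"move_fast(down)"} if "ped_pos(down,3)" in context else set()


examples = [
    CDPI(frozenset({"move_fast(down)"}), frozenset(), frozenset({"ped_pos(down,3)"})),
    CDPI(frozenset(), frozenset({"move_fast(down)"}), frozenset({"ped_pos(down,1)"})),
    CDPI(frozenset({"move_slow(down)"}), frozenset(), frozenset({"ped_pos(down,1)"})),
]
uncovered = count_uncovered(toy_hypothesis, examples)
```

Here the third example is not covered, because the toy hypothesis never supports `move_slow(down)`, so one CDPI out of three remains uncovered.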

3. Problem definition and case study

This work focuses on the problem of multi-value alignment, wherein moral values influence decision-making with different priorities. We start from a value system $\mathcal{V}$, that is, a tuple $\mathcal{V} = \langle V, \succcurlyeq \rangle$, where $V = \{v_1, \ldots, v_n\}$ stands for a non-empty set of moral values plus the agent's individual objective, and $\succcurlyeq$ is a total order of preference over the elements of $V$. The methodology in [3] considers the ethical knowledge in $\mathcal{V}$ as ethical rewards in a Multi-Objective Markov Decision Process (MOMDP) [11]. Recall that a MOMDP is defined by a tuple $\langle S, A, R, T \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $R : S \times A \times S \to \mathbb{R}^n$ is a reward function that returns a vector of rewards for a given state, action, and next state, and $T : S \times A \times S \to [0, 1]$ is a transition function that specifies the probability of moving to the next state for a given state and action. [3] shows that the MOMDP can be converted to an equivalent single-objective MDP, whose optimal policy is guaranteed to be aligned with $\mathcal{V}$. Consider, for example, the autonomous driving task: the car agent not only has to reach its destination, but must also preserve pedestrians' safety and avoid obstacles. The task can be modelled as a MOMDP (with a reward vector that comprises all three objectives) and converted to an equivalent single-objective MDP [3]. In this work, we observe the resulting RL agent acting in the scenario in Figure 1.
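To make the idea of collapsing a reward vector into a single reward concrete, the following Python sketch uses a linear weighted sum. This is only an assumption for illustration: the text does not state which embedding [3] uses, and the weights and reward values below are hypothetical numbers chosen so that higher-priority values dominate lower-priority ones.

```python
from typing import Sequence


def scalarize(reward_vector: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse a vector of per-objective rewards into a single scalar reward
    using a weighted sum (an assumed embedding, for illustration only)."""
    if len(reward_vector) != len(weights):
        raise ValueError("reward vector and weights must have the same length")
    return sum(w * r for w, r in zip(weights, reward_vector))


# Objective order (assumed): pedestrian safety, obstacle avoidance, goal reaching.
weights = (100.0, 10.0, 1.0)  # hypothetical weights reflecting the priority order

hits_pedestrian_but_progresses = (-1.0, 0.0, 1.0)
safe_but_slow = (0.0, 0.0, 0.5)

unsafe_reward = scalarize(hits_pedestrian_but_progresses, weights)
safe_reward = scalarize(safe_but_slow, weights)
```

Under these hypothetical weights, a transition that harms a pedestrian receives a much lower scalar reward than a safe but slower one, even though it makes more progress towards the goal.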

4. Methodology

Given the ethical policy $\pi : S \to A$, we want to represent it as a set of logical formulas through a map $\Gamma : \mathcal{F} \to \mathcal{L}$, in which $\mathcal{F} = \{F_i\}$ is a set of ASP atoms representing user-defined environmental features, and $\mathcal{L} = \{A_i\}$ is the ASP formulation of $A$. We propose the following steps to achieve our objective.

4.1. Domain representation in ASP syntax

The first step of our method is to define an ASP feature map $F_{\mathcal{F}} : S \to G(\mathcal{F})$ and an action map $F_{\mathcal{L}} : A \to G(\mathcal{L})$, with $G(\cdot)$ representing a grounding function. Features in $\mathcal{F}$ abstract raw information about the environmental state into more interpretable human-level concepts, resulting in a more transparent expression of the state-action map. For instance, features for the autonomous driving domain are in the form item_pos(Dir, Dist), where Dist ∈ ℤ represents the distance between the agent and the item (pedestrian, obstacle or goal) along the direction Dir ∈ Dirs = {right, left, forward, down}. In the same domain, we have $\mathcal{L}$ = {move_slow(Dir), move_fast(Dir)}.
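A feature map of this kind could look like the Python sketch below. The grid encoding is an assumption made for the illustration: states are given as (row, column) cells, rows grow in the `down` direction and columns grow in the `right` direction, and one atom is emitted per axis along which the item is displaced. The function name `feature_atoms` and the item names are hypothetical.

```python
from typing import Dict, Set, Tuple

# (row, col); rows are assumed to grow downwards and columns rightwards.
Cell = Tuple[int, int]


def feature_atoms(agent: Cell, items: Dict[str, Cell]) -> Set[str]:
    """Map a raw grid state to ground atoms of the form item_pos(Dir, Dist)."""
    atoms: Set[str] = set()
    for name, (row, col) in items.items():
        d_row, d_col = row - agent[0], col - agent[1]
        if d_row != 0:
            direction = "down" if d_row > 0 else "forward"
            atoms.add(f"{name}_pos({direction}, {abs(d_row)})")
        if d_col != 0:
            direction = "right" if d_col > 0 else "left"
            atoms.add(f"{name}_pos({direction}, {abs(d_col)})")
    return atoms


# Hypothetical state: agent at row 0, column 1.
state_atoms = feature_atoms(
    agent=(0, 1),
    items={"obs": (0, 0), "goal": (2, 1), "ped": (3, 1)},
)
```

For this state the sketch yields obs_pos(left, 1), goal_pos(down, 2) and ped_pos(down, 3), i.e. a context of the same shape as the one used in the example of Section 4.2.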

4.2. ILASP task definition from execution traces

In order to extract ASP task specifications from RL executions after training, we collect traces, i.e. sequences of state-action pairs $\langle s, a \rangle \in S \times A$. We then map them to an ASP representation via the feature and action maps, obtaining a lifted representation $\langle \mathbf{s}, \mathbf{a} \rangle$ for each pair. For each lifted pair, we generate two CDPIs: $\langle \langle \mathbf{a}, \emptyset \rangle, \mathbf{s} \rangle$ and $\langle \langle \emptyset, G(\mathcal{L}) \setminus G(A_i) \rangle, \mathbf{s} \rangle$, with $\mathbf{s} \subseteq G(\mathcal{F})$ and $\mathbf{a} \subseteq G(A_i)$, where $A_i$ is the ASP atom representing $a$ and $G(\mathcal{L}) \setminus G(A_i)$ denotes all possible groundings of actions different from the executed one. Including examples with an empty included set enables the learned axioms to provide more significant and practical policy specifications, as they are also derived from counterexamples where actions are not executed. For instance, from ⟨move_fast(down), {obs_pos(left, 1), goal_pos(down, 2), ped_pos(down, 3)}⟩, we generate the following CDPIs:

⟨⟨move_fast(down), ∅⟩, {obs_pos(left, 1), goal_pos(down, 2), ped_pos(down, 3)}⟩
⟨⟨∅, move_slow(_)⟩, {obs_pos(left, 1), goal_pos(down, 2), ped_pos(down, 3)}⟩

To complete the ILASP task definition, we add the background knowledge, which defines ranges of variables and atoms, and the search space, which contains all possible axioms in the form $a \mathrel{\text{:-}} f_1, \ldots, f_n$, with $a \in \mathcal{L}$ and $f_i \in \mathcal{F}$.
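The generation of the two CDPIs from one lifted state-action pair can be sketched in Python as follows. The data layout (atoms as strings, a `CDPI` named tuple, the helper names) is hypothetical; only the action and direction names come from the text, and no ILASP input syntax is produced here.

```python
from typing import FrozenSet, NamedTuple, Tuple

DIRS = ("right", "left", "forward", "down")
ACTIONS = ("move_slow", "move_fast")


class CDPI(NamedTuple):
    included: FrozenSet[str]
    excluded: FrozenSet[str]
    context: FrozenSet[str]


def all_ground_actions() -> FrozenSet[str]:
    """Every grounding of every action atom, e.g. move_slow(left)."""
    return frozenset(f"{a}({d})" for a in ACTIONS for d in DIRS)


def cdpis_from_pair(action_atom: str, context: FrozenSet[str]) -> Tuple[CDPI, CDPI]:
    """Build the positive CDPI (executed action included) and the
    counterexample CDPI (groundings of the other actions excluded)."""
    positive = CDPI(frozenset({action_atom}), frozenset(), context)
    executed_name = action_atom.split("(")[0]
    others = frozenset(
        g for g in all_ground_actions() if not g.startswith(executed_name + "(")
    )
    negative = CDPI(frozenset(), others, context)
    return positive, negative


context = frozenset({"obs_pos(left, 1)", "goal_pos(down, 2)", "ped_pos(down, 3)"})
positive, negative = cdpis_from_pair("move_fast(down)", context)
```

For the executed action move_fast(down), the second CDPI excludes the four groundings of move_slow, which corresponds to the move_slow(_) entry in the example above.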

5. Empirical Evaluation

We applied our methodology to the domain in Figure 1, observing the RL agent trained to prioritize pedestrians' safety over passengers' safety and goal-reaching, which is the value system considered in [3]. To improve ILASP performance, we assume independent axioms for each action and thus create separate ILASP tasks. Starting from ≈ 36 trained RL executions with random agent start positions and pedestrian motions, we generate 4 CDPIs and, from a search space containing approximately 300 rules (with a maximum length of 6 body atoms and 4 variables each), we learn the following rules, which cover ≈ 70% of CDPIs:

move_fast(V1) :- goal_pos(V1, V2); V2 > 1; ped_pos(V1, V3); V3 > 2.    (1)
move_slow(V1) :- goal_pos(V1, V2); V2 < 2; ped_pos(V1, V3); V3 > 0.    (2)
move_slow(V1) :- goal_pos(V1, V2); V2 > 2; not ped_pos(V1, V3); distance(V3).    (3)

Importantly, these rules reflect, and explain, what the RL agent learned, that is, to prioritize pedestrians' safety even to the detriment of speed. Axiom 2 may seem critical; however, the only unsafe situation it allows is a pedestrian standing precisely on the goal cell, which is impossible since goal cells are only reachable by the car. Rule generation took ≈ 5 s, while training the RL agent required ≈ 3 h even on a very small-scale instance¹. The learned axioms were used to implement an ASP agent, and its performance was evaluated in 1 random scenarios. Since the task definition does not include a stop action, but the ASP axioms could not be satisfied in all contexts, we introduce a move_default action in ASP, equivalent to moving slowly and ground only when no other action is. Results in Table 1 show that, despite ASP solving being slightly slower, the ASP agent reaches the goal in fewer steps and with fewer collisions with pedestrians (neglecting the default action), thus achieving better performance than RL.

¹ All experiments have been run with an 11th Gen Intel(R) Core(TM) i7-1165G7 Quad-Core processor and 8 GB RAM.

Table 1: Statistics. Rows: average execution time per simulation, average number of steps, average obstacle collisions, average pedestrian collisions.
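How the three learned rules and the move_default fallback select actions in a given context can be sketched in plain Python. This is a hand-written reading of the rules, not the ASP agent used in the experiments: the context encoding as (item, direction, distance) triples and the range assumed for distance/1 are hypothetical, and a real ASP solver would evaluate the rules directly.

```python
from typing import Set, Tuple

Atom = Tuple[str, str, int]  # (item, direction, distance), e.g. ("ped", "down", 3)

# Assumed domain of the distance/1 atom; the actual range is not given in the text.
DISTANCES = range(0, 6)


def supported_actions(context: Set[Atom]) -> Set[str]:
    """Return the action atoms supported by rules (1)-(3) in this context,
    or move_default when no rule fires."""
    goals = [(d, n) for item, d, n in context if item == "goal"]
    peds = {(d, n) for item, d, n in context if item == "ped"}
    actions: Set[str] = set()
    for direction, goal_dist in goals:
        ped_dists = [n for d, n in peds if d == direction]
        # Rule (1): goal farther than 1 and a pedestrian farther than 2.
        if goal_dist > 1 and any(n > 2 for n in ped_dists):
            actions.add(f"move_fast({direction})")
        # Rule (2): goal closer than 2 and a pedestrian farther than 0.
        if goal_dist < 2 and any(n > 0 for n in ped_dists):
            actions.add(f"move_slow({direction})")
        # Rule (3): goal farther than 2 and some distance with no pedestrian.
        if goal_dist > 2 and any((direction, v3) not in peds for v3 in DISTANCES):
            actions.add(f"move_slow({direction})")
    if not actions:
        actions.add("move_default")  # fallback, ground only when nothing else is
    return actions


chosen = supported_actions({("obs", "left", 1), ("goal", "down", 2), ("ped", "down", 3)})
```

On the context of the example in Section 4.2, only rule (1) fires and the sketch returns move_fast(down), which matches the action in that trace; a context with a goal at distance 1 and no pedestrian in sight falls back to move_default.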

6. Conclusion

We proposed a methodology based on ILP to learn socially acceptable interpretations of multi-value policies generated by RL. The learned specifications are also helpful in implementing an ASP planner with slightly better performance than RL. In future work, we plan to extend the validation of the planner to verify that the learned ASP specifications generalize to larger domains, thus not requiring demanding training of RL. Furthermore, we intend to exploit the non-monotonic features of ASP to extend the scope of our research to more advanced scenarios.

References

[1] Russell, et al., Research priorities for robust and beneficial artificial intelligence, AI Magazine (2015).

[2] Sutton, et al., Reinforcement Learning: An Introduction, MIT Press, 2018.

[3] Rodriguez-Soto, et al., Multi-objective reinforcement learning for guaranteeing alignment with multiple values, 2023. ALA workshop at AAMAS.

[4] Meli, et al., Inductive learning of answer set programs for autonomous surgical task planning: Application to a training task for surgeons, Machine Learning (2021).

[5] Mazzi, et al., Learning logic specifications for soft policy guidance in POMCP, 2023. AAMAS.

[6] Muggleton, Inductive logic programming, New Generation Computing (1991).

[7] Meli, et al., Autonomous tissue retraction with a biomechanically informed logic based framework, 2021. IEEE ISMR 2021.

[8] Meli, et al., Logic programming for deliberative robotic task planning, Artificial Intelligence Review (2023).

[9] Calimeri, et al., ASP-Core-2 input language format, Theory and Practice of Logic Programming (2019).

[10] Law, Inductive learning of answer set programs, Ph.D. thesis, 2018.

[11] Roijers, et al., A survey of multi-objective sequential decision-making, Journal of Artificial Intelligence Research (2013).