<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop on Emerging Ethical Aspects of AI @ AIxIA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Inductive Logic Programming For Transparent Alignment With Multiple Moral Values</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Celeste Veronese</string-name>
          <email>celeste.veronese@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Meli</string-name>
          <email>daniele.meli@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Bistaffa</string-name>
          <email>filippo.bistaffa@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manel Rodríguez-Soto</string-name>
          <email>manel.rodriguez@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Farinelli</string-name>
          <email>alessandro.farinelli@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan A. Rodríguez-Aguilar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Verona</institution>
          ,
          <addr-line>Verona, 37134</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IIIA-CSIC</institution>
          ,
          <addr-line>Campus UAB, 08913 Bellaterra</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Inductive Logic Programming, Answer Set Programming</institution>
          ,
          <addr-line>Explainable AI</addr-line>
          ,
          <country>Ethical Decision Making</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>6</volume>
      <abstract>
        <p>Reinforcement learning is a key paradigm for developing intelligent agents that operate in complex environments and interact with humans. However, researchers need to explain and interpret the decisions of these systems, especially to ensure their alignment with societal value systems. This paper takes a first step in an ongoing research direction by applying an inductive logic programming methodology to explain the policy learned by an RL algorithm in the domain of autonomous driving, thus increasing the transparency of the ethical behaviour of agents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        As artificial agents become more intelligent and integrated into our society, ensuring that
they align with human values is crucial to prevent potential ethical risks in critical areas [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an effective paradigm for developing intelligent agents that
learn to interact with humans in complex environments. However, ensuring that RL agents
pursue their own objectives while remaining aligned with human values is still a challenging,
under-explored problem, known as the multi-valued RL problem. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that one
way to optimally solve this problem is to embed value signals into a single reward function,
which is then maximized in a single-objective RL problem. However, the embedding is highly
computationally demanding. In this setting, we aim to improve the transparency
of the multi-valued RL agent’s behaviour, allowing a better-informed evaluation of the agent’s
ethical alignment. Following [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], we use the framework of Inductive Logic Programming (ILP)
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to generate concise and interpretable explanations of RL agents’ behaviour, starting from
traces (state-action pairs) of their execution. In this way, we obtain logical rules that describe the
rationale behind the optimal policy. We can then express learned rules in logic programming
paradigms for planning [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], e.g., Answer Set Programming (ASP) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and apply them to
guide the agent in place of RL. We preliminarily validate our methodology for multi-valued
autonomous car driving in simulation, showing that the ASP planner accomplishes the task
successfully and ethically, which supports the quality of the learned explanations.
      </p>
      <p>†These authors contributed equally.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>We now report fundamentals for ASP and ILP, both required by our methodology.</p>
      <sec id="sec-3-1">
        <title>2.1. Answer Set Programming</title>
        <p>
          In ASP [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a domain is represented as a collection of logical statements which describe
relationships between entities (either actions or environmental features for planning domains),
which are represented as variables and predicates (atoms). A variable is ground when a value is assigned to it,
and an atom is ground when all of its variables are ground. Logical
statements (axioms) considered in this work are causal rules <inline-formula><tex-math>h \mathrel{\texttt{:-}} b_1, \ldots, b_n</tex-math></inline-formula>, which define the
body of the rule (i.e. the logical conjunction of terms <inline-formula><tex-math>\bigwedge_{i=1}^{n} b_i</tex-math></inline-formula>) as a precondition for the head <inline-formula><tex-math>h</tex-math></inline-formula>.
In the planning domain, they express preconditions or effects of actions. Given an ASP task
description, the solving process involves computing answer sets, that is, the minimal sets of
ground atoms satisfying axioms. The ASP solver starts from an initial grounding of body atoms
and deduces ground heads of rules, i.e., feasible sequences of actions.
        </p>
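        <p>The deduction step described above can be illustrated with a minimal sketch. It assumes a program made only of ground rules without negation, for which the unique answer set coincides with the least model obtained by forward chaining. The atoms ped_far(down) and the two rules are hypothetical and are not taken from this paper; real ASP solvers also handle variables, negation and choice, which this sketch does not.</p>
        <preformat><![CDATA[
```python
from typing import FrozenSet, List, Set, Tuple

# A ground rule (head, body): the head is derived once every body atom holds.
Rule = Tuple[str, FrozenSet[str]]


def least_model(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain ground, negation-free rules until a fixpoint is reached."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and body <= model:
                model.add(head)
                changed = True
    return model


# Hypothetical ground program, for illustration only.
facts = {"goal_pos(down,3)", "ped_pos(down,4)"}
rules = [
    ("ped_far(down)", frozenset({"ped_pos(down,4)"})),
    ("move_fast(down)", frozenset({"goal_pos(down,3)", "ped_far(down)"})),
]
answer_set = least_model(facts, rules)
print(sorted(answer_set))
```
]]></preformat>
        <p>In this toy program the action atom move_fast(down) is derived only after the intermediate atom ped_far(down), which mirrors how rule heads are deduced from ground body atoms.</p>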
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Inductive Logic Programming</title>
        <p>
          An ILP [
          <xref ref-type="bibr" rid="ref6">6</xref>
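          ] task is defined as a tuple <inline-formula><tex-math>T = \langle B, S_M, E \rangle</tex-math></inline-formula>, consisting of background knowledge <inline-formula><tex-math>B</tex-math></inline-formula>
expressed in a logic formalism <inline-formula><tex-math>F</tex-math></inline-formula>, search space <inline-formula><tex-math>S_M</tex-math></inline-formula> consisting of possible axioms and a set of
examples <inline-formula><tex-math>E</tex-math></inline-formula>, both expressed in the syntax of <inline-formula><tex-math>F</tex-math></inline-formula>. The goal is to find a hypothesis <inline-formula><tex-math>H \subseteq S_M</tex-math></inline-formula> covering
<inline-formula><tex-math>E</tex-math></inline-formula>. In ILASP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], an implementation of ILP under the ASP semantics, examples are
Context-Dependent Partial Interpretations (CDPIs), that is, tuples <inline-formula><tex-math>\langle e, C \rangle</tex-math></inline-formula>, where <inline-formula><tex-math>C</tex-math></inline-formula> is a set of atoms called
context (for our scope, environmental features) and <inline-formula><tex-math>e = \langle e^{inc}, e^{exc} \rangle</tex-math></inline-formula> is made up of an included set <inline-formula><tex-math>e^{inc}</tex-math></inline-formula>
and an excluded set <inline-formula><tex-math>e^{exc}</tex-math></inline-formula>, both containing ground atoms (actions, for our scope). The goal
of ILASP is to find <inline-formula><tex-math>H</tex-math></inline-formula> such that: <inline-formula><tex-math>\forall e \in E : B \cup H \cup C \models e^{inc} \wedge B \cup H \cup C \not\models e^{exc}</tex-math></inline-formula>. Therefore, given
the context, ILASP finds axioms which support the execution of actions in <inline-formula><tex-math>e^{inc}</tex-math></inline-formula> and do not
support the ones in <inline-formula><tex-math>e^{exc}</tex-math></inline-formula>. Considering that <inline-formula><tex-math>H</tex-math></inline-formula> could fail to cover all CDPIs, ILASP also returns
the number of uncovered CDPIs as a confidence measure.</p>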
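        <p>The coverage condition and the count of uncovered CDPIs can be sketched as follows. This is a simplified illustration, not ILASP itself: it assumes a hypothesis made of ground rules without negation (so there is a single answer set to check), treats the background knowledge as part of the rule list, and uses a hypothetical rule and hypothetical examples.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[str, FrozenSet[str]]  # (head, body) over ground atoms


@dataclass(frozen=True)
class CDPI:
    included: FrozenSet[str]  # atoms that must be derived
    excluded: FrozenSet[str]  # atoms that must not be derived
    context: FrozenSet[str]   # ground atoms describing the situation


def derive(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Least model of the context facts under ground, negation-free rules."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and body <= model:
                model.add(head)
                changed = True
    return model


def covers(rules: List[Rule], example: CDPI) -> bool:
    """True if all included atoms and no excluded atom are derived."""
    model = derive(set(example.context), rules)
    return example.included <= model and not (example.excluded & model)


def uncovered_count(rules: List[Rule], examples: List[CDPI]) -> int:
    """Number of examples the hypothesis fails to cover."""
    return sum(not covers(rules, ex) for ex in examples)


# Hypothetical hypothesis and examples, for illustration only.
hypothesis = [("move_fast(down)", frozenset({"goal_pos(down,2)"}))]
examples = [
    CDPI(frozenset({"move_fast(down)"}), frozenset(),
         frozenset({"goal_pos(down,2)"})),
    CDPI(frozenset(), frozenset({"move_fast(down)"}),
         frozenset({"goal_pos(down,2)", "ped_pos(down,1)"})),
]
print(uncovered_count(hypothesis, examples))
```
]]></preformat>
        <p>The second example is not covered because the hypothetical rule ignores the pedestrian and still derives the excluded action, which is the kind of mismatch the confidence measure reports.</p>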
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Problem definition and case study</title>
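      <p>
        This work focuses on the problem of multi-value alignment, wherein moral values influence
decision-making with different priorities. We start from a
value system <inline-formula><tex-math>\mathcal{VS}</tex-math></inline-formula>, that is, a tuple
<inline-formula><tex-math>\mathcal{VS} = \langle V, \succcurlyeq \rangle</tex-math></inline-formula>, where <inline-formula><tex-math>V = \{v_1, \ldots, v_n\}</tex-math></inline-formula> stands for a non-empty set of moral values plus the agent’s
individual objective, and <inline-formula><tex-math>\succcurlyeq</tex-math></inline-formula> is a total order of preference over the elements of <inline-formula><tex-math>V</tex-math></inline-formula>. The methodology
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] considers the ethical knowledge in <inline-formula><tex-math>\mathcal{VS}</tex-math></inline-formula> as ethical rewards in a Multi-Objective Markov
Decision Process (MOMDP) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Recall that a MOMDP is defined by a tuple <inline-formula><tex-math>\langle S, A, R, T \rangle</tex-math></inline-formula>, where
<inline-formula><tex-math>S</tex-math></inline-formula> is a finite set of states, <inline-formula><tex-math>A</tex-math></inline-formula> is a finite set of actions, <inline-formula><tex-math>R : S \times A \times S \to \mathbb{R}^{n}</tex-math></inline-formula> is a reward function that
describes a vector of <inline-formula><tex-math>n</tex-math></inline-formula> rewards for a given state, action, and next state, and <inline-formula><tex-math>T : S \times A \times S \to [0, 1]</tex-math></inline-formula>
is a transition function that specifies the probability of moving to the next state for a given
state and action. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] shows that the MOMDP can be converted to an equivalent single-objective
MDP, whose optimal policy <inline-formula><tex-math>\pi</tex-math></inline-formula> is guaranteed to be aligned with <inline-formula><tex-math>\mathcal{VS}</tex-math></inline-formula>. Consider, for example,
the autonomous driving task: the car agent not only has to reach its destination, but it must
also preserve pedestrians’ safety and avoid obstacles. The task can be modelled as a MOMDP
(with a reward vector <inline-formula><tex-math>R</tex-math></inline-formula> that comprises all three objectives) and converted to an equivalent
single-objective MDP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this work, we observe the resulting RL agent acting in
the scenario in Figure 1.
      </p>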
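      <p>To make the idea of collapsing a reward vector into a single reward concrete, the sketch below uses plain linear scalarization. This is only an illustrative assumption: the weights, the ordering of the objectives and the reward values are hypothetical, and the embedding of [3] derives its own weights rather than using hand-picked ones.</p>
      <preformat><![CDATA[
```python
from typing import Sequence


def scalarize(reward_vector: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of a multi-objective reward vector."""
    if len(reward_vector) != len(weights):
        raise ValueError("reward vector and weights must have the same length")
    return sum(w * r for w, r in zip(weights, reward_vector))


# Hypothetical ordering: (pedestrian safety, obstacle avoidance, goal reaching).
weights = (100.0, 10.0, 1.0)

# Hypothetical reward vectors for two candidate transitions.
safe_slow = (0.0, 0.0, -1.0)    # no harm, small time penalty
risky_fast = (-1.0, 0.0, 2.0)   # endangers a pedestrian, faster progress

print(scalarize(safe_slow, weights), scalarize(risky_fast, weights))
```
]]></preformat>
      <p>With these illustrative weights the safe transition scores higher than the risky one, so a single-objective learner maximizing the scalar reward would prefer it.</p>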
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
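      <p>Given the ethical policy <inline-formula><tex-math>\pi : S \to A</tex-math></inline-formula>, we want to represent it as a set of logical formulas through
a map <inline-formula><tex-math>\Gamma : \mathcal{F} \to \mathcal{A}_{\mathcal{L}}</tex-math></inline-formula>, in which <inline-formula><tex-math>\mathcal{F} = \{F_i\}</tex-math></inline-formula> is a set of ASP atoms representing user-defined
environmental features, and <inline-formula><tex-math>\mathcal{A}_{\mathcal{L}} = \{A_i\}</tex-math></inline-formula> is the ASP formulation of <inline-formula><tex-math>A</tex-math></inline-formula>. We propose the following
steps to achieve our objective.</p>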
      <sec id="sec-5-1">
        <title>4.1. Domain representation in ASP syntax</title>
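        <p>The first step of our method is to define an ASP feature map <inline-formula><tex-math>F_{\mathcal{F}} : S \to G(\mathcal{F})</tex-math></inline-formula> and an action
map <inline-formula><tex-math>F_{\mathcal{A}_{\mathcal{L}}} : A \to G(\mathcal{A}_{\mathcal{L}})</tex-math></inline-formula>, with <inline-formula><tex-math>G(\cdot)</tex-math></inline-formula> representing a grounding function. Features in <inline-formula><tex-math>\mathcal{F}</tex-math></inline-formula> abstract
raw information about the environmental state into more interpretable human-level concepts,
resulting in a more transparent expression of the state-action map <inline-formula><tex-math>\pi</tex-math></inline-formula>. For instance, features
for the autonomous driving domain are in the form item_pos(Dir, Dist), where Dist ∈ ℤ
represents the distance between the agent and the item (pedestrian, obstacle or goal), along
the direction Dir ∈ Dirs = {right, left, forward, down}. In the same domain, we have <inline-formula><tex-math>\mathcal{A}_{\mathcal{L}}</tex-math></inline-formula> =
{move_slow(Dir), move_fast(Dir)}.</p>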
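        <p>A feature map of this kind could be sketched as follows. The grid coordinates, the axis conventions (right and forward as positive offsets) and the choice of emitting one atom per non-zero axis offset are assumptions made for illustration; the paper only fixes the form item_pos(Dir, Dist) of the resulting atoms.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (x, y) grid coordinates; this layout is an assumption


def feature_atoms(agent: Cell, items: Dict[str, Cell]) -> Set[str]:
    """Abstract raw positions into item_pos(Dir, Dist) atoms.

    Assumed convention: +x is right, -x is left, +y is forward, -y is down.
    """
    atoms = set()
    ax, ay = agent
    for name, (ix, iy) in items.items():
        dx, dy = ix - ax, iy - ay
        if dx != 0:
            atoms.add(f"{name}_pos({'right' if dx > 0 else 'left'},{abs(dx)})")
        if dy != 0:
            atoms.add(f"{name}_pos({'forward' if dy > 0 else 'down'},{abs(dy)})")
    return atoms


# Hypothetical state: agent at (2, 2) with one obstacle, goal and pedestrian.
state = {"obs": (1, 2), "goal": (2, 0), "ped": (2, 5)}
print(sorted(feature_atoms((2, 2), state)))
```
]]></preformat>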
      </sec>
      <sec id="sec-5-2">
        <title>4.2. ILASP task definition from execution traces</title>
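        <p>In order to extract ASP task specifications from RL executions after training, we collect traces
(i.e. sequences of state-action pairs <inline-formula><tex-math>\langle s, a \rangle \in S \times A</tex-math></inline-formula>). We then map them to an ASP representation via
the <inline-formula><tex-math>F_{\mathcal{F}}</tex-math></inline-formula> and <inline-formula><tex-math>F_{\mathcal{A}_{\mathcal{L}}}</tex-math></inline-formula> maps, obtaining a lifted representation <inline-formula><tex-math>\langle \mathbf{s}, \mathbf{a} \rangle</tex-math></inline-formula> for each pair. For each lifted pair,
we generate two CDPIs: <inline-formula><tex-math>\langle \langle \mathbf{a}, \emptyset \rangle, \mathbf{s} \rangle</tex-math></inline-formula> and <inline-formula><tex-math>\langle \langle \emptyset, G(\mathcal{A}_{\mathcal{L}}) \setminus G(A_a) \rangle, \mathbf{s} \rangle</tex-math></inline-formula>, with <inline-formula><tex-math>\mathbf{s} \subseteq G(\mathcal{F})</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{a} \subseteq G(A_a)</tex-math></inline-formula>,
where <inline-formula><tex-math>A_a</tex-math></inline-formula> is the ASP atom representing <inline-formula><tex-math>a</tex-math></inline-formula> and <inline-formula><tex-math>G(\mathcal{A}_{\mathcal{L}}) \setminus G(A_a)</tex-math></inline-formula> denotes all possible groundings
of actions different from the executed one. Including examples with an empty included set
enables the learned axioms to provide more significant and practical policy specifications, as
they are also derived from counterexamples where actions are not executed. For instance,
from ⟨move_fast(down), {obs_pos(left, 1), goal_pos(down, 2), ped_pos(down, 3)}⟩, we
generate the following CDPIs:
<preformat>⟨⟨move_fast(down), ∅⟩, {obs_pos(left, 1), goal_pos(down, 2), ped_pos(down, 3)}⟩
⟨⟨∅, move_slow(_)⟩, {obs_pos(left, 1), goal_pos(down, 2), ped_pos(down, 3)}⟩</preformat>
To complete the ILASP task definition we add the background knowledge, which defines ranges
of variables and atoms, and the search space, which contains all possible axioms in the form
<inline-formula><tex-math>a \mathrel{\texttt{:-}} f_1, \ldots, f_n</tex-math></inline-formula>, with <inline-formula><tex-math>a \in \mathcal{A}_{\mathcal{L}}</tex-math></inline-formula> and <inline-formula><tex-math>f_i \in \mathcal{F}</tex-math></inline-formula>.</p>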
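        <p>The generation of the two CDPIs from one lifted state-action pair can be sketched as below. The sketch follows the general definition, in which the second example excludes every ground action other than the executed one; the tuple layout (included, excluded, context) and the helper name cdpis_from_pair are illustrative choices, not part of ILASP.</p>
        <preformat><![CDATA[
```python
from itertools import product
from typing import FrozenSet, List, Tuple

DIRS = ("right", "left", "forward", "down")
ACTIONS = ("move_slow", "move_fast")

# All groundings of the action atoms over the four directions.
ALL_GROUND_ACTIONS = frozenset(f"{a}({d})" for a, d in product(ACTIONS, DIRS))

# A CDPI as (included set, excluded set, context); this layout is an assumption.
CDPI = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]


def cdpis_from_pair(action: str, state: FrozenSet[str]) -> List[CDPI]:
    """Build one positive and one counterexample CDPI for a lifted pair."""
    if action not in ALL_GROUND_ACTIONS:
        raise ValueError(f"unknown ground action: {action}")
    # The executed action must be supported in this context.
    positive = (frozenset({action}), frozenset(), state)
    # Every other ground action must not be supported in this context.
    negative = (frozenset(), ALL_GROUND_ACTIONS - {action}, state)
    return [positive, negative]


context = frozenset({"obs_pos(left,1)", "goal_pos(down,2)", "ped_pos(down,3)"})
positive, negative = cdpis_from_pair("move_fast(down)", context)
print(sorted(negative[1]))
```
]]></preformat>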
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Empirical Evaluation</title>
      <p>
        We applied our methodology on the domain in Figure 1, observing the RL agent trained to
prioritize pedestrians’ safety over passengers’ safety and goal-reaching, which is the value
system considered in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To improve ILASP performance, we assume independent axioms for
each action and thus create a separate ILASP task per action. Starting from ≈ 36 trained RL executions
with random agent start positions and pedestrian motions, we generate 4 CDPIs and, from a
search space containing approximately 300 rules (with a maximum length of 6 body atoms and
4 variables each), we learn the following rules, which cover ≈ 70% of CDPIs:
<preformat>move_fast(V1) :- goal_pos(V1, V2); V2 &gt; 1; ped_pos(V1, V3); V3 &gt; 2.    (1)
move_slow(V1) :- goal_pos(V1, V2); V2 &lt; 2; ped_pos(V1, V3); V3 &gt; 0.    (2)
move_slow(V1) :- goal_pos(V1, V2); V2 &gt; 2; not ped_pos(V1, V3); distance(V3).    (3)</preformat>
Importantly, these rules reflect, and explain, what the RL agent learned, that is, to prioritize
pedestrians’ safety even to the detriment of speed. Axiom 2 may seem risky; still, the only
unsafe situation is a pedestrian being precisely on the goal cell, which is
impossible since goal cells are only reachable by the car. Rule generation took ≈ 5s, while
training the RL agent required ≈ 3h even on a very small-scale instance (all experiments were run on an 11th Gen Intel(R) Core(TM) i7-1165G7 quad-core processor with 8 GB RAM). Learned axioms were
used to implement an ASP agent, and its performance was evaluated in 1 random scenarios.
Since the task definition does not include a stop action, but the ASP axioms could not be satisfied
in all contexts, we introduce a move_default action to ASP, equivalent to moving slow and
ground only when no other action is. Results in Table 1 show that, despite ASP solving being
slightly slower, the ASP agent reaches the goal in fewer steps and with fewer collisions with
pedestrians (neglecting the default action), thus achieving better performance than RL.
      </p>
      <p>Table 1. Statistics: average execution time per simulation, average number of steps, average obstacle collisions, and average pedestrian collisions.</p>
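      <p>How the learned rules and the move_default fallback select actions can be sketched in plain Python. This is an illustrative re-implementation, not the ASP agent used in the experiments: the range of the distance predicate is assumed, and rule (3) is read as firing when some distance value in that range has no pedestrian along the direction.</p>
      <preformat><![CDATA[
```python
from typing import Set, Tuple

Atom = Tuple[str, int]   # (direction, distance)
DISTANCES = range(0, 6)  # assumed domain of distance(V3)


def applicable_actions(goal: Set[Atom], ped: Set[Atom]) -> Set[str]:
    """Return the ground actions supported by rules (1)-(3), or the default."""
    actions = set()
    for direction, v2 in goal:
        ped_here = [v3 for (d, v3) in ped if d == direction]
        # Rule (1): goal farther than 1 and a pedestrian farther than 2.
        if v2 > 1 and any(v3 > 2 for v3 in ped_here):
            actions.add(f"move_fast({direction})")
        # Rule (2): goal closer than 2 and a pedestrian farther than 0.
        if v2 < 2 and any(v3 > 0 for v3 in ped_here):
            actions.add(f"move_slow({direction})")
        # Rule (3): goal farther than 2 and some distance with no pedestrian.
        if v2 > 2 and any(v3 not in ped_here for v3 in DISTANCES):
            actions.add(f"move_slow({direction})")
    # move_default is ground only when no other action is.
    if not actions:
        actions.add("move_default")
    return actions


print(applicable_actions({("down", 2)}, {("down", 3)}))
```
]]></preformat>
      <p>On the context of the example trace in Section 4.2 (goal at distance 2 and pedestrian at distance 3, both downwards), only rule (1) fires in this sketch, which agrees with the move_fast(down) action recorded in that trace.</p>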
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>We proposed a methodology based on ILP to learn socially acceptable
interpretations of multi-value policies generated by RL. Learned specifications are also helpful in
implementing an ASP planner with slightly better performance compared to RL. In future work,
we plan to extend the validation of the planner to verify that learned ASP specifications
generalize to larger domains, thus not requiring demanding training of RL. Furthermore, we
intend to exploit the non-monotonic features of ASP to extend the scope of our research to
more advanced scenarios.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Russell</surname>
          </string-name>
          , et al.,
          <article-title>Research priorities for robust and beneficial artificial intelligence</article-title>
          ,
          <source>AI Magazine</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sutton</surname>
          </string-name>
          , et al.,
          <source>Reinforcement Learning: An Introduction</source>
          , MIT Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rodríguez-Soto</surname>
          </string-name>
          , et al.,
          <article-title>Multi-objective reinforcement learning for guaranteeing alignment with multiple values</article-title>
          ,
          <year>2023</year>
          . ALA workshop at AAMAS.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Meli</surname>
          </string-name>
          , et al.,
          <article-title>Inductive learning of answer set programs for autonomous surgical task planning: Application to a training task for surgeons</article-title>
          ,
          <source>Machine Learning</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Mazzi</surname>
          </string-name>
          , et al.,
          <article-title>Learning logic specifications for soft policy guidance in POMCP</article-title>
          ,
          <year>2023</year>
          . AAMAS.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          ,
          <article-title>Inductive logic programming</article-title>
          ,
          <source>New Generation Computing</source>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Meli</surname>
          </string-name>
          , et al.,
          <article-title>Autonomous tissue retraction with a biomechanically informed logic based framework</article-title>
          ,
          <year>2021</year>
          . IEEE ISMR.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Meli</surname>
          </string-name>
          , et al.,
          <article-title>Logic programming for deliberative robotic task planning</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Calimeri</surname>
          </string-name>
          , et al.,
          <article-title>ASP-Core-2 input language format</article-title>
          ,
          <source>Theory and Practice of Logic Programming</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <article-title>Inductive learning of answer set programs</article-title>
          ,
          <source>Ph.D. thesis</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Roijers</surname>
          </string-name>
          , et al.,
          <article-title>A survey of multi-objective sequential decision-making</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>