<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>What are you saying? Explaining communication in multi-agent reinforcement learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Meli</string-name>
          <email>daniele.meli@univr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristian Morasso</string-name>
          <email>cristian.morasso@univr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Castellini</string-name>
          <email>alberto.castellini@univr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Farinelli</string-name>
          <email>alessandro.farinelli@univr.it</email>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Multi-Agent Reinforcement Learning, Communication in MARL, Explainable AI, Causal Discovery</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Communication in Multi-Agent Reinforcement Learning (MARL) has the potential to improve the performance of cooperating agents, especially in complex robotic domains under partial observability. However, a transparent interpretation of the learned communication policy is crucial for trustworthiness and safety. In this paper, we use tools from explainable artificial intelligence to investigate the impact of communication in a benchmark MARL setting, involving collision avoidance among multiple agents. Our preliminary tests show that the role of communication cannot be evidenced solely by looking at the state-action policy map; instead, causal discovery on the state and communication spaces highlights the latent behavioural impact of messages passed among agents, indirectly affecting the actual actions for more efficient collision avoidance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Reinforcement Learning (RL) is an established methodology to achieve agent autonomy in
complex scenarios, including robotics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Indeed, given the model of interaction with the
environment (the transition map) and the reward attained as a consequence of executing specific
actions in particular conditions (states), an RL algorithm automatically learns the best policy, i.e.,
state-action map, to fulfill the task at the highest cumulative reward (return). The advent of Deep
Neural Networks (DNNs) has enhanced the learning of complex policies for the most challenging
tasks, shifting towards Deep RL (DRL). This has also paved the way towards DRL applied in
multi-agent settings (Multi-Agent RL, MARL) [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], where the best task strategy does not solely
depend on the individual policies, but rather on inter-agent coordination. Inspired by
biology and human behaviour, an emerging problem in MARL is inter-agent communication
[
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], i.e., learning and deploying an efficient mechanism for information sharing among agents,
with the goal of enhancing coordination and improving the individual and global task performance.
While several approaches have been studied and compared, one fundamental question arises
when deploying communicating MARL agents in the real world, e.g., on real robots interacting
with humans: what is the meaning of the learned communication policy? Answering this question
is fundamental for the transparency and interpretability of the MARL application, which in
turn are essential for trustworthiness and social acceptance, as well as for proper monitoring of the
autonomous systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this paper, we address the problem of explaining MARL communication. We consider a
benchmark domain for MARL, simple spread (https://pettingzoo.farama.org/environments/mpe/simple_spread/),
where 3 robotic agents must coordinate to reach
3 separate targets (Figure 1a). We design different communication protocols, both hardcoded
and learned in the MARL pipeline. We then investigate the impact of different communication
strategies on MARL performance, both from a quantitative perspective (i.e., evaluating the
achieved return) and exploiting eXplainable Artificial Intelligence (XAI) techniques, including
relevant feature analysis via Integrated Gradients (IG) and causal discovery, already employed
for complex system explanation and monitoring [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. In this way, we analyze the
meaning of messages passed among agents, and how specific pieces of information affect the
overall performance observed via standard RL metrics, such as the return.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>We now provide the relevant background on MARL and related communication strategies,
as well as on the XAI methods adopted in this paper, i.e., IG and causal discovery.</p>
      <sec id="sec-3-1">
        <title>2.1. Multi-Agent Reinforcement Learning</title>
        <p>We frame the problem of single-agent RL as a Markov Decision Process (MDP) ⟨S, A, T, r, γ⟩,
where S is the state space; A is the action space; T ∶ S × A → S is the transition function mapping
state s_t and action a_t at time t to the state at t+1 (assuming a discretization of the time dimension);
γ ∈ [0, 1) is the discount factor; r ∶ S × A × S → ℝ is the reward map, assigning a real number to
incentivize / penalize the agent for executing a_t at s_t, with the corresponding next state determined
by T. The goal of RL is to compute a policy map π ∶ S → A, prescribing the best a_t to be
performed at s_t, in order to maximize the expected value of the return ∑_{t=1}^{∞} γ^{t−1} r(s_t, a_t, s_{t+1}).</p>
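        <p>As a minimal illustration of the return (a sketch under our own naming, not tied to any specific library), the discounted sum can be computed from a finite list of per-step rewards:</p>
        <preformat>
def discounted_return(rewards, gamma=0.99):
    """Compute the return sum_{t=1}^{T} gamma^(t-1) * r_t of a finite episode."""
    # enumerate yields t = 0, 1, ..., matching the exponents t-1 for t = 1, 2, ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

# e.g., three steps of reward 1.0 with gamma = 0.9:
# discounted_return([1.0, 1.0, 1.0], gamma=0.9) == 1.0 + 0.9 + 0.81
        </preformat>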
        <p>In the MARL setting with N agents, we assume that each agent i has access to a state
space S_i such that ⟨S_1, …, S_N⟩ = S; similarly, each agent can pick an action from A_i such that
⟨A_1, …, A_N⟩ = A; T, r, γ remain unchanged. In this way, each agent has partial observability of
the environment, but still all agents should coordinate to compute the best global policy towards
the maximization of the cumulative shared reward. It is then fundamental for the agents to
communicate; however, it is highly domain-dependent, and in general far from trivial, to design
the messages and methodologies for effective communication [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. An interesting approach is then
to learn the best communication policy π_c ∶ S → C, C being the set of available communication
actions c_t at time t.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Explainable AI</title>
        <p>
          XAI aims at providing explanations about AI algorithms to a targeted audience, according to
their needs and knowledge in relation to a specific domain of application [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In this paper,
we focus on two main XAI methodologies: causal discovery from time series and Integrated
Gradients (IG) to explain the input / output relations in DNNs.
        </p>
        <sec id="sec-3-2-1">
          <title>2.2.1. Causal discovery</title>
          <p>Consider a multi-variate time series X = {X^j}_{j=1,…,N} composed of N time series, and denote as
X^j = {x^j_1, …, x^j_T} the sequence of observations of variable X^j for T time steps. The goal of causal
discovery is to identify directed causal links between variables in X. More specifically, causal
links are determined according to the measure of Conditional Mutual Information (CMI), which
is defined for random variables X, Y, Z as:</p>
          <p>I(X; Y ∣ Z) = ∭ p(x, y, z) log [ p(x, y ∣ z) / ( p(x ∣ z) p(y ∣ z) ) ] dx dy dz</p>
          <p>where p(⋅ ∣ ⋅) and p(⋅, ⋅) denote the conditional and joint probability distributions, respectively.
From the above definition, it can be easily shown that variables X and Y are conditionally
independent given Z, denoted as X ⫫ Y ∣ Z, if I(X; Y ∣ Z) = 0. In other words, X and Y have no
mutual causal influence, assuming that Z holds. On the other hand, X and Y may conditionally
depend on Z.</p>
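          <p>To make the definition concrete, the following sketch estimates CMI for one-dimensional continuous series via quantile discretization and a plug-in histogram estimator; the function name and the binning choice are our own illustration, not the estimator used in this paper (which relies on the tigramite library, see Section 3):</p>
          <preformat>
import numpy as np

def cmi_discrete(x, y, z, bins=8):
    """Plug-in estimate of I(X; Y | Z) from samples of three 1-D variables."""
    def digitize(v):
        # quantile binning: roughly equal mass per bin
        edges = np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(v, edges)

    xd, yd, zd = digitize(x), digitize(y), digitize(z)
    pxyz = np.zeros((bins, bins, bins))
    for i, j, k in zip(xd, yd, zd):
        pxyz[i, j, k] += 1
    pxyz /= pxyz.sum()

    pz = pxyz.sum(axis=(0, 1))   # p(z)
    pxz = pxyz.sum(axis=1)       # p(x, z)
    pyz = pxyz.sum(axis=0)       # p(y, z)

    # I(X;Y|Z) = sum p(x,y,z) * log( p(x,y,z) p(z) / (p(x,z) p(y,z)) )
    cmi = 0.0
    for i in range(bins):
        for j in range(bins):
            for k in range(bins):
                if pxyz[i, j, k] > 0:
                    cmi += pxyz[i, j, k] * np.log(
                        pxyz[i, j, k] * pz[k] / (pxz[i, k] * pyz[j, k])
                    )
    return cmi
          </preformat>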
        </sec>
        <sec id="sec-3-2-2">
          <title>2.2.2. Integrated Gradients (IG)</title>
          <p>
            IG [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] is defined as:
dimensions, respectively. Let  ∈
  (  ′) = 1
          </p>
          <p>⋅ 1 (i.e., a neutral input for the DNN).</p>
          <p>
            Consider a DNN   ∶ R → [
            <xref ref-type="bibr" rid="ref1">0, 1</xref>
            ]  ,  being the set of parameters, , 
the input and output
R
          </p>
          <p>be a generic input to   , and  ′ ∈ R be a baseline input, s.t.
  (  ) = (  −  ′) ⋅ ∑
 =1
   (  ′ +  ⋅  −  ′)
 (

⋅
1

determining   (  ) .
2Approximated via  discretization steps.
which is the path integral2 of the gradients of   along the straight line (in R ) from  to  ′.
For each input dimension  &lt;  , IG measures its attribution to   (  ) , i.e., the contribution   in
(a)
(b)
for actions left , right, down, up.</p>
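          <p>A minimal sketch of the Riemann approximation above, written with PyTorch autograd (the function name and arguments are our own; the model f is assumed to map an input vector to a scalar output, e.g., one action head of a policy network):</p>
          <preformat>
import torch

def integrated_gradients(f, x, baseline=None, steps=50):
    """Approximate IG_i(x) = (x_i - x'_i) * (1/K) * sum_k df/dx_i,
    with the gradient evaluated at x' + (k/K)(x - x'), k = 1..K."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # zero baseline as the neutral input
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()             # gradient of the scalar output w.r.t. point
        total_grad += point.grad
    return (x - baseline) * total_grad / steps
          </preformat>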
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>
        We consider a MARL setting based on deep deterministic policy gradient [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The simple
spread domain (Figure 1a) is described as a MDP, where each agent observes the following
continuous state variables: i) x–y velocities v = ⟨v_x, v_y⟩; ii) coordinates p = ⟨p_x, p_y⟩; iii)
landmark (target) coordinates l = ⟨l_x, l_y⟩; iv) coordinates of other agents o = ⟨o_x, o_y⟩. The
continuous action space, for each agent, consists of 4 directional forces in [0, 1], resulting in up
/ left / down / right motions.
      </p>
      <p>
        We assume the Reinforced Inter-Agent Learning (RIAL) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] communication protocol is
employed (Figure 1b). In RIAL, the communication action c_i from agent i at time t is passed to
all other agents as an additional observation (state) input. Hence, we can define the MDP
⟨S̄, Ā, T, r, γ⟩, where S̄ = S ∪ C and Ā = A ∪ C.
      </p>
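      <p>As a sketch of the resulting input structure (our own minimal illustration of the RIAL-style augmentation, not the authors' code), each agent's policy input is its local observation concatenated with the messages broadcast by the other agents:</p>
      <preformat>
import numpy as np

def augmented_obs(own_obs, messages, agent_id):
    """Concatenate agent `agent_id`'s observation with the messages
    emitted by all other agents at the previous step."""
    others = [m for j, m in enumerate(messages) if j != agent_id]
    return np.concatenate([own_obs] + others)

# e.g., 3 agents with 2-D messages:
# obs_bar = augmented_obs(obs0, [c0, c1, c2], agent_id=0)  # skips c0
      </preformat>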
      <p>Our goal is to investigate the meaning of the communication policy π_c, i.e., the impact of the
communication actions on MARL performance. To this aim, we first apply IG to the trained
policy network of agent i, in order to identify the impact of c_j on a_i, j ≠ i. This quantifies the direct
impact of the communication strategy on MARL.</p>
      <p>Then, we discover causal relations between time series from S̄, generated by applying the trained
policy π in inference. This study evidences the latent impact of the communication strategy, i.e.,
how c_i influences the general behaviour of the agent (e.g., its intentions), rather than merely its
actions. We adopt the state-of-the-art PCMCI+ algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for causal discovery, which is sound
and complete under the assumptions of causal Markovianity, sufficiency, and faithfulness.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>We consider 3 different communication policies: i) Closest Target (CT), where c_i is the closest
landmark to the i-th agent; ii) Intent, where c_i is the action selected by agent i; iii) Learnable, where
π_c is trained together with π, resulting in c_i ∈ ℝ^{N−1} (in our case ℝ^{N−1} = ℝ^2). A sketch of the
hardcoded protocols follows below.</p>
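      <p>A minimal sketch of the two hardcoded protocols (function names and signatures are our own illustration, not the authors' implementation):</p>
      <preformat>
import numpy as np

def message_closest_target(agent_pos, landmarks):
    """CT protocol: broadcast the coordinates of the nearest landmark."""
    dists = np.linalg.norm(landmarks - agent_pos, axis=1)
    return landmarks[np.argmin(dists)]

def message_intent(action):
    """Intent protocol: broadcast the action the agent just selected."""
    return action
      </preformat>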
      <p>We first report the training performance (over 5 random seeds) with the different
communication strategies in Figure 2a, where Base denotes no communication, i.e., c ≡ 0. We notice
that all MARL policies have large negative drops after the stabilization of the training process.
This derives from the non-stationarity of MARL, under the assumption of partial observability
from each agent. However, the Learnable protocol results in the smallest negative peak in the
return trend. This suggests that each agent can learn to communicate useful messages to the
others, resulting in more robust performance of MARL.</p>
      <p>We first try to understand the role of communication in the Learnable protocol via IG analysis.
Figure 2b shows the attributions of state features for all actions (see the legend in the caption;
we only report one agent for compactness). It is evident that the communication actions (c_1, …, c_3)
do not have significant attribution on the actions chosen by the agent, which in turn depend
mostly on its velocity v, and the position of target l_0.</p>
      <p>We then employ causal discovery with the Learnable protocol to show latent connections
between variables in S̄. Figure 3 shows the causal graph derived from PCMCI+ (we employ the
implementation from https://github.com/jakobrunge/tigramite), where nodes are
variables, edges denote their causal relations, and the color map represents the corresponding
CMI value. We observe that a causal link is identified among communication variables of
different agents, denoting the tight interaction strategy between agents. Interestingly, the
communications between agents 0 and 1 affect each other’s position and velocity, as is visible
in Figure 1a, which shows that the two agents decide to reach two targets close to each other,
hence they learn to communicate to safely avoid collisions.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In this paper, we exploited different XAI strategies, particularly integrated gradients and causal
discovery, to explain the role of communication in MARL with DNNs. We studied a benchmark
multi-robot navigation problem, the simple spread domain. Among different communication
protocols, including pre-defined messages based on prior task knowledge, the agents achieve
the best performance when they can learn the communication protocol, reducing the negative
impact of non-stationarity in MARL. Under the Learnable communication protocol, IG detects
state-action relations in the policy network, but does not highlight an impact of communication
messages. On the contrary, causal discovery evidences the role of communication among close
agents in the map, in order to exchange mutual position and velocity information and avoid
collisions. In the future, we will extend our study to more complex and real-world robotic
MARL domains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W.-Y. Yau,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <article-title>A brief survey: Deep reinforcement learning in mobile robot navigation</article-title>
          ,
          <source>in: 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>592</fpage>
          -
          <lpage>597</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Christianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          , Multi-Agent
          <source>Reinforcement Learning: Foundations and Modern Approaches</source>
          , MIT Press,
          <year>2024</year>
          . URL: https://www.marl-book.com.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zorzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Simão</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T. J.</given-names>
            <surname>Spaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Scalable safe policy improvement for factored multi-agent mdps</article-title>
          ,
          <source>in: Proceedings of the 41st International Conference on Machine Learning (ICML</source>
          <year>2024</year>
          ), PMLR 235,
          <string-name>
            <surname>PMLR</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>3952</fpage>
          -
          <lpage>3973</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Foerster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Assael</surname>
          </string-name>
          , N. De Freitas,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whiteson</surname>
          </string-name>
          ,
          <article-title>Learning to communicate with deep multi-agent reinforcement learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <article-title>Learning structured communication for multi-agent reinforcement learning</article-title>
          ,
          <source>Autonomous Agents and MultiAgent Systems</source>
          <volume>36</volume>
          (
          <year>2022</year>
          )
          <fpage>50</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Vouros</surname>
          </string-name>
          ,
          <article-title>Explainable deep reinforcement learning: state of the art and challenges</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Runge</surname>
          </string-name>
          ,
          <article-title>Causal network reconstruction from time series: From theoretical assumptions to practical estimation</article-title>
          ,
          <source>Chaos: An Interdisciplinary Journal of Nonlinear Science</source>
          <volume>28</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Meli</surname>
          </string-name>
          ,
          <article-title>Explainable online unsupervised anomaly detection for cyber-physical systems via causal discovery from time series*</article-title>
          ,
          <source>in: 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>4120</fpage>
          -
          <lpage>4125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Omer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shah</surname>
          </string-name>
          , G. Morgan, et al.,
          <article-title>Explainable ai (xai): Core ideas, techniques, and solutions</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundararajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Axiomatic attribution for deep networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3319</fpage>
          -
          <lpage>3328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <article-title>Continuous control with deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1509.02971</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Runge</surname>
          </string-name>
          ,
          <article-title>Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets</article-title>
          ,
          <source>in: Conference on Uncertainty in Artificial Intelligence</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>1388</fpage>
          -
          <lpage>1397</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>