Towards Multi-Agent Model-Based Reinforcement Learning in Discrete Non-Markovian Reward Decision Processes

Alessandro Trapasso¹, Marcello Bavaro², Francesco Amigoni², Luca Iocchi¹ and Fabio Patrizi¹
¹ Sapienza University of Rome
² Politecnico di Milano

AAPEI '24: 1st International Workshop on Adjustable Autonomy and Physical Embodied Intelligence, October 20, 2024, Santiago de Compostela, Spain.
trapasso@diag.uniroma1.it (A. Trapasso); marcello.bavaro@polimi.it (M. Bavaro); francesco.amigoni@polimi.it (F. Amigoni); iocchi@diag.uniroma1.it (L. Iocchi); patrizi@diag.uniroma1.it (F. Patrizi)
https://alee08.github.io/ (A. Trapasso)
ORCID: 0000-0001-5431-6607 (A. Trapasso); 0009-0008-8373-3424 (M. Bavaro); 0000-0001-8146-6213 (F. Amigoni); 0000-0001-9057-8946 (L. Iocchi); 0000-0002-9116-251X (F. Patrizi)

Abstract
Non-Markovian Reward Decision Processes have proven very effective in defining and solving complex tasks for autonomous agents, and recent work has focused on devising corresponding models and algorithms. When applied to multiple agents, they can effectively define complex multi-agent behaviors. In this paper, we discuss the main advantages of using a Multi-Agent Model-Based Reinforcement Learning approach for solving complex tasks in multi-agent systems with temporal goals. We use the challenging scenario of Multi-Agent Pickup and Delivery as a case study to illustrate the potential benefits of the proposed approach.

Keywords
Multi-Agent Systems, Model-Based Reinforcement Learning, Non-Markovian Reward Decision Processes, Multi-Agent Pickup and Delivery

1. Introduction

Reinforcement Learning (RL) has achieved exceptional results in different domains, due to its ability to deal with complex environments even with minimal prior knowledge. A crucial factor in designing RL tasks is the definition of the reward function, i.e., a real-valued function capturing the agent's task. Indeed, as task complexity grows, defining effective reward functions becomes increasingly challenging, especially when temporal sub-tasks are involved. A body of research has investigated this problem, leading to the introduction of Non-Markovian Rewards (NMRs) and Non-Markovian Reward Decision Processes (NMRDPs) as models for capturing complex tasks, as well as to the adoption of temporal logic and automata-based formalisms for the specification of NMRs – see, e.g., [1, 2, 3, 4]. A fundamental result of these works is that every NMRDP ℳ can be mapped into an equivalent Markov Decision Process (MDP) ℳ′ by representing the NMR as a suitable Deterministic Finite-state Automaton (DFA) and modeling the synchronous joint dynamics of the environment and the NMR. It has been shown that standard RL algorithms, in particular model-free ones, can be used for learning in such a joint model [4, 5]. One problem with this approach is that model-free RL does not learn the transition and reward functions of the (joint) environment, thus making it impossible for the agent to distinguish between state changes due to transitions of ℳ and those due to transitions of the NMR; this, in turn, prevents the agent from taking full advantage of the knowledge about ℳ possibly acquired through past experience. Model-based approaches suffer from a similar problem, if used off-the-shelf. This problem is even more relevant in the Multi-Agent RL (MARL) context [6], where sample efficiency and scalability are key success factors.
In this short paper, we discuss the use of model-based RL approaches in Multi-Agent Systems (MAS) with non-Markovian rewards, focusing on two key techniques: 1) decoupling and separately learning the Markovian environment transitions and rewards and the non-Markovian reward dynamics; 2) sharing the learned environment dynamics among the agents during training. These techniques lead to solution methods that are significantly more sample-efficient than standard model-free approaches. More specifically, we consider homogeneous MAS, where the agents act in the same environment but have different non-Markovian tasks. In this context, each agent can learn the environment dynamics and the reward dynamics separately, and can share the learned model of the environment dynamics with the other agents during training. This approach can greatly reduce the number of samples needed to perform the individual tasks, thus significantly increasing the efficiency of the overall multi-agent learning process. We demonstrate the applicability and the scalability of the approach with an experimental analysis over the Multi-Agent Pickup and Delivery (MAPD) domain [7], showing the ability of the approach to deal with novel variants of the problem.

Related Work. Recent active research lines concern model-based RL and RL with NMRs. Gaon and Brafman [8] describe an application of the R-max algorithm to NMRDPs which incrementally extends the MDP state space with the states of a reward automaton in order to learn an NMR. They use a state-action representation similar to that of the baseline R-max used for the experiments in this paper. In comparison, our approach exploits a factorization of the environment's dynamics to improve sample efficiency. A different approach to model-based RL for NMRDPs, based on estimating the parameters of a fractional dynamical model without explicitly representing the NMR, is proposed by Gupta et al. [9]; differently, we explicitly represent the NMR. Hierarchical structures have also been explored recently [10], enabling the agents to dynamically change strategy to adapt to complex environments. These works mainly focus on learning options [11] for reaching goals, without considering NMRs. Another line uses temporal logics over finite traces (LTL𝑓/LDL𝑓 [12]) to specify and optimize complex RL behaviors through reward shaping [4, 5]; these works focus on model-free approaches, while we consider model-based ones. Finally, a large body of work aims at learning the NMR model, e.g., [8]; while related, this is out of this paper's scope, which addresses model-based RL for given NMR specifications. To the best of our knowledge, no other work has devised model-based RL algorithms for discrete NMRDPs that factorize the environment's and the reward's dynamics to improve sample efficiency.

2. Background

A Markov Decision Process (MDP) is a tuple ℳ = (𝑆, 𝐴, 𝛿, 𝑅𝐸), with: 𝑆 the finite set of states, 𝐴 the finite set of actions, 𝛿 : 𝑆 × 𝐴 → 𝑃𝑟(𝑆) the transition function returning, for 𝑠 ∈ 𝑆 and 𝑎 ∈ 𝐴, a probability distribution 𝑃(𝑠′|𝑠, 𝑎) over 𝑆, and 𝑅𝐸 : 𝑆 × 𝐴 × 𝑆 → ℝ the reward function. A policy 𝜌 for an MDP ℳ is a function 𝜌 : 𝑆 → 𝐴. The return 𝑣𝜌(𝑠) of a policy 𝜌 from 𝑠 is 𝑣𝜌(𝑠) = 𝐸[∑𝑖≥0 𝛾𝑖 𝑟𝑖], with 𝑟𝑖 = 𝑅𝐸(𝑠𝑖, 𝜌(𝑠𝑖), 𝑠𝑖+1), 𝑠0 = 𝑠, and discount factor 𝛾 ∈ [0, 1].
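To make this notation concrete, the following minimal Python sketch (a hypothetical toy model, not taken from the paper) represents a small tabular MDP as dictionaries and estimates the return 𝑣𝜌(𝑠) of a fixed policy by Monte-Carlo rollouts; all state, action, and function names are illustrative.

```python
# A minimal tabular-MDP sketch used only to make the notation concrete
# (hypothetical toy model; not the paper's implementation).
import random

# delta[(s, a)] is a list of (next_state, probability); R_E[(s, a, s')] is the reward.
delta = {
    ("s0", "go"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "stay"): [("s0", 1.0)],
    ("s1", "go"): [("s1", 1.0)],
    ("s1", "stay"): [("s1", 1.0)],
}
R_E = {("s0", "go", "s1"): 1.0}  # unspecified transitions give reward 0

def step(s, a):
    """Sample s' ~ P(.|s, a) and return (s', R_E(s, a, s'))."""
    states, probs = zip(*delta[(s, a)])
    s_next = random.choices(states, weights=probs)[0]
    return s_next, R_E.get((s, a, s_next), 0.0)

def estimate_return(policy, s0, gamma=0.95, horizon=50, episodes=1000):
    """Monte-Carlo estimate of v_rho(s0) = E[sum_i gamma^i r_i]."""
    total = 0.0
    for _ in range(episodes):
        s, g = s0, 0.0
        for i in range(horizon):
            s, r = step(s, policy[s])
            g += (gamma ** i) * r
        total += g
    return total / episodes

print(estimate_return({"s0": "go", "s1": "stay"}, "s0"))
```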
Reinforcement Learning (RL) consists in finding an optimal policy 𝜌∗ guaranteeing maximal return 𝑣∗(𝑠) from every 𝑠 ∈ 𝑆. The 𝑄-function is 𝑄 : 𝑆 × 𝐴 → ℝ s.t. 𝑄(𝑠, 𝑎) is the return obtained by executing 𝑎 in 𝑠 and then acting optimally. Given 𝑄, an optimal policy is 𝜌∗(𝑠) = argmax𝑎 𝑄(𝑠, 𝑎). In RL, 𝛿, 𝑅𝐸, and 𝑄 are unknown. The main solution approaches are model-free and model-based, with representative algorithms Q-learning [13] and R-max [14]: the former seeks 𝜌∗ by learning 𝑄, the latter by learning 𝛿 and 𝑅𝐸.

A Non-Markovian Reward Decision Process (NMRDP) is a tuple 𝒩 = (ℳ, 𝑅𝑄), with ℳ an MDP and 𝑅𝑄 : 𝑆+ → ℝ the Non-Markovian Reward (NMR) function, returning rewards based on histories of ℳ. Several models for 𝑅𝑄 exist, e.g., Reward Machines (RM) [4] or Restraining Bolts [15]. For simplicity, we assume 𝑅𝑄 is based on a Deterministic Finite-state Automaton (DFA) 𝒜 = (𝑆, 𝑄, 𝑞0, 𝜂, 𝐹), with 𝑆 the input alphabet, 𝑄 the finite set of states, 𝑞0 ∈ 𝑄 the initial state, 𝜂 : 𝑄 × 𝑆 → 𝑄 the transition function, and 𝐹 ⊆ 𝑄 the set of final states; 𝑅𝑄 returns reward 𝑅𝑄(𝑞, 𝑠, 𝑞′) based on 𝒜's last transition (𝑞, 𝑠, 𝑞′). We adopt this form for presentation convenience only; the experiments are carried out using RMs. In an NMRDP 𝒩 = (ℳ, 𝑅𝑄) with ℳ = (𝑆, 𝐴, 𝛿, 𝑅𝐸), rewards are based on trajectories. The reward on a trajectory 𝑠0 𝑎0 ⋯ 𝑠 𝑎 𝑠′ ∈ (𝑆 × 𝐴 × 𝑆)+ is 𝑅𝐸(𝑠, 𝑎, 𝑠′) + 𝑅𝑄(𝑞, 𝑠′, 𝑞′), with 𝑞 the state reached by 𝒜 on history 𝑠0 ⋯ 𝑠 and 𝑞′ the state reached on 𝑠0 ⋯ 𝑠 𝑠′. This yields a different RL problem, for which the direct use of existing approaches is inadequate.

RL with NMRs can be reduced to the Markovian case by accessing the state of the NMR's DFA, which thus becomes a component of ℳ's state. This is detailed in [3], which shows that an NMRDP 𝒩 = (ℳ, 𝑅𝑄) is equivalent to an MDP ℳ′ = (𝑆′, 𝐴, 𝛿′, 𝑅′) modeling the synchronous execution of ℳ and 𝒜, with 𝑆′ = 𝑆 × 𝑄, 𝛿′ capturing the joint transitions of ℳ and 𝒜, and 𝑅′ : 𝑆′ × 𝐴 × 𝑆′ → ℝ a Markovian reward function. While this reduction makes standard RL solution methods suitable for non-Markovian settings (e.g., [4, 15]), such methods blur the distinction between the MDP and the DFA, possibly yielding inefficiencies due to the agent's inability to take advantage of previously acquired knowledge about states differing only in the DFA component. This work addresses this problem in a Multi-Agent System (MAS) setting.

We extend MDPs to MAS using Markov Games, consisting of: a set of agents Λ = {𝜆1, …, 𝜆𝑛}; one action set 𝐴1, …, 𝐴𝑛 for each agent; state transitions controlled by joint actions, i.e., 𝛿 : 𝑆 × 𝐴1 × ⋯ × 𝐴𝑛 → 𝑃𝑟(𝑆); and agents' individual reward functions 𝑅𝑖 : 𝑆 × 𝐴1 × ⋯ × 𝐴𝑛 → ℝ. We denote agent 𝜆𝑖's joint action as (𝑎𝑖, 𝑎̄−𝑖), with 𝑎𝑖 the action performed by 𝜆𝑖 and 𝑎̄−𝑖 the actions of all the other agents. Agents have different goals, associated with different DFAs: 𝒜𝑖 = (𝑆, 𝑄𝑖, 𝑞𝑖,0, 𝜂𝑖, 𝐹𝑖) denotes the DFA modeling 𝜆𝑖's goal, with 𝑞𝑖,0 ∈ 𝑄𝑖, 𝜂𝑖 : 𝑄𝑖 × 𝑆 → 𝑄𝑖, and 𝐹𝑖 ⊆ 𝑄𝑖. Since every agent has its own goal, each individual DFA's transitions depend only on the environment state and not on the other DFAs.

3. NMRDPs Factorization

The dynamics of an NMRDP 𝒩 = (ℳ, 𝑅𝑄) is as follows:
1. on reset, the environment is in 𝑠0 and 𝒜 performs a dummy transition 𝑞0 → 𝑞0 (consuming 𝑠0);
2. at each step, on state (𝑠, 𝑞) and action 𝑎, ℳ progresses to 𝑠′ with probability 𝑃(𝑠′|𝑠, 𝑎), 𝒜 progresses to 𝑞′ = 𝜂(𝑞, 𝑠′) with probability 1, and the returned reward is 𝑅𝐸(𝑠, 𝑎, 𝑠′) + 𝑅𝑄(𝑞, 𝑠′, 𝑞′).
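As a concrete illustration of these two steps, the following minimal Python sketch composes a generic environment step with a DFA update and sums the two reward components; the function and argument names (env_step, eta, R_E, R_Q) are illustrative placeholders, not the paper's implementation.

```python
# A minimal sketch of the joint NMRDP step just described, assuming generic
# callables env_step (the MDP), eta (the DFA transition function) and the two
# reward functions; names are illustrative, not the paper's code.
def nmrdp_reset(s0, q0):
    # On reset the environment is in s0 and the DFA takes the dummy
    # transition q0 -> q0 consuming s0.
    return (s0, q0)

def nmrdp_step(state, a, env_step, eta, R_E, R_Q):
    """One step of the joint process on state (s, q) and action a."""
    s, q = state
    s_next = env_step(s, a)          # s' sampled from P(.|s, a)
    q_next = eta(q, s_next)          # q' = eta(q, s'), deterministic
    r = R_E(s, a, s_next) + R_Q(q, s_next, q_next)
    return (s_next, q_next), r
```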
The resulting transition function is captured by the distribution 𝑃(𝑠′, 𝑞′|𝑠, 𝑞, 𝑎) = 𝑃(𝑞′|𝑠′, 𝑞, 𝑠, 𝑎) 𝑃(𝑠′|𝑠, 𝑞, 𝑎), with 𝑃(𝑠′, 𝑞′|𝑠, 𝑞, 𝑎), 𝑃(𝑞′|𝑠′, 𝑞, 𝑠, 𝑎), and 𝑃(𝑠′|𝑠, 𝑞, 𝑎) the transition probability distributions of 𝒩, 𝒜, and ℳ, respectively. Since ℳ's transitions are Markovian, 𝑃(𝑠′|𝑠, 𝑞, 𝑎) = 𝑃(𝑠′|𝑠, 𝑎), i.e., given 𝑠 and 𝑎, 𝑠′ is independent of the history that led 𝒜 to 𝑞. Also, since 𝑞′ depends only on 𝑞 and 𝑠′, 𝑃(𝑞′|𝑠′, 𝑞, 𝑠, 𝑎) = 𝑃(𝑞′|𝑠′, 𝑞). Thus, 𝒩's transition model can be rewritten as

(♢) 𝑃(𝑠′, 𝑞′|𝑠, 𝑞, 𝑎) = 𝑃(𝑠′|𝑠, 𝑎) 𝑃(𝑞′|𝑠′, 𝑞),

and the reward function as

(♡) 𝑅(𝑠, 𝑞, 𝑎, 𝑠′, 𝑞′) = 𝑅𝐸(𝑠, 𝑎, 𝑠′) + 𝑅𝑄(𝑞, 𝑠′, 𝑞′).

Note that 𝑅𝑄(𝑞, 𝑠′, 𝑞′) is independent of 𝑠 and 𝑎, i.e., of how (𝑠′, 𝑞′) is reached from the previous state (𝑠, 𝑞). Equations (♢) and (♡) factorize the system dynamics, decoupling the Markovian environment dynamics (𝑃(𝑠′|𝑠, 𝑎), 𝑅𝐸(𝑠, 𝑎, 𝑠′)) from the non-Markovian reward expressed by the DFA (𝑃(𝑞′|𝑠′, 𝑞), 𝑅𝑄(𝑞, 𝑠′, 𝑞′)). Discrete model-based RL can be applied in the joint state space 𝑆 × 𝑄 to 𝑃(𝑠′, 𝑞′|𝑠, 𝑞, 𝑎) and 𝑅(𝑠, 𝑞, 𝑎, 𝑠′, 𝑞′). An algorithm for this is R-max [8, 14], which estimates the model by counting the number of visits 𝑛(𝑠, 𝑞, 𝑎) and using a visit threshold to determine known state-action pairs, thus enabling the estimation of the model functions. However, this solution does not scale with task complexity, as it does not exploit the Markovian nature of the environment.

Here, we focus on homogeneous collaborative MASs where the agents share the same environment transition and reward functions. Given the set of agents Λ = {𝜆1, …, 𝜆𝑛}, for every 𝜆𝑖, 𝜆𝑗 we have 𝑃(𝑠′|𝑠, 𝑎𝑖, 𝑎̄−𝑖) = 𝑃(𝑠′|𝑠, 𝑎𝑗, 𝑎̄−𝑗) and 𝑅𝐸(𝑠, 𝑎𝑖, 𝑎̄−𝑖, 𝑠′) = 𝑅𝐸(𝑠, 𝑎𝑗, 𝑎̄−𝑗, 𝑠′). Agent-independence is denoted by 𝑃(𝑠′|𝑠, 𝑎∗, 𝑎̄−∗) and 𝑅𝐸(𝑠, 𝑎∗, 𝑎̄−∗, 𝑠′). On the other hand, agents have different tasks, modeled by individual DFAs 𝒜𝑖. As a result, for every 𝜆𝑖, the NMRDP can be factorized by

(♣) 𝑃(𝑠′, 𝑞𝑖′|𝑠, 𝑞𝑖, 𝑎𝑖, 𝑎̄−𝑖) = 𝑃(𝑠′|𝑠, 𝑎∗, 𝑎̄−∗) 𝑃(𝑞𝑖′|𝑠′, 𝑞𝑖) and
(♠) 𝑅(𝑠, 𝑞𝑖, 𝑎𝑖, 𝑎̄−𝑖, 𝑠′, 𝑞𝑖′) = 𝑅𝐸(𝑠, 𝑎∗, 𝑎̄−∗, 𝑠′) + 𝑅𝑄(𝑞𝑖, 𝑠′, 𝑞𝑖′).

The multi-agent state space 𝑆 and the joint action space 𝐴1 × ⋯ × 𝐴𝑛 significantly impact performance and scalability. When combined with NMRDPs, the state representation is further expanded with the states of the DFAs modeling the tasks. To efficiently tackle this problem, we consider independent DFAs, allowing the state of each agent 𝜆𝑖 to be extended to 𝑆𝑖′ = 𝑆 × 𝑄𝑖. In this way, the factorization defined by Equations (♣) and (♠), together with the agent-independence of 𝑃(𝑠′|𝑠, 𝑎∗, 𝑎̄−∗) and 𝑅𝐸(𝑠, 𝑎∗, 𝑎̄−∗, 𝑠′), provides a significantly higher sample efficiency. Using the above formulation, it is possible to devise algorithms for sample-efficient MARL in NMRDPs. More specifically, such algorithms will implement and exploit the techniques mentioned above: 1) separate learning of the Markovian environment dynamics and the non-Markovian reward dynamics; 2) sharing of the environment dynamics among the agents during training. In this short paper, we do not describe the details of any algorithm, since our main goal is to focus on the basic techniques described above.
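Purely as an illustration of how these two techniques could be realized in a tabular setting (a hypothetical sketch, not the paper's algorithm), the Python fragment below keeps the environment statistics for 𝑃(𝑠′|𝑠, 𝑎∗, 𝑎̄−∗) and 𝑅𝐸 separate from the per-agent DFA statistics, and lets the environment part be handed over to the next agent; for simplicity it assumes the two reward components can be observed separately (the RM is given, so 𝑅𝑄 can be read off its transitions). The class and method names are invented for this sketch.

```python
# An illustrative (hypothetical) tabular sketch of the two techniques:
# (1) counting environment transitions P(s'|s,a) separately from the per-agent
#     DFA transitions P(q'|s',q), and (2) sharing the environment counts.
from collections import defaultdict

class FactoredModel:
    def __init__(self, known_threshold=100, shared_env=None):
        # Environment statistics (shared across agents when passed in).
        env = shared_env or {}
        self.env_counts = env.get("counts", defaultdict(lambda: defaultdict(int)))
        self.env_reward = env.get("reward", {})
        # Per-agent DFA statistics (each agent has its own reward machine).
        self.dfa_counts = defaultdict(lambda: defaultdict(int))
        self.dfa_reward = {}
        self.known_threshold = known_threshold

    def update(self, s, q, a, s_next, q_next, r_env, r_dfa):
        # Record one experienced joint transition, split into its two factors.
        self.env_counts[(s, a)][s_next] += 1
        self.env_reward[(s, a, s_next)] = r_env
        self.dfa_counts[(q, s_next)][q_next] += 1
        self.dfa_reward[(q, s_next, q_next)] = r_dfa

    def env_known(self, s, a):
        # An (s, a) pair is "known" once visited at least known_threshold times.
        return sum(self.env_counts[(s, a)].values()) >= self.known_threshold

    def p_env(self, s, a, s_next):
        # Empirical estimate of P(s'|s, a) from the counts.
        n = sum(self.env_counts[(s, a)].values())
        return self.env_counts[(s, a)][s_next] / n if n else 0.0

    def share_env(self):
        # Hand the environment part of the model to the next agent.
        return {"counts": self.env_counts, "reward": self.env_reward}
```

Under this sketch, the next agent would be built as FactoredModel(shared_env=previous_model.share_env()), so only its own DFA statistics start empty while the environment estimates are reused.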
However, in the next section we provide some preliminary results obtained using a simple approach based on incremental individual training, in which each agent 𝜆𝑖 learns the environment dynamics 𝑃(𝑠′|𝑠, 𝑎∗, 𝑎̄−∗) and 𝑅𝐸(𝑠, 𝑎∗, 𝑎̄−∗, 𝑠′) and the reward dynamics 𝑃(𝑞𝑖′|𝑠′, 𝑞𝑖) and 𝑅𝑄(𝑞𝑖, 𝑠′, 𝑞𝑖′) to accomplish its task, and shares the agent-independent learned environment dynamics with the other agents. The name MAQR-max is used in the next section to refer to this solution. This approach improves sample efficiency, since the environment transitions and the automaton transitions are learned separately and, once the environment model is stable, there is no need to keep learning it. When dealing with different (and possibly more complex) tasks in the same environment, expressed by different automata, it is sufficient to update the non-Markovian model, leveraging the previously learned environment estimates. In this way, the agents easily scale to more complex tasks by reusing the learned environment models.

4. Case Study: Multi-Agent Pickup and Delivery

Multi-Agent Pickup and Delivery (MAPD) [7] is an online problem where a team of coordinated agents needs to fulfill a set of dynamically incoming pickup and delivery tasks. MAPD has several real-world applications, such as automated warehouse logistics [16], coordination of autonomous vehicles [17], and automated control of non-player characters in video games [18]. Formally, MAPD involves 𝑘 agents in an environment modeled as an undirected connected graph 𝐺 = (𝑉, 𝐸), whose vertices 𝑉 represent locations and whose edges 𝐸 represent connections. In this paper, 𝐺 is a 4-connected regular grid and time is discrete. An agent can either remain in its current location or move to any adjacent one. Actions last one timestep and, here, we assume they can fail, leaving the agent in its current location. A task set 𝒯 contains all unassigned tasks, and new ones can be dynamically added at any timestep. Each task 𝜏𝑗 ∈ 𝒯 comprises a pickup and a delivery location. When an agent is assigned a task, it plans a path on 𝐺 to reach, from its current location, first the pickup and then the delivery location. Agents must not collide: distinct agents cannot be in the same location, or traverse the same edge in opposite directions, at the same time. We consider a problem variant where 𝒯 contains only 𝑘 tasks, one per agent, all initially known. The aim of MAPD is planning paths that complete all the tasks in the shortest time. We evaluate solution quality as makespan (the number of timesteps needed to complete all the tasks).

Here, we adopt an RL-based approach, modeling MAPD as a deterministic discrete NMRDP where each agent has its own non-Markovian reward, modeled as an RM, defining the agent's goals and their fulfillment order; these goals require an agent to reach first the pickup and then the delivery location. Each agent learns its own policy using MAQR-max. In the NMRDP model, an agent's state includes its current location, the current timestep, and the progress, tracked by the RM, in fulfilling the assigned tasks. Such a compound state is (𝑠, 𝑞) ∈ 𝑆 × 𝑄, with 𝑠 = (𝑥, 𝑦, 𝑡) the agent's position (𝑥, 𝑦) on the grid (a vertex of graph 𝐺) at timestep 𝑡, and 𝑞 the state of the agent's RM, tracking the progress towards its tasks. RM transitions between states depend on the occurrence of pickup, delivery, and collision events. A pickup event occurs (instantaneously) when the agent reaches the pickup location; delivery events occur similarly, but only if the pickup event has already occurred.
Collision events occur when two agents are in the same position (𝑥, 𝑦) at the same timestep 𝑡. The RM rewards the transitions firing on pickup and delivery events, while it penalizes those occurring on collisions. More precisely, the RM capturing an agent's non-Markovian goal is such that, on event 𝑒, a transition (𝑞, 𝑒) → (𝑞′, 𝑟𝑄) takes place from 𝑞 to 𝑞′, returning a reward value 𝑟𝑄.

Figure 1: The MAPD environments: (a) the 13×13 map, (b) the 8×8 map, and (c) the priority legend.

4.1. Experimental Setup

The MAPD environments we use consist of grids with walls and narrow corridors, which challenge the agents' coordination abilities. As discussed, each agent's observation space includes its current position, the timestep, and the RM state. The action space comprises the four cardinal directions, thus having size 4 (note that we do not consider wait actions). Pickup and delivery actions are implicitly assumed to be executed automatically and instantaneously when the pickup and delivery locations are reached. Executing an action that would make an agent hit an obstacle or a boundary has no effect. In order to penalize adjacent agents that swap positions simultaneously, each agent's trajectory is augmented with an additional timestep for every recorded position (details omitted for space reasons). Agents are trained incrementally, one after another, taking into account, in the reward signal, the potential occurrence of collisions stemming from the learned policies of the agents trained at previous iterations. Agents are assigned a priority corresponding to the training order: an agent receives a penalty whenever, at training time, it occupies the same position that another agent with higher priority would occupy, at the same timestep, when executing its learned policy. We used two distinct environments, an 8×8 and a 13×13 map (see Figs. 1a and 1b), with tasks modeled as RMs and assigned before training. In both environments, priorities decrease from 1 to 5. Agents are assigned priorities and tasks as illustrated in Fig. 1c, where the first and second cells reported in column "Pickup & Delivery" are the pickup and delivery locations, respectively. Observe how this approach allows for flexibly defining complex MAPD tasks. For instance, one could conveniently define a problem where the agents must perform multiple pickups and deliveries under ordering constraints on their visits.

4.2. Experimental Results

In this section, we report the results of some preliminary experiments aimed at discussing the potential benefits of algorithms exploiting the Model-Based Multi-Agent RL techniques described above. We carried out tests to evaluate sample efficiency and environment exploration rate, using a CPU i7-11800H at 2.30 GHz and 16 GB of RAM. Actions succeed (moving to the intended cell) with probability 0.6 and fail (remaining still) with probability 0.4. The threshold for a pair (𝑠, 𝑎) to be considered known is 100 visits. Every 50 episodes, the current optimal policy is evaluated over 100 executions. Fig. 2 shows the performance of MAQR-max on the 8×8 MAPD domain (Fig. 1b) with 4 agents, in the basic scenario where agents share no information (transition probabilities) about the environment. Fig. 3 illustrates the results for the same scenario when agents benefit from the transition probability function learned by the previous agents. A significant improvement in sample efficiency is obtained, in particular, by Agents 3 and 4, which show a clear decrease in convergence time and an increase in solution quality (timesteps taken to complete the task).
Fig. 4 shows the results for the same experiment, now with Agent 1 inheriting Agent 4's learned transitions (70.33%). Agents 1 and 2 converge more rapidly, demonstrating scalability. The increasing fraction of known transitions after each training phase further underscores the importance of knowledge transfer.

Figure 2: MAPD 8×8 without Transfer Learning.
Figure 3: MAPD 8×8 with Transfer Learning.
Figure 4: Agent 1 starts with 70% known transitions.
Figure 5: MAPD 13×13 without Transfer Learning.
Figure 6: MAPD 13×13 with Transfer Learning.

We have also tested the approach in a 13×13 scenario (Figs. 5 and 6), to preliminarily assess scalability. A behavior similar to that of the 8×8 scenario is observed, with the agents achieving a smaller makespan and requiring fewer episodes to learn the optimal policy than in the case without transfer learning. These results demonstrate the feasibility of our approach. Future refinements may further improve performance and scalability.

5. Conclusion

We have presented a formalization of Multi-Agent Model-Based Reinforcement Learning for NMRDPs that, by decoupling the Markovian transitions of the environment from the non-Markovian evolution of the reward and by sharing the learned environment model among the agents during training, allows the definition of sample-efficient RL algorithms for solving complex multi-agent problems. The general applicability of the proposed solution has been demonstrated by successfully addressing the Multi-Agent Pickup and Delivery problem. A more detailed investigation of algorithms exploiting these properties is left as future work.

Acknowledgements

Work partly supported by the PNRR MUR project PE0000013-FAIR and the Sapienza project MARLeN (Multi-layer Abstraction for Reinforcement Learning with Non-Markovian Rewards).

Declaration on Generative AI

The authors have not employed any Generative AI tools.

References

[1] F. Bacchus, C. Boutilier, A. J. Grove, Rewarding behaviors, in: Proc. AAAI, 1996, pp. 1160–1167.
[2] S. Thiébaux, C. Gretton, J. K. Slaney, D. Price, F. Kabanza, Decision-theoretic planning with non-Markovian rewards, J. Artif. Intell. Res. 25 (2006) 17–74.
[3] R. Brafman, G. De Giacomo, F. Patrizi, LTLf/LDLf non-Markovian rewards, in: Proc. AAAI, 2018, pp. 1771–1778.
[4] R. T. Icarte, T. Q. Klassen, R. A. Valenzano, S. A. McIlraith, Reward machines: Exploiting reward function structure in reinforcement learning, J. Artif. Intell. Res. 73 (2022) 173–208.
[5] G. De Giacomo, L. Iocchi, M. Favorito, F. Patrizi, Reinforcement learning for LTLf/LDLf goals, CoRR (2018). URL: http://arxiv.org/abs/1807.06333.
[6] S. Albrecht, F. Christianos, L. Schäfer, Multi-Agent Reinforcement Learning: Foundations and Modern Approaches, MIT Press, 2024.
[7] H. Ma, J. Li, T. Kumar, S. Koenig, Lifelong multi-agent path finding for online pickup and delivery tasks, in: Proc. AAMAS, 2017, pp. 837–845.
[8] M. Gaon, R. I. Brafman, Reinforcement learning with non-Markovian rewards, in: Proc. AAAI, 2020, pp. 3980–3987.
[9] G. Gupta, C. Yin, J. V. Deshmukh, P. Bogdan, Non-Markovian reinforcement learning using fractional dynamics, in: Proc. CDC, 2021, pp. 1542–1547.
[10] P. Bacon, J. Harb, D. Precup, The option-critic architecture, CoRR (2016). URL: http://arxiv.org/abs/1609.05140.
[11] R. S. Sutton, D. Precup, S. Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artif. Intell. 112 (1999) 181–211.
[12] G. De Giacomo, M. Y. Vardi, Linear temporal logic and linear dynamic logic on finite traces, in: Proc. IJCAI, 2013, pp. 854–860.
[13] C. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (1992) 279–292.
[14] R. I. Brafman, M. Tennenholtz, R-max – a general polynomial time algorithm for near-optimal reinforcement learning, J. Mach. Learn. Res. 3 (2003) 213–231.
[15] G. De Giacomo, L. Iocchi, M. Favorito, F. Patrizi, Restraining bolts for reinforcement learning agents, in: Proc. AAAI, volume 34, 2020, pp. 13659–13662.
[16] P. R. Wurman, R. D'Andrea, M. Mountz, Coordinating hundreds of cooperative, autonomous vehicles in warehouses, AI Mag. 29 (2008) 9–20.
[17] M. Veloso, J. Biswas, B. Coltin, S. Rosenthal, CoBots: Robust symbiotic autonomous mobile service robots, in: Proc. IJCAI, 2015, pp. 4423–4429.
[18] H. Ma, J. Yang, L. Cohen, T. S. Kumar, S. Koenig, Feasibility study: Moving non-homogeneous teams in congested video game environments, in: Proc. AIIDE, 2017, pp. 270–272.