<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Imitation Learning of Logical Program Policies for Multi-Agent Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Eberhardinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Maucher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Setareh Maghsudi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hochschule der Medien Stuttgart</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Tuebingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work shows that it is possible to learn Logical Program Policies (LPP) for the Multi-Agent Reinforcement Learning environment Level-based Foraging. Policies are derived from decision trees that use a logical combination of small feature detector programs generated by a probabilistic context-free grammar. We show that these programmatic policies make it possible to explain the decisions of all agents in the environment and can also be used on all other environments without retraining. This is achieved by creating not just one program for each agent, but several, by sampling learned programs and selecting the best programs according to the approximate reward in the specific environment the program was learned on. By using LPP to represent the policies of the agents, the policies are interpretable and also extrapolate to unseen environments. Code: https://github.com/ManuelEberhardinger/LPP-MARL</p>
      </abstract>
      <kwd-group>
        <kwd>Program Synthesis</kwd>
        <kwd>Multi-Agent Reinforcement Learning</kwd>
        <kwd>Imitation Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Programmatic reinforcement learning is an interesting research direction to make black-box deep reinforcement learning (DRL) policies interpretable. By using a synthesized program to control an agent, the behaviour of the agent becomes interpretable, as the generated program can be studied and verified by experts before it is used in dangerous environments. Additionally, the program is controllable, as software developers can adjust it to their own needs. Programs also tend to extrapolate to out-of-distribution data, as they are not restricted to specific input sizes like neural network models. Nevertheless, learning a programmatic policy directly is very hard, and most works use a form of Imitation Learning and a customised domain-specific language (DSL) to reduce the search space and make the synthesis of a program feasible [1, 2, 3]. We show that it is possible to learn programs for multiple agents by using a strong prior in the DSL so that the search for programs becomes feasible. Programs are learnt by using Imitation Learning after collecting data from a black-box neural network policy.</p>
      <p>
        In this work, we propose to use Logical Program Policies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to generate programs for the Multi-Agent
Reinforcement Learning (MARL) environment Level-based Foraging (see Figure 2) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Figure
1 shows a high-level overview of all necessary steps to create programs for a given MARL
environment.
      </p>
      <p>We show that the generated logical programs are interpretable and also extrapolate to unseen
environments, as the programs are not restricted by grid sizes, the number of players or the
number of objects. This makes programs a good alternative to black-box neural network
policies, which must be retrained if the details of the environment change, even though the
synthesized programs do not achieve the same performance as the DRL policy that was
trained until convergence and used for data collection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Program Synthesis and Programmatic Policies</title>
        <p>
          Program synthesis has a long history in the artificial intelligence
research community [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. In recent years, more and more research
has been conducted on combining deep learning and program
synthesis to make program search more feasible by reducing or guiding
the search space [
          <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
          ]. Another promising method is to learn
a library of functions from previously solved problems. These
functions are then reusable in an updated DSL to solve more difficult
problems [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ].
        </p>
        <p>
          Existing methods for creating programmatic policies mostly use
Imitation Learning. Verma et al. use program sketches and a neural
network oracle to synthesise programmatic policies [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ]. Other
work synthesised finite state machines to represent policies [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] or
extracted policies from decision trees [
          <xref ref-type="bibr" rid="ref13 ref4">4, 13</xref>
          ]. Recent promising work
also shows that it is possible to synthesise programmatic policies
directly [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], since distilling policies into a program always leads
to performance degradation on a given task.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multi-Agent Reinforcement Learning</title>
        <p>[Figure 1: High-level overview of the pipeline — MARL Environment, Train Neural Network, Collect Demonstrations, Imitation Learning, Logical Program Policies.]</p>
        <p>
            The goal of MARL is to jointly train multiple agents to solve a given task by learning how
to behave in a shared environment. There exists a variety of paradigms for learning MARL
algorithms by adapting DRL methods to multiple agents. One approach is to train the agents
independently or with sharing of information [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. The latter paradigm is also known as
Centralised Training Decentralised Execution [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ].
          </p>
          <p>
            Program synthesis research for MARL was first conducted to derive programmatic
communication rules by imitating a learned transformer communication policy [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. To the best of
our knowledge, there exists no work that tests the extrapolation qualities of learned policies on
unseen environments.
          </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Level-based Foraging</title>
        <p>
          The implementation of the Level-based Foraging environment that is used in this work was
introduced in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] (see Figure 2). Each agent and fruit has a specific level. Possible actions are
moving in each direction or picking up a fruit. The goal of the agents is to pick up all fruits, but
an agent can only pick up a fruit if its level is greater than or equal to the level of the fruit. If the
level of the fruit is higher than the agent's level, multiple agents need to cooperate to pick up
the fruit. Therefore this is a mixed cooperative-competitive environment, as agents compete
in picking up fruits but also need to cooperate for some fruits. The reward is divided among the
agents that were involved in foraging the fruit.
        </p>
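        <p>For illustration only, the pick-up rule described above can be sketched as follows; the exact reward normalisation of the benchmark implementation [19] may differ, so the split used here is an assumption.</p>
        <p>def try_pickup(fruit_level, loading_agent_levels):
    """Hedged sketch of the foraging rule: the fruit is collected only if the
    combined level of the cooperating agents reaches the fruit level; the
    reward is then split among them (equal split assumed here)."""
    total = sum(loading_agent_levels)
    if total >= fruit_level:
        share = fruit_level / len(loading_agent_levels)
        return [share] * len(loading_agent_levels)
    return [0.0] * len(loading_agent_levels)

# two level-1 agents cooperating on a level-2 fruit
print(try_pickup(2, [1, 1]))  # -> [1.0, 1.0]</p>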
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        This work adapts the Bayesian Imitation Learning method
Logical Program Policies, which calculates a posterior distribution
over logical formulas h for MARL environments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
logical formula consists of small feature detector programs,
denoted as f, that are drawn from a probabilistic context-free
grammar (PCFG) G. The goal of the feature detectors is to
parametrise the state-action pairs of the expert demonstrations
D with boolean values indicating whether the given action was taken in the given
state: f(s, a) : S × A → {0, 1}.
      </p>
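      <p>To make the role of the feature detectors concrete, the following minimal sketch (in Python, not taken from the paper's implementation) shows the type of function a detector corresponds to; the state layout and the action encoding are assumptions made for illustration.</p>
      <p>from typing import Callable, Dict, Any

# A feature detector maps a (state, action) pair to a boolean value in {0, 1}.
State = Dict[str, Any]          # assumed: a dict-like grid observation
Action = int                    # assumed: discrete action index
FeatureDetector = Callable[[State, Action], int]

ACTION_EAST = 3                 # hypothetical action encoding

def example_detector(state: State, action: Action) -> int:
    """Hypothetical detector: the chosen action moves east while a fruit
    lies east of the agent. Returns 1 if the feature holds, else 0."""
    return int(action == ACTION_EAST and state.get("fruit_is_east", False))</p>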
      <p>Actions taken in a given state represent positive examples. In
the original method, all actions not taken in a specific state were
added as negative examples. However, this is not applicable for
our purpose, since in a given state several actions are possible
that lead to the same goal, namely to approach a fruit in order to pick it up. Therefore, we add
as negative examples only actions that increase the distance of the agent from the fruit.</p>
      <p>[Figure 2: The environment Level-based Foraging with two players and two fruits.]</p>
      <p>
        This newly created dataset can then be used to train a binary decision tree classifier. The
logical formula is represented in disjunctive normal form and can be extracted from this
classifier as a combination of a finite subset of feature detector programs:
ĥ(s, a) = (f_{1,1}(s, a) ∧ ... ∧ f_{1,M}(s, a)) ∨ ... ∨ (f_{N,1}(s, a) ∧ ... ∧ f_{N,M}(s, a)),
(1)
where N is the number of the demonstrations and M is the number of the feature detector
programs. Negation of a feature detector f is also possible. These extracted formulas can be converted into a flow
chart that makes the decisions of the agents interpretable for humans [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
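      <p>As an illustration of how such a formula can be read off a trained classifier, the sketch below walks a fitted scikit-learn decision tree over boolean detector outputs and collects one conjunctive clause per positive leaf; this is a minimal reconstruction of the idea, not the authors' code, and the detector names are hypothetical.</p>
      <p>import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_dnf(clf, feature_names):
    """Return a DNF as a list of clauses; each clause is a list of
    (feature_name, negated) literals leading to a leaf that predicts 1."""
    tree = clf.tree_
    clauses = []

    def recurse(node, literals):
        if tree.children_left[node] == -1:  # leaf node
            if np.argmax(tree.value[node][0]) == 1:  # positive class
                clauses.append(list(literals))
            return
        name = feature_names[tree.feature[node]]
        # left branch: the boolean feature is 0, so the literal is negated
        recurse(tree.children_left[node], literals + [(name, True)])
        # right branch: the boolean feature is 1
        recurse(tree.children_right[node], literals + [(name, False)])

    recurse(0, [])
    return clauses

# toy usage with two hypothetical detectors
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 0, 1, 0])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(extract_dnf(clf, ["fruit_is_east", "fruit_is_west"]))</p>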
      <sec id="sec-3-1">
        <title>3.1. Approximating the posterior</title>
        <p>
          The following equations (adapted for our purpose from [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) show how to approximate the
posterior distribution from the prior P(h) and the likelihood P(D | h), where D are the expert
demonstrations:
P(h) ≈ P(h | D)
(2)
∀ 0 ≤ α ≤ 1 : P(h | D) = α · P(h) + P(D | h) − P(D)
(3)
P(h) = the log probability of the feature detectors generated by the PCFG G
(4)
P(D | h) = ln( (1/k) ∑_{i=1}^{k} ∑_{j} r_{i,j} )
(5)
P(D) = the log probability of the evidence, e.g. the provided demonstrations
(6)
The prior of the distribution is defined as the weighted log probability of the generated
feature detectors from the PCFG. Since α is defined to decrease the importance of the length of
the programs, it must lie between 0 ≤ α ≤ 1. We use α = 0.1 in our experiments, as the
environment is more complex than the grid games used in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This decreases the importance
of the length of the programs, since very short feature detectors are not able to find good
abstractions for our problem. The likelihood is calculated from the log of the average reward
each agent received by following a given policy π for k runs. We set k = 100 to get a good
approximation of how good the program is on the given environment. Equation 3 adds and
subtracts the probabilities instead of multiplying and dividing them, since we work with log
probabilities.
        </p>
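        <p>A compact way to read equations 2–6 is as a scoring function in log space; the sketch below combines a weighted log prior with a log-likelihood estimated from rollout rewards. It is an illustrative approximation under our reading of the equations, with the log evidence treated as a constant, and the example numbers are invented.</p>
        <p>import math

def posterior_score(log_prior, episode_rewards, alpha=0.1, log_evidence=0.0):
    """Score a candidate formula h: alpha * log P(h) + log P(D|h) - log P(D).
    `episode_rewards` are the rewards (summed over agents) of k evaluation
    runs; their mean is used as the basis of the log-likelihood (Eq. 5)."""
    k = len(episode_rewards)
    mean_reward = sum(episode_rewards) / k
    log_likelihood = math.log(max(mean_reward, 1e-8))  # guard against log(0)
    return alpha * log_prior + log_likelihood - log_evidence

# example: a short program (high prior) evaluated over k = 100 runs
print(posterior_score(log_prior=-4.2, episode_rewards=[0.8] * 100))</p>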
        <p>
          To better approximate the posterior, Silver et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] use m logical formulas h_1, ..., h_m in one
LPP, as a weighted mixture of experts with each formula weighted equally. In our experiments
we set m = 25. To provide a more diverse distribution of formulas h to select from, we train multiple decision
trees with different random seeds. In our experiments we choose to train 5 decision trees. This
improves the performance of LPP, since a single formula hardly captures the complete structure
of the state space. The selection of m formulas is implemented by sampling a subset of m
programs from all generated decision trees and empirically evaluating which subset returns
the highest average reward after k test runs of the environment. The more often the decision
trees are sampled, the better the approximation of the best LPP. We run this sampling 100 times in our
experiments.
        </p>
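        <p>The selection step can be sketched as a simple random search over subsets of formulas; `evaluate_avg_reward` stands in for rolling out the resulting mixture policy for k episodes and is an assumption, not the paper's implementation.</p>
        <p>import random

def select_best_lpp(formulas, evaluate_avg_reward, m=25, n_samples=100, seed=0):
    """Sample subsets of m formulas from all decision trees and keep the
    subset with the highest estimated average reward."""
    rng = random.Random(seed)
    best_subset, best_reward = None, float("-inf")
    for _ in range(n_samples):
        subset = rng.sample(formulas, min(m, len(formulas)))
        reward = evaluate_avg_reward(subset)
        if reward > best_reward:
            best_subset, best_reward = subset, reward
    return best_subset, best_reward</p>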
        <p>To use LPP in MARL, we store all combinations of evaluated m logical formulas in a list and,
instead of returning only the best policy, return the same number of policies as there are agents.</p>
        <p>The following equation derives the final policy π*(s) used at test time:
π*(s) = arg max_{a ∈ A} E[r(s, a)] = arg max_{a ∈ A} ∑_{h ∈ H} P(h) h(a | s)
(7)
In words, each selected logical formula calculates, for all actions, the probability with which an
action a should be taken in a given state s. The probabilities for the same action are added and
then the action with the highest probability is returned.</p>
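        <p>A minimal sketch of equation 7: each selected formula votes with its probability for every action, and the action with the highest summed score is executed. The calling convention h(a, s) used below is an assumption.</p>
        <p>def act(state, formulas, actions):
    """Equally weighted mixture over the selected formulas (Eq. 7):
    sum each formula's probability per action and take the argmax."""
    scores = {a: sum(h(a, state) for h in formulas) for a in actions}
    return max(scores, key=scores.get)

# toy usage: two hypothetical formulas over actions {0: north, 3: east}
h1 = lambda a, s: 1.0 if a == 3 else 0.0
h2 = lambda a, s: 0.5
print(act(state={}, formulas=[h1, h2], actions=[0, 3]))  # -> 3</p>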
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Domain-specific Language for the feature detectors</title>
        <p>The DSL was updated according to the requirements of the environment. We added more
specific perceptual primitives that the feature detector programs can use to represent the state
succinctly. These are the functions that can be used to determine in which direction the fruit is
located from the agent (see Table 3 in the appendix). We also had to include the agent’s position
information as a parameter in the function so that the program would know which agent was
being controlled.</p>
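        <p>To give a flavour of the perceptual primitives, the sketch below shows how two such functions might look; the fruit encoding, the action index and the exact signatures are assumptions mirroring the style of the primitives in Figure 5 and Table 3, not the actual DSL code.</p>
        <p>ACTION_EAST = 4  # hypothetical action index

def fruit_is_east(state, pos):
    """True if any fruit lies strictly east of the agent at grid position `pos`;
    state["fruits"] is assumed to hold (row, col) fruit positions."""
    row, col = pos
    return any(f_row == row and f_col > col for f_row, f_col in state["fruits"])

def action_is_east(condition, action, state, pos):
    """Filter primitive: the action moves east and the given condition holds."""
    return bool(condition) and action == ACTION_EAST

# usage in the style of the learned formulas:
state = {"fruits": [(2, 4)]}
print(action_is_east(fruit_is_east(state, (2, 1)), ACTION_EAST, state, (2, 1)))  # True</p>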
      </sec>
      <sec id="sec-3-3">
        <title>3.3. The neural network oracle</title>
        <p>
          For training the neural network policy, we chose to use the IQL algorithm [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as reported in
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which achieved perfect reward on both 8x8 grid-size environments. IQL is an adaptation of
the Q-Learning algorithm [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] for multiple agents, in which each agent has an independent
state-action value function. The loss function is the default Q-learning loss, which is independently
minimised for each agent’s local past observations and actions [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. We trained the model with
the default parameters reported in the paper on the smallest environment with a grid size of 5x5
with two players and two fruits. This model was then used to collect data for generation of the
LPPs.
        </p>
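        <p>For intuition, the tabular sketch below shows the per-agent update that the IQL loss minimises in its simplest form; the paper's oracle is a deep, parameterised variant trained as in [17], so this is only an illustration of the independent-learning idea, not that implementation.</p>
        <p>def iql_update(q_tables, obs, actions, rewards, next_obs, done,
               lr=0.1, gamma=0.99):
    """Independent Q-learning: each agent i updates its own value function
    from its local observation, treating the other agents as part of the
    environment. q_tables[i] maps observation -> {action: value}."""
    for i, q in enumerate(q_tables):
        current = q[obs[i]][actions[i]]
        bootstrap = 0.0 if done else gamma * max(q[next_obs[i]].values())
        td_error = rewards[i] + bootstrap - current
        q[obs[i]][actions[i]] = current + lr * td_error</p>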
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Summary of the algorithm</title>
        <p>Algorithm 1 shows a high-level overview of all necessary steps to generate programs with LPP
for the Level-based Foraging environment. First, we need an oracle to create the positive expert
demonstrations. Then we can create negative examples from the positive ones. The next step is
to generate all feature detector programs from the PCFG G and calculate the corresponding prior. Then a
dataset is created by running all feature detectors on all demonstrations. This dataset is used
for training multiple decision trees. The trained classifiers are evaluated to approximate the
likelihood of the models using equation 5. After computing the posterior distribution with
equation 3, we can select the best models, which are then converted into specific LPP policies by
equation 7.</p>
        <p>Algorithm 1: The complete algorithm for generating a program with LPP</p>
        <p>Input : neural network oracle π, ensemble size E, number of demos d and programs p
Output : the n best LPP policies
1 α = 0.1; // the importance of the prior
2 D = collect_demos_with_oracle(π, d);
3 A = generate_anti_demos(D);
4 F = generate_feature_detectors_from_dsl(G);
5 priors = calculate_priors(F);
6 X = {(f_1(s, a), ..., f_p(s, a)) : (s, a) ∈ D ∪ A} ; // generate data for learning decision trees
7 y = {y_0, ..., y_n} ; // boolean truth values derived from D ∪ A are the targets with y ∈ {0, 1}
8 models = train_decision_tree_classifiers(X, y, E);
9 likelihoods = evaluate_likelihoods(models);
10 posterior = α · priors + likelihoods - ln(P(evidence));
11 LPPs = select_best_lpp_models(models, posterior);
12 return LPPs;</p>
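        <p>Read as plain Python, Algorithm 1 amounts to the driver below; every callable argument is an assumption standing in for a component described above, and the log evidence is dropped because it only shifts all scores by a constant.</p>
        <p>def generate_lpp_policies(collect_demos, generate_anti_demos, sample_detectors,
                          log_prior, train_tree, evaluate_log_likelihood,
                          n_trees=5, alpha=0.1):
    """Sketch of Algorithm 1: featurise demonstrations with the sampled
    detectors, fit an ensemble of decision trees and rank them by the
    approximate posterior (Eq. 3)."""
    demos = collect_demos()            # positive (state, action) pairs from the oracle
    anti = generate_anti_demos(demos)  # actions that increase the distance to the fruit
    detectors = sample_detectors()     # feature detector programs from the PCFG

    X = [[f(s, a) for f in detectors] for (s, a) in demos + anti]
    y = [1] * len(demos) + [0] * len(anti)

    scored = []
    for seed in range(n_trees):
        model = train_tree(X, y, seed)
        score = alpha * log_prior(model) + evaluate_log_likelihood(model)
        scored.append((score, model))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [model for _, model in scored]</p>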
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In the experiments we want to compare the extrapolation capabilities of the neural network
oracle and the created LPP policies. As extrapolation we define all environments that are larger than the
one which was used to train the neural network model. Training models on bigger grid sizes
and then testing them on smaller ones will not test the extrapolation capabilities, as the smaller
environments are just a subset of the larger environment; this would be interpolation of the
state space and not extrapolation. The trained model is then used as an oracle and also as a
comparison baseline.</p>
      <p>[Table 1: Results — average reward of the LPP policies and the neural network oracle on
Foraging-grid-5x5-2p-2f-v2, Foraging-grid-8x8-2p-4f-v2, Foraging-grid-10x10-2p-4f-v2,
Foraging-grid-12x12-2p-4f-v2, Foraging-grid-14x14-2p-5f-v2, Foraging-grid-16x16-2p-6f-v2 and
Foraging-grid-18x18-2p-8f-v2; the value columns were not recoverable.]</p>
      <p>As it is not easily possible to test the trained neural network model on other environments,
we decided to use partial observability for other grid sizes. Using partial observability makes
it possible to run the neural network also on different grid sizes than the one it was trained
on. We also evaluate the programmatic policies on different numbers of players. However,
we cannot compare them to the trained neural network policies, since they cannot be easily
adapted to different numbers of players.</p>
      <p>Table 1 shows that we only lose performance on the smallest environment, which
was used to generate the data from the oracle. The smallest environment is called
Foraging-grid-5x5-2p-2f-v2, with a grid size of 5x5; the number before p is the number of players
and the number before f is the number of fruits. On all larger environments our approach
outperforms the neural network policy by far.</p>
      <p>Table 2 shows the results when using the created LPP policies for more players than were used
during training. It is clearly visible that the programs extrapolate to all other environments.
With three players the performance drops, but with more players the performance increases
again.</p>
      <sec id="sec-4-1">
        <title>4.1. Number of demonstrations</title>
        <p>In this experiment we analyse how important the provided number of demonstrations
is. Therefore, we run the generation of LPP policies with different numbers of demonstrations.
Figure 3 shows how the number of demonstrations affects the performance of the created policies.
It is observable that providing too many demonstrations results in not finding a program at all.
We choose to use 50 demonstrations, as this results in the best performance. In addition, it takes
as few as 15 demonstrations to achieve a performance almost as good as the best. However, we
only evaluated the performance on the Foraging-grid-5x5-2p-2f-v2 environment.</p>
        <p>Table 2: Results on different numbers of players (average reward of the LPP policies; the values
for the 5- and 6-player environments were not recoverable):
Foraging-grid-5x5-3p-2f-v2: 0.59, Foraging-grid-8x8-3p-4f-v2: 0.38, Foraging-grid-10x10-3p-4f-v2: 0.35,
Foraging-grid-12x12-3p-4f-v2: 0.36, Foraging-grid-14x14-3p-5f-v2: 0.26, Foraging-grid-16x16-3p-6f-v2: 0.25,
Foraging-grid-18x18-3p-8f-v2: 0.24, Foraging-grid-5x5-4p-2f-v2: 0.76, Foraging-grid-8x8-4p-4f-v2: 0.51,
Foraging-grid-10x10-4p-4f-v2: 0.48, Foraging-grid-12x12-4p-4f-v2: 0.46, Foraging-grid-14x14-4p-5f-v2: 0.48,
Foraging-grid-16x16-4p-6f-v2: 0.36, Foraging-grid-18x18-4p-8f-v2: 0.39.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Using the same program for multiple agents</title>
        <p>We also evaluated how the same LPP policy behaves if used for multiple agents. Figure 4 shows
that the overall performance decreased on all environments. We think that the reason for this
behaviour is that the agents more often get stuck, as they want to move to the same position or
one agent blocks the other agent's way to the fruit. We conclude that the program
was not able to learn how to navigate around the other agent. This also explains the still
existing gap between the average reward of the oracle and that of the created programs.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Interpretation of the decision making process</title>
        <p>((action_is_load(cell_is_value( lbc.AGENT.value , action, state, pos) , action, state, pos)
and not (action_is_east(fruit_is_east(state, pos) , action, state, pos))
and not (action_is_west(fruit_is_west(state, pos) , action, state, pos))
and not (action_is_south(fruit_is_south(state, pos) , action, state, pos))
and not (action_is_load(fruit_is_west(state, pos) , action, state, pos))
and not (action_is_load(fruit_is_east(state, pos) , action, state, pos))))
Figure 5: This code block shows the part of the formula explaining why the action to pick up a fruit was selected.
We display only the part that was responsible for selecting the ‘Pick up a fruit‘ action. The other
disjunctive parts of the logical formula were omitted for clarity, as the program would otherwise get too large.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>This work shows that it is possible to make the decisions of Multi-Agent Systems interpretable
by adapting the Logical Program Policies method. We show that the programs generated on the
smallest environment can be used without retraining on other environments or numbers of players, and are
also interpretable for any person if converted into a flowchart. This is only possible by
using a strong prior in the DSL, as otherwise the programs are neither interpretable nor do they
generalise to other instances of the same environment.</p>
      <p>
        Nevertheless, there is still a moderate performance gap between the neural network models
and the created programs. One reason for this is that distilling a policy into a program always
results in loss of information [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Additionally, at the moment it is not possible for the agents to
communicate with each other. Incorporating communication into the DSL could be a promising
way to increase the performance of the programmatic policies, but only if the program also
learns how to navigate around other agents.
      </p>
      <p>
        An interesting way that agents could learn this behaviour is through the use of library
learning as proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Right now, the DSL does not adequately represent how to navigate
when there are objects between the agent and the selected target. Learning a library of functions
would make it possible for the program to learn to navigate around another agent when such situations are seen several
times in the oracle's demonstrations. Another advantage is that one would not need to incorporate background
knowledge of the fruit location into the DSL. The method should learn these functions itself by
updating the DSL when the same structures are found in multiple programs. A disadvantage
of learning a library is that many more demonstrations and much more computational power would be
required to learn these functions.
      </p>
      <p>Appendix. Table 3: The probabilistic context-free grammar used to generate the feature
detector programs (layout partially recovered).</p>
      <p>Conditions (probability 0.5 each; shifted(O, B) shifts the agent's position and then checks a condition):
C → shifted(O, B)
C → B</p>
      <p>Fruit perception primitives:
F → fruit_is_east()
F → fruit_is_south()
F → fruit_is_west()
F → fruit_is_north()
F → fruit_is_pickable()</p>
      <p>Offsets and values:
O → (P, P)
O → (N, N)
O → (P, N)
O → (N, P)
P → 0, ..., 4
N → 0, -1, ..., -4
V → agent
V → fruit</p>
      <p>Filters for correct actions (probability 0.2 each): checks whether the action goes west, east,
south or north, or picks up the fruit.</p>
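      <p>For completeness, a small sampler over a grammar of this shape might look as follows; the production weights and the mapping of B to the fruit primitives are assumptions made for illustration, since the exact layout of Table 3 could not be fully recovered.</p>
      <p>import random

# Hypothetical weighted grammar mirroring the productions listed above.
GRAMMAR = {
    "C": [(0.5, ["shifted(", "O", ", ", "B", ")"]), (0.5, ["B"])],
    "B": [(0.2, ["fruit_is_east()"]), (0.2, ["fruit_is_south()"]),
          (0.2, ["fruit_is_west()"]), (0.2, ["fruit_is_north()"]),
          (0.2, ["fruit_is_pickable()"])],
    "O": [(0.25, ["(", "P", ", ", "P", ")"]), (0.25, ["(", "N", ", ", "N", ")"]),
          (0.25, ["(", "P", ", ", "N", ")"]), (0.25, ["(", "N", ", ", "P", ")"])],
    "P": [(0.2, [str(i)]) for i in range(5)],
    "N": [(0.2, [str(-i)]) for i in range(5)],
}

def sample(symbol="C", rng=random):
    """Expand a non-terminal by sampling its productions by weight; the log
    probability of the chosen expansions would give the prior of equation 4."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token
    weights, bodies = zip(*GRAMMAR[symbol])
    body = rng.choices(bodies, weights=weights, k=1)[0]
    return "".join(sample(token, rng) for token in body)

print(sample())  # e.g. "shifted((2, -1), fruit_is_east())"</p>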
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <article-title>Programmatically interpretable reinforcement learning</article-title>
          , in: J.
          <string-name>
            <surname>Dy</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Krause (Eds.),
          <source>Proceedings of the 35th International Conference on Machine Learning</source>
          , volume
          <volume>80</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5045</fpage>
          -
          <lpage>5054</lpage>
          . URL: https://proceedings.mlr.press/v80/verma18a.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <article-title>Imitation-projected programmatic reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 33rd International Conference on Neural Information Processing Systems</source>
          , Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Inala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bastani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tavares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <article-title>Synthesizing programmatic policies that inductively generalize</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=
          <fpage>S1l8oANFDH</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Lew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Kaelbling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Few-shot bayesian imitation learning with logical program policies</article-title>
          ,
          <source>in: The Thirty-Fourth AAAI Conference on Artificial Intelligence</source>
          ,
          <source>AAAI</source>
          <year>2020</year>
          , New York, NY, USA, February 7-
          <issue>12</issue>
          ,
          <year>2020</year>
          , AAAI Press,
          <year>2020</year>
          , pp.
          <fpage>10251</fpage>
          -
          <lpage>10258</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/6587.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <article-title>A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems</article-title>
          ,
          <source>in: Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS '13</source>
          ,
          <string-name>
            <surname>Richland</surname>
            ,
            <given-names>SC</given-names>
          </string-name>
          ,
          <year>2013</year>
          , p.
          <fpage>1155</fpage>
          -
          <lpage>1156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Waldinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Prow: A step toward automatic program writing</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Manna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Waldinger</surname>
          </string-name>
          ,
          <article-title>Knowledge and reasoning in program synthesis</article-title>
          ,
          <source>in: Proceedings of the 4th International Joint Conference on Artificial Intelligence -</source>
          Volume
          <volume>1</volume>
          , IJCAI'
          <fpage>75</fpage>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>1975</year>
          , p.
          <fpage>288</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gaunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brockschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowozin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tarlow</surname>
          </string-name>
          ,
          <article-title>Deepcoder: Learning to write programs</article-title>
          ,
          <source>in: Proceedings International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2017</year>
          . URL: https://openreview.net/pdf?id=rkE3y85ee.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <article-title>Learning to infer program sketches</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4861</fpage>
          -
          <lpage>4870</lpage>
          . URL: https://proceedings.mlr.press/v97/nye19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ritchie</surname>
          </string-name>
          , A. Thomas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hanrahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <article-title>Neurally-guided procedural models: Amortized inference for procedural graphics programs using neural networks</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Neural Information Processing Systems</source>
          , NIPS'16, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2016</year>
          , p.
          <fpage>622</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Anh</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Learning to learn generative programs with memoised wake-sleep</article-title>
          , in: J.
          <string-name>
            <surname>Peters</surname>
          </string-name>
          , D. Sontag (Eds.),
          <source>Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)</source>
          , volume
          <volume>124</volume>
          ,
          <string-name>
            <surname>PMLR</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>1278</fpage>
          -
          <lpage>1287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Nye</surname>
          </string-name>
          , M. Sablé-Meyer, L. Morales,
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cary</surname>
          </string-name>
          , A. SolarLezama, J. B.
          <string-name>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Dreamcoder: bootstrapping inductive program synthesis with wake-sleep library learning</article-title>
          ,
          <source>in: PLDI '21: 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation</source>
          , Virtual Event, Canada, June 20- 25,
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>835</fpage>
          -
          <lpage>850</lpage>
          . URL: https://doi.org/10.1145/3453483.3454080. doi:
          <volume>10</volume>
          .1145/ 3453483.3454080.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bastani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <article-title>Verifiable reinforcement learning via policy extraction</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Neural Information Processing Systems</source>
          , NIPS'18, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2018</year>
          , p.
          <fpage>2499</fpage>
          -
          <lpage>2509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>Learning to synthesize programs as interpretable and generalizable policies</article-title>
          , in: A.
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Dauphin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/ forum?id=wP9twkexC3V.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Programmatic reinforcement learning without oracles</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id= 6Tk2noBdvxt.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          <article-title>, Multi-agent reinforcement learning: Independent vs. cooperative agents</article-title>
          ,
          <source>in: In Proceedings of the Tenth International Conference on Machine Learning</source>
          , Morgan Kaufmann,
          <year>1993</year>
          , pp.
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papoudakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Christianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks</article-title>
          ,
          <source>in: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Inala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Paulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bastani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Rinard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic transformers for multi-agent communication</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          ,
          <article-title>NeurIPS 2020</article-title>
          , December 6-
          <issue>12</issue>
          ,
          <year>2020</year>
          , virtual,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Christianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <article-title>Shared experience actor-critic for multi-agent reinforcement learning</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Skirzyński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lieder</surname>
          </string-name>
          ,
          <article-title>Automatic discovery of interpretable planning strategies</article-title>
          ,
          <source>Mach. Learn</source>
          .
          <volume>110</volume>
          (
          <year>2021</year>
          )
          <fpage>2641</fpage>
          -
          <lpage>2683</lpage>
          . URL: https://doi.org/10.1007/s10994-021-05963-2.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>C. J. C. H. Watkins</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dayan</surname>
          </string-name>
          ,
          <article-title>Q-learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <year>1992</year>
          )
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          . URL: https://doi.org/10.1007/BF00992698. doi:
          <volume>10</volume>
          .1007/BF00992698.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>