=Paper=
{{Paper
|id=Vol-2282/MARLO2
|storemode=property
|title=Extending World Models for Multi-Agent Reinforcement Learning in MALMÖ
|pdfUrl=https://ceur-ws.org/Vol-2282/MARLO_110.pdf
|volume=Vol-2282
|authors=Valliappa Chockalingam,Tegg Taekyong Sung,Feryal Behbahani,Rishab Gargeya,Amlesh Sivanantham,Aleksandra Malysheva
|dblpUrl=https://dblp.org/rec/conf/aiide/ChockalingamSBG18
}}
==Extending World Models for Multi-Agent Reinforcement Learning in MALMÖ==
Valliappa Chockalingam (University of Alberta, valliapp@ualberta.ca), Tegg Taekyong Sung (Kwangwoon University, tegg89@gmail.com), Feryal Behbahani (Imperial College London, feryal.mp@gmail.com), Rishab Gargeya (Stanford University, rgargeya@stanford.edu), Amlesh Sivanantham (University of Southern California, sivanant@usc.edu), Aleksandra Malysheva (JetBrains Research, malyshevasasha777@gmail.com)

===Abstract===

Recent work in (deep) reinforcement learning has increasingly looked to develop better agents for multi-agent/multi-task scenarios, as many successes have already been seen in the usual single-task, single-agent setting. In this paper, we propose a solution for a recently released benchmark which tests agents in such scenarios, namely the MARLÖ competition. Following the 2018 Jeju Deep Learning Camp, we consider a combined approach based on various ideas generated during the camp as well as suggestions for building agents from recent research trends, similar to the methodology taken in developing Rainbow (Hessel et al. 2017). These choices include the following: using model-based agents, which allows for planning/simulation and reduces computation costs when learning controllers; applying distributional reinforcement learning to reduce losses incurred from using mean estimators; considering curriculum learning for task selection when tasks differ in difficulty; and using graph neural networks as an approach to communication between agents. We motivate each of these approaches and discuss a combined approach that we believe will fare well in the competition.

===Introduction===

Reinforcement Learning (RL) is a subfield of machine learning that studies how agents can learn to maximize total reward while acting in environments that provide minimal supervision through reward signals (Sutton, Barto, and others 1998). Generally, the formulation of RL assumes that the environment is modeled as a Markov Decision Process (MDP), given by the 5-tuple ⟨S, A, R, T, γ⟩, where S denotes the state space, A the action space, R the reward function, T the transition function, and γ the discount factor which controls how much agents favor short-term vs. long-term rewards. The objective is usually to maximize the expected return E[G], where G = Σ_{t=1}^{∞} γ^{t−1} r_t and r_t is the reward obtained at time t.

Often the MDP, and the state space in particular, is very large or complex, and thus using traditional table-lookup methods that store information per state can be very expensive or lead to poor generalization performance. This problem is generally tackled with function approximation, where states are first mapped onto a lower-dimensional space and RL algorithms then work with the extracted features. Deep Learning is a method to learn general functions using artificial neural networks and hence is a way towards non-linear function approximation (LeCun, Bengio, and Hinton 2015; Arulkumaran et al. 2017). Thus, combining Reinforcement Learning and Deep Learning (Deep RL) has become popular in recent years for training agents in complex environments. Minecraft, one such environment with a rich and highly complex state space, provides a good testbed to evaluate how various Deep RL agents compare. Moreover, since Minecraft allows for multiple tasks and agents, multi-task multi-agent scenarios can be simulated with ease. The recently proposed MARLÖ competition focuses on such scenarios, and in this paper we propose certain agents as good candidates to solve such scenarios.
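For readers who want the return definition pinned down operationally, here is a tiny Python sketch (ours, not from the paper; the reward sequence and discount factor are arbitrary example values) that computes the discounted return of a finite trajectory:

```python
# Minimal illustration of the discounted return G = sum_t gamma^(t-1) * r_t
# for a finite trajectory. The rewards and gamma below are arbitrary example values.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for t, r in enumerate(rewards):  # t = 0 corresponds to r_1 in the paper's notation
        g += (gamma ** t) * r
    return g

if __name__ == "__main__":
    example_rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
    print(discounted_return(example_rewards))  # 0.99^2 * 1 + 0.99^4 * 5 ~= 5.78
```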
===Preliminaries===

In this section, we outline some necessary background information that will be useful in explaining our proposed method and the motivation for various decisions. We first discuss a particular model-based algorithm that we base our agents on. Then, we consider the general RL landscape and the recent interest in distributional RL. After that discussion, we talk about some important things to note about the MARLÖ competition in particular. Finally, we discuss NerveNet and Graph Neural Networks, which provide an approach to think about when considering multi-agent communication.

When designing reinforcement learning algorithms, there is a choice between model-free and model-based algorithms. Model-free approaches only focus on the problem of how to act so as to maximize reward. On the other hand, model-based approaches also try to understand how the "world" works by learning the underlying MDP of the environment. A recent such approach that has shown promise, and which we propose to use, is World Models.

====World Models====

World Models (Ha and Schmidhuber 2018) looked at learning a compressed spatial and temporal model of an environment in an unsupervised manner and using the learned features to train a very simple and compact policy to solve the task at hand. The authors were able to achieve this using a multi-step training procedure:

1. Collect rollouts from the environment (using a random or other simple policy π), i.e., store episode trajectories of the form s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T, s_T, where a_t ∼ π(s_t).

2. Train a variational autoencoder, V, to learn a latent vector representation z ∈ R^m given the higher-dimensional states collected during step 1.

3. Train a Mixture Density Network-Recurrent Neural Network, M, to model P(z_{t+1} | z_t, h_t, a_t), where h_t is the hidden state of the LSTM used in the MDN-RNN.

4. Train a controller C which learns what action a_t to take using z_t and h_t, like so: a_t = W_c [z_t, h_t] + b_c.

While V and M can simply be trained through supervised learning, C can be trained in various ways. However, since the controller portion of the setup is very small (on the order of a few hundred parameters in the experiments used by the authors), training using an evolutionary technique, CMA-ES (Covariance Matrix Adaptation Evolution Strategy (Hansen and Ostermeier 2001)), is possible.
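To make the controller step concrete, here is a minimal Python sketch (our illustration, not the authors' code; the dimensions and random weights are placeholders, and V and M are assumed to be already trained) of the linear controller a_t = W_c [z_t, h_t] + b_c from step 4:

```python
import numpy as np

# Sketch of the World Models controller (step 4 above): a_t = W_c [z_t, h_t] + b_c.
# Dimensions are arbitrary example values; V and M are assumed to be already trained.
latent_dim, hidden_dim, action_dim = 32, 256, 4

rng = np.random.default_rng(0)
W_c = rng.normal(scale=0.1, size=(action_dim, latent_dim + hidden_dim))
b_c = np.zeros(action_dim)

def controller_act(z_t, h_t):
    """Map the VAE latent z_t and the MDN-RNN hidden state h_t to action scores."""
    x = np.concatenate([z_t, h_t])
    return W_c @ x + b_c  # in a discrete-action setting one could take argmax or a softmax

z_t = rng.normal(size=latent_dim)   # stand-in for V's encoding of the current frame
h_t = rng.normal(size=hidden_dim)   # stand-in for M's LSTM hidden state
action_scores = controller_act(z_t, h_t)
print(np.argmax(action_scores))     # greedy discrete action for this example
```

Since W_c and b_c are the only parameters optimized at this stage, they can be flattened into a single vector and searched over directly with CMA-ES, as in the original World Models setup.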
====RL Algorithms====

When designing RL agents, another common choice to be made is whether the agents should directly learn in policy space (how to act in different states) or learn value functions that define the "goodness" of states or state-action pairs in terms of expected return. In the value-based paradigm, Q-Learning (Watkins and Dayan 1992) and Deep Q-Networks (Mnih et al. 2015) have been popular, where, following the Bellman equations, state-action values Q(s_t, a_t) are updated towards a target that is generally more accurate, r_t + γ max_a Q(s_{t+1}, a). In the policy-based paradigm, policy gradients have been commonly used, with some changes to reduce variance. A usual starting point is the Monte-Carlo policy gradient update, REINFORCE (Williams 1992), which performs gradient ascent on the parameters of the policy using sampled returns: θ ← θ + ∇_θ log π_θ(s_t, a_t) G_t. Finally, actor-critic methods lie in between the two, where an actor learns and looks to improve a policy, and a critic evaluates the policy and, by replacing sampled returns for example, can reduce variance.

Recently, there have been questions about the general approach taken in RL. Namely, the observation is that most prior RL algorithms estimate the average expected return and use these estimates when choosing actions. Distributional RL (Morimura et al. 2010; Bellemare, Dabney, and Munos 2017) has posed an alternative whereby Z(s_t, a_t), the return distribution for state-action pair (s_t, a_t), is updated towards r_t + γ max_a Z(s_{t+1}, a). Action selection can then be done in an expectation sense or using, for example, risk-sensitive approaches where actions with lower variance may be favored.
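To pin down the notation of the two update rules above, the following toy tabular sketch (ours; the state/action sizes and the step size added to the REINFORCE update are assumptions for illustration) shows the Q-learning target and a REINFORCE step for a softmax policy:

```python
import numpy as np

n_states, n_actions = 5, 3          # toy sizes, purely illustrative
alpha, gamma = 0.1, 0.99

# --- Tabular Q-learning update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)) ---
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    target = r + gamma * Q[s_next].max()        # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += alpha * (target - Q[s, a])

# --- REINFORCE step for a softmax policy over a single (toy) state ---
theta = np.zeros(n_actions)                     # policy parameters for that state

def reinforce_update(a, G):
    pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy pi_theta
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                       # d/dtheta log pi_theta(a)
    theta[:] = theta + alpha * grad_log_pi * G  # theta <- theta + alpha * grad * G_t

q_update(s=0, a=1, r=1.0, s_next=2)
reinforce_update(a=1, G=2.5)
print(Q[0], theta)
```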
====MARLÖ Competition====

One major aspect of the competition is that multiple agents will be involved. With teams, cooperative tasks and competitive tasks become important subsets of tasks to consider. With regard to terminology, in fact, the competition specification makes an important distinction between tasks and games by defining a game as a collection of tasks of varying difficulty and level layouts. The competition is structured such that we have access to some publicly available tasks for every game, but some tasks are private and to be used later for evaluation.

====Graph neural networks====

NerveNet (Wang et al. 2018) is a recently proposed architecture for continuous control where the agent was assumed to have multiple physical parts which can be represented as a graph. Graph Neural Networks (GNNs) were used as they possess relational inductive biases and are invariant to node and edge permutations, as entities and their relations are represented as sets (Battaglia et al. 2018; Mitchell 1980).

The updates of GNNs begin with the propagation of messages to neighboring nodes. The feature vector of the current state for each node u at propagation step 0 is denoted h_u^0. After each node passes messages to neighboring nodes, it aggregates the messages it receives from its neighborhood, then updates its feature vector and outputs a vector combining the information from the current and the aggregated messages. A policy network parameterized by the graph can learn what actions to take for each part. This process can be described using the equations below:

$$m^t_{(u,v)} = M_P^v(h^t_u), \quad v \in N_{\mathrm{out}}(u) \tag{1}$$

$$A^t_v = M_A^v\big(\{\, m^t_{(u,v)} \mid u \in N_{\mathrm{in}}(v) \,\}\big) \tag{2}$$

$$h^{t+1}_v = U^v(h^t_v, A^t_v) \tag{3}$$

$$\hat{y} = O\big(\{\, h^T_v \mid v \in G \,\}\big) \tag{4}$$

where N_in(·) and N_out(·) represent the in- and out-neighborhoods respectively, ŷ represents the output of the GNN, and M_P^v, M_A^v, U^v and O may be neural networks, with parameter sharing applied as necessary. For example, a node u can use the same network M_P for message passing to all nodes in its out-neighborhood v ∈ N_out(u), or use separate ones for messages to each of these nodes, M_P^v, hence the additional superscript v.
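The following Python sketch (ours; small random linear maps stand in for the networks M_P, U and O, mean aggregation stands in for M_A, and the toy graph is arbitrary) runs one propagation step of equations (1)–(4) with parameters shared across nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # node feature dimension (arbitrary)
# Directed toy graph over 3 nodes: edge (u, v) means u sends a message to v.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
h = {u: rng.normal(size=d) for u in range(3)}   # h_u^0

# Linear stand-ins for the networks in equations (1)-(4); shared across nodes.
W_msg = rng.normal(scale=0.1, size=(d, d))      # M_P
W_upd = rng.normal(scale=0.1, size=(d, 2 * d))  # U
W_out = rng.normal(scale=0.1, size=(1, d))      # O

def propagate(h):
    # (1) messages m_{(u,v)} = M_P(h_u) along each edge, (2) mean-aggregate per receiver
    agg = {v: np.zeros(d) for v in h}
    counts = {v: 0 for v in h}
    for u, v in edges:
        agg[v] += W_msg @ h[u]
        counts[v] += 1
    A = {v: agg[v] / max(counts[v], 1) for v in h}
    # (3) update h_v^{t+1} = U(h_v^t, A_v^t) from the concatenated inputs
    return {v: np.tanh(W_upd @ np.concatenate([h[v], A[v]])) for v in h}

h1 = propagate(h)
# (4) readout over all final node states
y_hat = sum(W_out @ h1[v] for v in h1)
print(y_hat)
```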
===Methods===

We now consider a few different agents as combinations of some of the ideas explained previously. The motivation for these agents and the design choices therein will be described in the following section.

The simplest agent we propose is a multi-agent variation of World Models. Firstly, we note that we assume V and M are shared. Alternatively, they can be trained separately for each agent. Regardless, in the explanations that follow, we will only discuss training the controllers. Owing to the small number of parameters in the controllers of World Models, we can view them as being "policy embeddings" which dictate how each agent acts in the environment. Assuming that each agent's controller is trained separately, let the parameters of the controllers learned be denoted by C_θ^i for agent i, where C_θ^i corresponds to the controller parameters [W_c, b_c] for agent i's controller C^i. If each agent were to act independently, then it would just forward the latent vector embedding of its current state and the hidden state of the LSTM from the MDN-RNN, say [z_t^i, h_t^i], through its controller C^i to get the action it should take. However, in a multi-agent setting, we are often interested in maximizing total reward for a team of agents (particularly so in co-operative settings). So, we define a meta-controller as a controller which takes in [z_t^1, h_t^1, ..., z_t^n, h_t^n] and outputs [π^1_{a_0}, ..., π^1_{a_m}, ..., π^n_{a_0}, ..., π^n_{a_m}], assuming n agents are "under" the team's control and there are m actions, where π^i_{a_j} is the probability of agent i taking action a_j.

The next agent we consider is one where the controllers are trained separately, but a Graph Neural Network style approach is used, reducing the number of parameters and encouraging communication of shared knowledge. To do this, we first concatenate the hidden state vectors h_t^i for each agent i, say H = [h_t^1, ..., h_t^n]. We then use this matrix to construct a graph G_t. At the next timestep, we can do the same to get a graph G_{t+1}. Now, the graph is meant to represent beliefs about how the world works and how the agents relate to each other (which it can do given the information from each agent's memory module). So, we can expect that, if graphs are learned appropriately, we can infer the graph at time t+1 given the graph and the actions taken by each agent [a_t^1, ..., a_t^n] at time t. Thus, we can use supervised learning to train the graph construction. The actual construction process can be done in many ways, but we propose using cosine distance row- and column-wise to capture how entities in the environment and agents relate to each other. With these graphs serving as a proxy for useful information about the environment and agents, they can serve as useful input to each agent's controller.
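To illustrate one possible instantiation of the cosine-distance construction just described (the text above does not fix an exact procedure, so the details below are our assumptions), the sketch builds a pairwise cosine-similarity adjacency matrix over the stacked agent hidden states H:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, hidden_dim = 4, 16                 # toy sizes
H = rng.normal(size=(n_agents, hidden_dim))  # row i = h_t^i from agent i's MDN-RNN

def cosine_adjacency(H, eps=1e-8):
    """Pairwise cosine similarity between agent hidden states,
    used here as the edge weights of the graph G_t."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    H_unit = H / np.maximum(norms, eps)
    A = H_unit @ H_unit.T                    # A[i, j] = cos(h^i, h^j)
    np.fill_diagonal(A, 0.0)                 # no self-edges in this sketch
    return A

G_t = cosine_adjacency(H)
print(np.round(G_t, 2))
```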
Past work has looked at action-conditional next-state prediction, but has often made the implicit assumption that the environment is deterministic. Since many environments are stochastic, and World Models itself includes a parameter τ which controls the environment model's stochasticity, we can look to previous work in distributional RL to consider a better approach to generating and using next-state predictions. Notably, when there are many possible next states, we can integrate over them by their probabilities to get an "averaged" next-state vector. The probability distribution over next states, P(z_{t+1} | a_t, z_t, h_t), is modeled by M in the World Models approach. So, instead of sampling from the mixture density model and using a single sample, we can easily sample repeatedly, or even over some uniform interval of τ, to get multiple possible next states with various probabilities of occurring. Averaging them could give us a more useful z, and this is an approach we wish to study. Using all of the sampled z's separately, as in (Buesing et al. 2018), is also possible.

While the evolutionary approach frees us from the problem of needing to define how to use the rewards, we can consider RL approaches as well. Value-based approaches help with long-term planning, and hence we propose using actor-critic methods where the critic can use distributional RL. Given multiple possible next states for the current state (as explained in the previous paragraph), we can even train a more accurate value estimation network for the distribution of expected returns Z(s, a).

Lastly, in a multi-task setting, there may be tasks of varying types and difficulties. Curriculum learning is a common approach, as the agent can slowly learn tasks of increasing difficulty, without which some tasks may be very hard to solve. We propose having a teacher suggest tasks using a bandit algorithm, EXP3 (Exponential-weight algorithm for Exploration and Exploitation), that keeps track of how each arm (or task, in this case) fares. The general pattern we wish to see here is that, as agents show signs of improving in one task (say, in an average-reward sense), we select that task more until it is essentially solved, and then we move on to another task that is next up in the hierarchy of tasks based on difficulty. Since MARLÖ has the concept of games and tasks, we can use a hierarchical approach whereby we first choose the game and then the task to train on. Alternatively, the teacher can select from all the available tasks directly.
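As a sketch of what such a teacher could look like, here is a minimal EXP3 task selector (ours; feeding it recent learning progress as the bandit reward, scaled to [0, 1], is an assumption rather than something specified above):

```python
import numpy as np

class EXP3Teacher:
    """Minimal EXP3 bandit over tasks; rewards are assumed to be scaled to [0, 1]."""

    def __init__(self, n_tasks, gamma=0.1, seed=0):
        self.n = n_tasks
        self.gamma = gamma
        self.weights = np.ones(n_tasks)
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / self.n

    def select_task(self):
        return self.rng.choice(self.n, p=self.probabilities())

    def update(self, task, reward):
        # Importance-weighted reward estimate, then exponential weight update.
        x_hat = reward / self.probabilities()[task]
        self.weights[task] *= np.exp(self.gamma * x_hat / self.n)

teacher = EXP3Teacher(n_tasks=3)
for _ in range(5):
    task = teacher.select_task()
    progress = np.random.default_rng().uniform()  # stand-in for learning progress on that task
    teacher.update(task, progress)
print(teacher.probabilities())
```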
===Motivation===

In this section, we explain the reasoning behind some of our design decisions and choices.

While model-free approaches such as Deep Q-Networks (Mnih et al. 2015) and A3C (Mnih et al. 2016) and their later variants have been popular and shown promising results in the past, a primary reason for their success has been that they are easier to train. Model-based approaches, on the other hand, are generally more computationally intensive, as there are additional steps to model the MDP. However, there are many benefits to learning models. Importantly, running experiments on even simulated environments can be very expensive as we move towards more complex environments. Minecraft is a modern game, built many years after the Atari 2600 games, and hence, as one would expect, is a much more computationally intensive environment. Learning models allows agents to learn without needing to interact with the environment at hand. We can even combine learning from the environment with planning using the learned model, following the Dyna architecture, i.e., learning from "dreams."

Using models as a proxy for complex environments is thus an obvious advantage, but there are still many other interesting things to note about models and model-based learning in general. One primary argument against model-based learning is that getting good models is very difficult and poor models can hinder learning. However, in both World Models and I2A, the authors note that poor models can still in fact help learning. While no solid reason for this is known, we hypothesize that this may be due to some state-aggregation-like effect, or that the approximated MDP has some strong correlation to the true underlying MDP. In other words, the model is at least able to learn some useful clustering of the states, or to learn reward and transition functions that are similar to the true reward and transition functions but differ in small ways.

Opponent modeling is another aspect that past research has considered. However, with limited access to opponent states and policies, as well as partial observability, it can be a very hard problem. Models allow opponents to simply be modeled as part of the environment and thus sidestep the need to incorporate opponent modeling in agents. However, there is still a question of how to set up opponent policies while learning the model. If setting opponent policies randomly does not work, we can use self-play with some noisy version of the learned policies.

World Models in particular is appealing for a few concrete reasons. Firstly, Minecraft is partially observable, so the hidden state h of the LSTM in the MDN-RNN module could be very useful in making the POMDP more like an MDP. The hidden state encodes information about past observations and actions taken. Hence, h combined with the current state encoding z represents a new state that encodes the history as well.

Secondly, as the tasks in the MARLÖ competition involve multi-agent multi-task scenarios, we need to think about how to train the agents. With World Models, we can share V, the state feature-extraction module, across the agents and games. The memory module M can also be shared, with some consideration for how to train it given that agents may have differing roles. The main point, however, is that World Models is structured in such a way that we can share many modules between agents, reducing computational costs and introducing some useful priors. While such sharing of state representations might be possible in model-free algorithms, the learned representations can be quite different given how the architectures are trained (supervised learning vs. RL, and hence with different loss functions).

As a related note, even the structure of the controller, C, is amenable to the multi-agent setting. Since most of the complexity, in terms of the number of parameters, is in the V and M models, C is very small. With multiple agents, we can train a meta-controller with far fewer parameters than if we were using separate model-free architectures for each agent.

Finally, we discuss some motivations behind taking a certain approach to training the controller, C, using RL. Naively using CMA-ES in a multi-task cooperative setting with each controller being trained separately would lead to an exponential blow-up whereby we need to perform a combinatorial search to find good parameters for each agent in order to maximize "global" reward. Also, notably, while evolutionary algorithms and the controller architecture are suited to policy search, complex tasks often require long-term planning and hence value functions are useful. This is the same dichotomy seen in continuous control tasks, where policy-based methods generally perform much better, while in games where some amount of planning is necessary, like Atari 2600 games or Go, value-based methods have generally done much better. Hence, we believe we should use parameter sharing or meta-controllers (to deal with the exponential blow-up) and actor-critic methods, given how Minecraft combines the need for impulsive quick actions with long-term planning.

Recently, we have also seen that deep RL seems to benefit from learning distributions even when acting in an expected sense. While it is still unclear why, one reasonable argument is that there seems to be some bias induced in the deep RL setting when collapsing distributions to their means. And, indeed, recent work has shown that distributional RL does not seem to be of much help in the linear function approximation case (Lyle and Bellemare 2018). Given that we pushed for using actor-critic above, a natural choice for the critic is then to see how predicting return distributions could affect performance, as done in Rainbow (Hessel et al. 2017). Additionally, given the actor-critic approach and the fact that we have access to a simulator, counterfactual policy gradients (Foerster et al. 2017) can also be used.

===Related work===

There have been multiple approaches recently in the area of multi-task multi-agent scenarios, particularly on the Minecraft platform Malmö, such as (Xiong et al. 2018). They propose the novel model HogRider to solve the Pig Chase task. The objective of the task is to chase randomly spawned pigs in a 9 × 9 grid with collaboration. Since the pigs move after the task is initialized, the environment is non-stationary, and hence the task can be viewed as a multi-agent team-based scenario.

The multi-agent problem of learning from a single scalar reward signal has generally been quite difficult to approach but has recently been looked at. Value Decomposition Networks (Sunehag et al. 2017) look to decompose the team value function into agent value functions and find that the method fares better when weight sharing and information channels are used. Again, this is in fact a strong argument for using a model-based approach whereby many of the feature representations are shared across agents.

A recent retro game contest held by OpenAI has tackled various scenarios similar to the MARLÖ contest (Gray and Brown 2018) and has further brought attention to the need for adaptive exploration and meta-learning in RL.

===Discussion and Future Work===

We believe this study is particularly useful and well motivated because most of the challenges faced by agents evaluated in multi-task/multi-agent settings have to do with understanding the interactions between agents and their respective policies. Our approach attempts to alleviate these difficulties with a Graph Neural Network and policy embedding approach. We can further address issues of uncertainty and model interpretability using distributional RL. Finally, we utilize curriculum learning to address agents' need to balance performance on multiple tasks of varying difficulties, intelligently deciding which tasks to train on and alleviating issues with catastrophic forgetting.

The immediate next steps are to continue implementing the above ideas into an agent and to evaluate its performance in the MARLÖ competition compared to other common benchmark agents. We also wish to look into how to construct simulated environments with various types of opponents and to run ablation tests to see which of the ideas detailed above lead to the most improvement. We will keep our progress and thoughts up to date at this webpage: https://sites.google.com/view/rainbowmarlo/.
===References===

* Arulkumaran, K.; Deisenroth, M. P.; Brundage, M.; and Bharath, A. A. 2017. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.
* Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
* Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.
* Buesing, L.; Weber, T.; Racaniere, S.; Eslami, S.; Rezende, D.; Reichert, D. P.; Viola, F.; Besse, F.; Gregor, K.; Hassabis, D.; et al. 2018. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006.
* Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. ArXiv e-prints.
* Gray, J., and Brown, T. 2018. Retro contest.
* Ha, D., and Schmidhuber, J. 2018. World models. arXiv preprint arXiv:1803.10122.
* Hansen, N., and Ostermeier, A. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9(2):159–195.
* Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2017. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298.
* LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.
* Lyle, C., and Bellemare, M. G. 2018. As expected? An analysis of distributional reinforcement learning.
* Mitchell, T. M. 1980. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ., New Jersey.
* Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
* Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
* Morimura, T.; Sugiyama, M.; Kashima, H.; Hachiya, H.; and Tanaka, T. 2010. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 799–806.
* Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W. M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J. Z.; Tuyls, K.; and Graepel, T. 2017. Value-decomposition networks for cooperative multi-agent learning. ArXiv e-prints.
* Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement learning: An introduction. MIT Press.
* Wang, T.; Liao, R.; Ba, J.; and Fidler, S. 2018. NerveNet: Learning structured policy with graph neural networks.
* Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279–292.
* Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
* Xiong, Y.; Chen, H.; Zhao, M.; and An, B. 2018. HogRider: Champion agent of Microsoft Malmo collaborative AI challenge.