=Paper=
{{Paper
|id=Vol-2282/MARLO2
|storemode=property
|title=Extending World Models for Multi-Agent Reinforcement Learning in MALMÖ
|pdfUrl=https://ceur-ws.org/Vol-2282/MARLO_110.pdf
|volume=Vol-2282
|authors=Valliappa Chockalingam,Tegg Taekyong Sung,Feryal Behbahani,Rishab Gargeya,Amlesh Sivanantham,Aleksandra Malysheva
|dblpUrl=https://dblp.org/rec/conf/aiide/ChockalingamSBG18
}}
==Extending World Models for Multi-Agent Reinforcement Learning in MALMÖ==
Valliappa Chockalingam (University of Alberta, valliapp@ualberta.ca), Tegg Taekyong Sung (Kwangwoon University, tegg89@gmail.com), Feryal Behbahani (Imperial College London, feryal.mp@gmail.com), Rishab Gargeya (Stanford University, rgargeya@stanford.edu), Amlesh Sivanantham (University of Southern California, sivanant@usc.edu), Aleksandra Malysheva (JetBrains Research, malyshevasasha777@gmail.com)

===Abstract===

Recent work in (deep) reinforcement learning has increasingly looked to develop better agents for multi-agent/multi-task scenarios, as many successes have already been seen in the usual single-task, single-agent setting. In this paper, we propose a solution for a recently released benchmark which tests agents in such scenarios, namely the MARLÖ competition. Following the 2018 Jeju Deep Learning Camp, we consider a combined approach based on various ideas generated during the camp as well as suggestions for building agents from recent research trends, similar to the methodology taken in developing Rainbow (Hessel et al. 2017). These choices include the following: using model-based agents, which allows for planning/simulation and reduces computation costs when learning controllers; applying distributional reinforcement learning to reduce losses incurred from using mean estimators; considering curriculum learning for task selection when tasks differ in difficulty; and using graph neural networks as an approach to communication between agents. We motivate each of these approaches and discuss a combined approach that we believe will fare well in the competition.

===Introduction===

Reinforcement Learning (RL) is a subfield of machine learning that studies how agents can learn to maximize total reward while acting in environments that provide minimal supervision through reward signals (Sutton, Barto, and others 1998). Generally, the formulation of RL assumes that the environment is modeled as a Markov Decision Process (MDP), given by the 5-tuple ⟨S, A, R, T, γ⟩, where S denotes the state space, A the action space, R the reward function, T the transition function, and γ the discount factor which controls how much agents favor short-term vs. long-term rewards. The objective is usually to maximize the expected return E[G], where G = Σ_{t=1}^{∞} γ^{t−1} r_t and r_t is the reward obtained at time t.

Often the MDP, and the state space in particular, is very large or complex, and thus using traditional table-lookup methods that store information per state can be very expensive or lead to poor generalization performance. This problem is generally tackled with function approximation, where states are first mapped onto a lower-dimensional space and RL algorithms then work with the extracted features. Deep Learning is a method to learn general functions using artificial neural networks and hence is a way towards non-linear function approximation (LeCun, Bengio, and Hinton 2015; Arulkumaran et al. 2017). Thus, combining Reinforcement Learning and Deep Learning (Deep RL) has become popular in recent years for training agents in complex environments. Minecraft, one such environment with a rich and highly complex state space, provides a good testbed to evaluate how various Deep RL agents compare. Moreover, since Minecraft allows for multiple tasks and agents, multi-task multi-agent scenarios can be simulated with ease. The recently proposed MARLÖ competition focuses on such scenarios, and in this paper we propose certain agents as good candidates to solve such scenarios.
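For readers who want the return definition pinned down operationally, here is a tiny Python sketch (ours, not from the paper; the reward sequence and discount factor are arbitrary example values) that computes the discounted return of a finite trajectory:

```python
# Minimal illustration of the discounted return G = sum_t gamma^(t-1) * r_t
# for a finite trajectory. The rewards and gamma below are arbitrary example values.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for t, r in enumerate(rewards):  # t = 0 corresponds to r_1 in the paper's notation
        g += (gamma ** t) * r
    return g

if __name__ == "__main__":
    example_rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
    print(discounted_return(example_rewards))  # 0.99^2 * 1 + 0.99^4 * 5 ~= 5.78
```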
===Preliminaries===

In this section, we outline some necessary background information that will be useful in explaining our proposed method and the motivation for various decisions. We first discuss a particular model-based algorithm that we base our agents on. Then, we consider the general RL landscape and the recent interest in distributional RL. After that discussion, we talk about some important things to note about the MARLÖ competition in particular. Finally, we discuss NerveNet and Graph Neural Networks, which provide an approach to think about when considering multi-agent communication.

When designing reinforcement learning algorithms, there is a choice between model-free and model-based algorithms. Model-free approaches only focus on the problem of how to act so as to maximize reward. On the other hand, model-based approaches also try to understand how the "world" works by learning the underlying MDP of the environment. A recent such approach that has shown promise, and which we propose to use, is World Models.

====World Models====

World Models (Ha and Schmidhuber 2018) looked at learning a compressed spatial and temporal model of an environment in an unsupervised manner and using the learned features to train a very simple and compact policy to solve the task at hand. The authors were able to achieve this using a multi-step training procedure:

1. Collect rollouts from the environment (using a random or other simple policy π), i.e., store episode trajectories of the form s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T, s_T, where a_t ∼ π(s_t).

2. Train a variational autoencoder, V, to learn a latent vector representation z ∈ R^m given the higher-dimensional states collected during step 1.

3. Train a Mixture Density Network-Recurrent Neural Network, M, to model P(z_{t+1} | z_t, h_t, a_t), where h_t is the hidden state of the LSTM used in the MDN-RNN.

4. Train a controller C which learns what action a_t to take using z_t and h_t, like so: a_t = W_c [z_t, h_t] + b_c.

While V and M can simply be trained through supervised learning, C can be trained in various ways. However, since the controller portion of the setup is very small (on the order of a few hundred parameters in the experiments used by the authors), training using an evolutionary technique, CMA-ES (Covariance Matrix Adaptation Evolution Strategy (Hansen and Ostermeier 2001)), is possible.
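To make the controller step concrete, here is a minimal Python sketch (our illustration, not the authors' code; the dimensions and random weights are placeholders, and V and M are assumed to be already trained) of the linear controller a_t = W_c [z_t, h_t] + b_c from step 4:

```python
import numpy as np

# Sketch of the World Models controller (step 4 above): a_t = W_c [z_t, h_t] + b_c.
# Dimensions are arbitrary example values; V and M are assumed to be already trained.
latent_dim, hidden_dim, action_dim = 32, 256, 4

rng = np.random.default_rng(0)
W_c = rng.normal(scale=0.1, size=(action_dim, latent_dim + hidden_dim))
b_c = np.zeros(action_dim)

def controller_act(z_t, h_t):
    """Map the VAE latent z_t and the MDN-RNN hidden state h_t to action scores."""
    x = np.concatenate([z_t, h_t])
    return W_c @ x + b_c  # in a discrete-action setting one could take argmax or a softmax

z_t = rng.normal(size=latent_dim)   # stand-in for V's encoding of the current frame
h_t = rng.normal(size=hidden_dim)   # stand-in for M's LSTM hidden state
action_scores = controller_act(z_t, h_t)
print(np.argmax(action_scores))     # greedy discrete action for this example
```

Since W_c and b_c are the only parameters optimized at this stage, they can be flattened into a single vector and searched over directly with CMA-ES, as in the original World Models setup.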
====RL Algorithms====

When designing RL agents, another common choice to be made is whether the agents should directly learn in policy space (how to act in different states) or learn value functions that define the "goodness" of states or state-action pairs in terms of expected return. In the value-based paradigm, Q-Learning (Watkins and Dayan 1992) and Deep Q-Networks (Mnih et al. 2015) have been popular, where, following the Bellman equations, state-action values Q(s_t, a_t) are updated towards a target that is generally more accurate, r_t + γ max_a Q(s_{t+1}, a). In the policy-based paradigm, policy gradients have been commonly used, with some changes to reduce variance. A usual starting point is the Monte-Carlo policy gradient update, REINFORCE (Williams 1992), which performs gradient ascent on the parameters of the policy using sampled returns: θ ← θ + ∇_θ log π_θ(s_t, a_t) G_t. Finally, actor-critic methods lie in between the two, where an actor learns and looks to improve a policy, and a critic evaluates the policy and, by replacing sampled returns for example, can reduce variance.

Recently, there have been questions about the general approach taken in RL. Namely, the observation is that most prior RL algorithms estimate the average expected return and use these estimates when choosing actions. Distributional RL (Morimura et al. 2010; Bellemare, Dabney, and Munos 2017) has posed an alternative whereby Z(s_t, a_t), the return distribution for state-action pair (s_t, a_t), is updated towards r_t + γ max_a Z(s_{t+1}, a). Action selection can then be done in an expectation sense or using, for example, risk-sensitive approaches where actions with lower variance may be favored.
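To pin down the notation of the two update rules above, the following toy tabular sketch (ours; the state/action sizes and the step size added to the REINFORCE update are assumptions for illustration) shows the Q-learning target and a REINFORCE step for a softmax policy:

```python
import numpy as np

n_states, n_actions = 5, 3          # toy sizes, purely illustrative
alpha, gamma = 0.1, 0.99

# --- Tabular Q-learning update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)) ---
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    target = r + gamma * Q[s_next].max()        # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += alpha * (target - Q[s, a])

# --- REINFORCE step for a softmax policy over a single (toy) state ---
theta = np.zeros(n_actions)                     # policy parameters for that state

def reinforce_update(a, G):
    pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy pi_theta
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                       # d/dtheta log pi_theta(a)
    theta[:] = theta + alpha * grad_log_pi * G  # theta <- theta + alpha * grad * G_t

q_update(s=0, a=1, r=1.0, s_next=2)
reinforce_update(a=1, G=2.5)
print(Q[0], theta)
```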
====MARLÖ Competition====

One major aspect of the competition is that multiple agents will be involved. With teams, cooperative tasks and competitive tasks become important subsets of tasks to consider. With regard to terminology, in fact, the competition specification makes an important distinction between tasks and games by defining a game as a collection of tasks of varying difficulty and level layouts. The competition is structured such that we have access to some publicly available tasks for every game, but some tasks are private and to be used later for evaluation.

====Graph neural networks====

NerveNet (Wang et al. 2018) is a recently proposed architecture for continuous control where the agent was assumed to have multiple physical parts which can be represented as a graph. Graph Neural Networks (GNNs) were used as they possess relational inductive biases and are invariant to node and edge permutations, as entities and their relations are represented as sets (Battaglia et al. 2018; Mitchell 1980).

The updates of GNNs begin with the propagation of messages to neighboring nodes. The feature vector of the current state for each node u at propagation step 0 is denoted h_u^0. After each node passes messages to neighboring nodes, it aggregates the messages it receives from its neighborhood, then updates its feature vector and outputs a vector combining the information from the current and the aggregated messages. A policy network parameterized by the graph can learn what actions to take for each part. This process can be described using the equations below:

$$m^t_{(u,v)} = M_P^v(h^t_u), \quad v \in N_{\mathrm{out}}(u) \tag{1}$$

$$A^t_v = M_A^v\big(\{\, m^t_{(u,v)} \mid u \in N_{\mathrm{in}}(v) \,\}\big) \tag{2}$$

$$h^{t+1}_v = U^v(h^t_v, A^t_v) \tag{3}$$

$$\hat{y} = O\big(\{\, h^T_v \mid v \in G \,\}\big) \tag{4}$$

where N_in(·) and N_out(·) represent the in- and out-neighborhoods respectively, ŷ represents the output of the GNN, and M_P^v, M_A^v, U^v and O may be neural networks, with parameter sharing applied as necessary. For example, a node u can use the same network M_P for message passing to all nodes in its out-neighborhood v ∈ N_out(u), or use separate ones for messages to each of these nodes, M_P^v, hence the additional superscript v.
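The following Python sketch (ours; small random linear maps stand in for the networks M_P, U and O, mean aggregation stands in for M_A, and the toy graph is arbitrary) runs one propagation step of equations (1)–(4) with parameters shared across nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # node feature dimension (arbitrary)
# Directed toy graph over 3 nodes: edge (u, v) means u sends a message to v.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
h = {u: rng.normal(size=d) for u in range(3)}   # h_u^0

# Linear stand-ins for the networks in equations (1)-(4); shared across nodes.
W_msg = rng.normal(scale=0.1, size=(d, d))      # M_P
W_upd = rng.normal(scale=0.1, size=(d, 2 * d))  # U
W_out = rng.normal(scale=0.1, size=(1, d))      # O

def propagate(h):
    # (1) messages m_{(u,v)} = M_P(h_u) along each edge, (2) mean-aggregate per receiver
    agg = {v: np.zeros(d) for v in h}
    counts = {v: 0 for v in h}
    for u, v in edges:
        agg[v] += W_msg @ h[u]
        counts[v] += 1
    A = {v: agg[v] / max(counts[v], 1) for v in h}
    # (3) update h_v^{t+1} = U(h_v^t, A_v^t) from the concatenated inputs
    return {v: np.tanh(W_upd @ np.concatenate([h[v], A[v]])) for v in h}

h1 = propagate(h)
# (4) readout over all final node states
y_hat = sum(W_out @ h1[v] for v in h1)
print(y_hat)
```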
===Methods===

We now consider a few different agents as combinations of some of the ideas explained previously. The motivation for these agents and the design choices therein will be described in the following section.

The simplest agent we propose is a multi-agent variation of World Models. Firstly, we note that we assume V and M are shared. Alternatively, they can be trained separately for each agent. Regardless, in the explanations that follow, we will only discuss training the controllers. Owing to the small number of parameters in the controllers of World Models, we can view them as being "policy embeddings" which dictate how each agent acts in the environment. Assuming that each agent's controller is trained separately, let the parameters of the controllers learned be denoted by C_θ^i for agent i, where C_θ^i corresponds to the controller parameters [W_c, b_c] for agent i's controller C^i. If each agent were to act independently, then it would just forward the latent vector embedding of its current state and the hidden state of the LSTM from the MDN-RNN, say [z_t^i, h_t^i], through its controller C^i to get the action it should take. However, in a multi-agent setting, we are often interested in maximizing total reward for a team of agents (particularly so in co-operative settings). So, we define a meta-controller as a controller which takes in [z_t^1, h_t^1, ..., z_t^n, h_t^n] and outputs [π^1_{a_0}, ..., π^1_{a_m}, ..., π^n_{a_0}, ..., π^n_{a_m}], assuming n agents are "under" the team's control and there are m actions, where π^i_{a_j} is the probability of agent i taking action a_j.

The next agent we consider is one where the controllers are trained separately, but a Graph Neural Network style approach is used, reducing the number of parameters and encouraging communication of shared knowledge. To do this, we first concatenate the hidden state vectors h_t^i for each agent i, say H = [h_t^1, ..., h_t^n]. We then use this matrix to construct a graph G_t. At the next timestep, we can do the same to get a graph G_{t+1}. Now, the graph is meant to represent beliefs about how the world works and how the agents relate to each other (which it can do given the information from each agent's memory module). So, we can expect that, if graphs are learned appropriately, we can infer the graph at time t+1 given the graph and the actions taken by each agent [a_t^1, ..., a_t^n] at time t. Thus, we can use supervised learning to train the graph construction. The actual construction process can be done in many ways, but we propose using cosine distance row- and column-wise to capture how entities in the environment and agents relate to each other. With these graphs serving as a proxy for useful information about the environment and agents, they can serve as useful input to each agent's controller.
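To illustrate one possible instantiation of the cosine-distance construction just described (the text above does not fix an exact procedure, so the details below are our assumptions), the sketch builds a pairwise cosine-similarity adjacency matrix over the stacked agent hidden states H:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, hidden_dim = 4, 16                 # toy sizes
H = rng.normal(size=(n_agents, hidden_dim))  # row i = h_t^i from agent i's MDN-RNN

def cosine_adjacency(H, eps=1e-8):
    """Pairwise cosine similarity between agent hidden states,
    used here as the edge weights of the graph G_t."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    H_unit = H / np.maximum(norms, eps)
    A = H_unit @ H_unit.T                    # A[i, j] = cos(h^i, h^j)
    np.fill_diagonal(A, 0.0)                 # no self-edges in this sketch
    return A

G_t = cosine_adjacency(H)
print(np.round(G_t, 2))
```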
Past work has looked at action-conditional next-state prediction, but has often made the implicit assumption that the environment is deterministic. Since many environments are stochastic, and World Models itself includes a parameter τ which controls the environment model's stochasticity, we can look to previous work in distributional RL to consider a better approach to generating and using next-state predictions. Notably, when there are many possible next states, we can integrate over them by their probabilities to get an "averaged" next-state vector. The probability distribution over next states, P(z_{t+1} | a_t, z_t, h_t), is modeled by M in the World Models approach. So, instead of sampling from the mixture density model and using a single sample, we can easily sample repeatedly, or even over some uniform interval of τ, to get multiple possible next states with various probabilities of occurring. Averaging them could give us a more useful z, and this is an approach we wish to study. Using all of the sampled z's separately, as in (Buesing et al. 2018), is also possible.

While the evolutionary approach frees us from the problem of needing to define how to use the rewards, we can consider RL approaches as well. Value-based approaches help with long-term planning, and hence we propose using actor-critic methods where the critic can use distributional RL. Given multiple possible next states for the current state (as explained in the previous paragraph), we can even train a more accurate value estimation network for the distribution of expected returns Z(s, a).

Lastly, in a multi-task setting, there may be tasks of varying types and difficulties. Curriculum learning is a common approach, as the agent can slowly learn tasks of increasing difficulty, without which some tasks may be very hard to solve. We propose having a teacher suggest tasks using a bandit algorithm, EXP3 (Exponential-weight algorithm for Exploration and Exploitation), that keeps track of how each arm (or task, in this case) fares. The general pattern we wish to see here is that, as agents show signs of improving in one task (say, in an average-reward sense), we select that task more until it is essentially solved, and then we move on to another task that is next up in the hierarchy of tasks based on difficulty. Since MARLÖ has the concept of games and tasks, we can use a hierarchical approach whereby we first choose the game and then the task to train on. Alternatively, the teacher can select from all the available tasks directly.
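As a sketch of what such a teacher could look like, here is a minimal EXP3 task selector (ours; feeding it recent learning progress as the bandit reward, scaled to [0, 1], is an assumption rather than something specified above):

```python
import numpy as np

class EXP3Teacher:
    """Minimal EXP3 bandit over tasks; rewards are assumed to be scaled to [0, 1]."""

    def __init__(self, n_tasks, gamma=0.1, seed=0):
        self.n = n_tasks
        self.gamma = gamma
        self.weights = np.ones(n_tasks)
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / self.n

    def select_task(self):
        return self.rng.choice(self.n, p=self.probabilities())

    def update(self, task, reward):
        # Importance-weighted reward estimate, then exponential weight update.
        x_hat = reward / self.probabilities()[task]
        self.weights[task] *= np.exp(self.gamma * x_hat / self.n)

teacher = EXP3Teacher(n_tasks=3)
for _ in range(5):
    task = teacher.select_task()
    progress = np.random.default_rng().uniform()  # stand-in for learning progress on that task
    teacher.update(task, progress)
print(teacher.probabilities())
```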
===Motivation===

In this section, we explain the reasoning behind some of our design decisions and choices.

While model-free approaches such as Deep Q-Networks (Mnih et al. 2015) and A3C (Mnih et al. 2016) and their later variants have been popular and shown promising results in the past, a primary reason for their success has been that they are easier to train. Model-based approaches, on the other hand, are generally more computationally intensive, as there are additional steps to model the MDP. However, there are many benefits to learning models. Importantly, running experiments on even simulated environments can be very expensive as we move towards more complex environments. Minecraft is a modern game, built many years after the Atari 2600 games, and hence, as one would expect, is a much more computationally intensive environment. Learning models allows agents to learn without needing to interact with the environment at hand. We can even combine learning from the environment with planning using the learned model, following the Dyna architecture, i.e., learning from "dreams."

Using models as a proxy for complex environments is thus an obvious advantage, but there are still many other interesting things to note about models and model-based learning in general. One primary argument against model-based learning is that getting good models is very difficult and poor models can hinder learning. However, in both World Models and I2A, the authors note that poor models can still in fact help learning. While no solid reason for this is known, we hypothesize that this may be due to some state-aggregation-like effect, or that the approximated MDP has some strong correlation to the true underlying MDP. In other words, the model is at least able to learn some useful clustering of the states, or to learn reward and transition functions that are similar to the true reward and transition functions but differ in small ways.

Opponent modeling is another aspect that past research has considered. However, with limited access to opponent states and policies, as well as partial observability, it can be a very hard problem. Models allow opponents to simply be modeled as part of the environment and thus sidestep the need to incorporate opponent modeling in agents. However, there is still a question of how to set up opponent policies while learning the model. If setting opponent policies randomly does not work, we can use self-play with some noisy version of the learned policies.

World Models in particular is appealing for a few concrete reasons. Firstly, Minecraft is partially observable, so the hidden state h of the LSTM in the MDN-RNN module could be very useful in making the POMDP more like an MDP. The hidden state encodes information about past observations and actions taken. Hence, h combined with the current state encoding z represents a new state that encodes the history as well.

Secondly, as the tasks in the MARLÖ competition involve multi-agent multi-task scenarios, we need to think about how to train the agents. With World Models, we can share V, the state feature-extraction module, across the agents and games. The memory module M can also be shared, with some consideration for how to train it given that agents may have differing roles. The main point, however, is that World Models is structured in such a way that we can share many modules between agents, reducing computational costs and introducing some useful priors. While such sharing of state representations might be possible in model-free algorithms, the learned representations can be quite different given how the architectures are trained (supervised learning vs. RL, and hence with different loss functions).

As a related note, even the structure of the controller, C, is amenable to the multi-agent setting. Since most of the complexity, in terms of the number of parameters, is in the V and M models, C is very small. With multiple agents, we can train a meta-controller with far fewer parameters than if we were using separate model-free architectures for each agent.

Finally, we discuss some motivations behind taking a certain approach to training the controller, C, using RL. Naively using CMA-ES in a multi-task cooperative setting with each controller being trained separately would lead to an exponential blow-up whereby we need to perform a combinatorial search to find good parameters for each agent in order to maximize "global" reward. Also, notably, while evolutionary algorithms and the controller architecture are suited to policy search, complex tasks often require long-term planning and hence value functions are useful. This is the same dichotomy seen in continuous control tasks, where policy-based methods generally perform much better, while in games where some amount of planning is necessary, like Atari 2600 games or Go, value-based methods have generally done much better. Hence, we believe we should use parameter sharing or meta-controllers (to deal with the exponential blow-up) and actor-critic methods, given how Minecraft combines the need for impulsive quick actions with long-term planning.

Recently, we have also seen that deep RL seems to benefit from learning distributions even when acting in an expected sense. While it is still unclear why, one reasonable argument is that there seems to be some bias induced in the deep RL setting when collapsing distributions to their means. And, indeed, recent work has shown that distributional RL does not seem to be of much help in the linear function approximation case (Lyle and Bellemare 2018). Given that we pushed for using actor-critic above, a natural choice for the critic is then to see how predicting return distributions could affect performance, as done in Rainbow (Hessel et al. 2017). Additionally, given the actor-critic approach and the fact that we have access to a simulator, counterfactual policy gradients (Foerster et al. 2017) can also be used.

===Related work===

There have been multiple approaches recently in the area of multi-task multi-agent scenarios, particularly on the Minecraft platform Malmö, such as (Xiong et al. 2018). They propose the novel model HogRider to solve the Pig Chase task. The objective of the task is to chase randomly spawned pigs in a 9 × 9 grid with collaboration. Since the pigs move after the task is initialized, the environment is non-stationary, and hence the task can be viewed as a multi-agent team-based scenario.

The multi-agent problem of learning from a single scalar reward signal has generally been quite difficult to approach but has recently been looked at. Value Decomposition Networks (Sunehag et al. 2017) look to decompose the team value function into agent value functions and find that the method fares better when weight sharing and information channels are used. Again, this is in fact a strong argument for using a model-based approach whereby many of the feature representations are shared across agents.

A recent retro game contest held by OpenAI has tackled various scenarios similar to the MARLÖ contest (Gray and Brown 2018) and has further brought attention to the need for adaptive exploration and meta-learning in RL.

===Discussion and Future Work===

We believe this study is particularly useful and well motivated because most of the challenges faced by agents evaluated in multi-task/multi-agent settings have to do with understanding the interactions between agents and their respective policies. Our approach attempts to alleviate these difficulties with a Graph Neural Network and policy embedding approach. We can further address issues of uncertainty and model interpretability using distributional RL. Finally, we utilize curriculum learning to address agents' need to balance performance on multiple tasks of varying difficulties, intelligently deciding which tasks to train on and alleviating issues with catastrophic forgetting.

The immediate next steps are to continue implementing the above ideas into an agent and to evaluate its performance in the MARLÖ competition compared to other common benchmark agents. We also wish to look into how to construct simulated environments with various types of opponents and to run ablation tests to see which of the ideas detailed above lead to the most improvement. We will keep our progress and thoughts up to date at this webpage: https://sites.google.com/view/rainbowmarlo/.
===References===

* Arulkumaran, K.; Deisenroth, M. P.; Brundage, M.; and Bharath, A. A. 2017. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.
* Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
* Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.
* Buesing, L.; Weber, T.; Racaniere, S.; Eslami, S.; Rezende, D.; Reichert, D. P.; Viola, F.; Besse, F.; Gregor, K.; Hassabis, D.; et al. 2018. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006.
* Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. ArXiv e-prints.
* Gray, J., and Brown, T. 2018. Retro contest.
* Ha, D., and Schmidhuber, J. 2018. World models. arXiv preprint arXiv:1803.10122.
* Hansen, N., and Ostermeier, A. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9(2):159–195.
* Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2017. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298.
* LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.
* Lyle, C., and Bellemare, M. G. 2018. As expected? An analysis of distributional reinforcement learning.
* Mitchell, T. M. 1980. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ., New Jersey.
* Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
* Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
* Morimura, T.; Sugiyama, M.; Kashima, H.; Hachiya, H.; and Tanaka, T. 2010. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 799–806.
* Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W. M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J. Z.; Tuyls, K.; and Graepel, T. 2017. Value-decomposition networks for cooperative multi-agent learning. ArXiv e-prints.
* Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement learning: An introduction. MIT Press.
* Wang, T.; Liao, R.; Ba, J.; and Fidler, S. 2018. NerveNet: Learning structured policy with graph neural networks.
* Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279–292.
* Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
* Xiong, Y.; Chen, H.; Zhao, M.; and An, B. 2018. HogRider: Champion agent of Microsoft Malmo collaborative AI challenge.