Pommerman: A Multi-Agent Playground

Cinjon Resnick* (NYU), Wes Eldridge (Rebellious Labs), David Ha (Google Brain), Denny Britz (Stanford University), Jakob Foerster (University of Oxford), Julian Togelius (NYU), and Kyunghyun Cho and Joan Bruna (NYU, FAIR)

* Correspondence: cinjon@nyu.edu

Abstract

We present Pommerman, a multi-agent environment based on the classic console game Bomberman. Pommerman consists of a set of scenarios, each having at least four players and containing both cooperative and competitive aspects. We believe that success in Pommerman will require a diverse set of tools and methods, including planning, opponent/teammate modeling, game theory, and communication, and consequently that it can serve well as a multi-agent benchmark. To date, we have already hosted one competition, and our next one will be featured in the NIPS 2018 competition track.

Why Pommerman

In this section, we provide our motivation and goals for both the Pommerman benchmark and the NIPS 2018 competition. Currently, there is no consensus benchmark involving either general-sum game settings or settings with at least three players. Instead, recent progress has focused on two-player zero-sum games such as Go and Chess. We believe that the Pommerman environment can assume this role for multi-agent learning. Additionally, we are organizing competitions for Pommerman because we believe that they are a strong way to push forward the state of the art and can contribute to lasting results for years to come.

Multi-Agent Learning

Historically, a majority of multi-agent research has focused on zero-sum two-player games. For example, computer competitions for Poker and Go over the past fifteen years have been vital for developing methods culminating in recent superhuman performance (Moravčík et al. 2017; Brown and Sandholm 2017; Bowling et al. 2017; Silver et al. 2016). These benchmarks have also led to the discovery of new algorithms and approaches like Monte Carlo Tree Search (Vodopivec, Samothrakis, and Šter 2017; Browne et al. 2012; Kocsis and Szepesvári 2006; Coulom 2006) and Counterfactual Regret Minimization (Zinkevich et al. 2008).

We believe that an aspect restraining the field from progressing towards general-sum research and scenarios with more than two players is the lack of suitable environments. We propose Pommerman as a solution.

Pommerman is stylistically similar to Bomberman (Bomberman 1983), the famous game from Nintendo. At a high level, there are at least four agents, all traversing a grid world. Each agent's goal is to have their team be the last remaining. They can plant bombs that, upon expiration, destroy anything (but rigid walls) in their vicinity. The game contains both adversarial and cooperative elements. The Free-For-All (FFA) variant has at most one winner and, because there are four players, encourages research directions that can handle situations where the Nash payoffs are not all equivalent. The team variants encourage research with and without explicit communication channels, including scenarios where the agent has to cooperate with previously unseen teammates. The former is a recently burgeoning subfield of multi-agent learning (Foerster et al. 2016; Resnick et al. 2018d; Evtimova et al. 2017; Foerster et al. 2017; Lewis et al. 2017; Mordatch and Abbeel 2017; Lazaridou et al. 2018) with established prior work as well (Steels 1999; 2003; Levy and Kirby 2006; Fehervari and Elmenreich 2010), while the latter has been underexplored.

We aim for the Pommerman benchmark to provide for multi-agent learning what the Arcade Learning Environment (Bellemare et al. 2013) provided for single-agent reinforcement learning and ImageNet (Deng et al. 2009) for image recognition. Beyond game theory and communication, Pommerman can also serve as a testbed for research into reinforcement learning, planning, and opponent/teammate modeling.
RoboCup Soccer (Nardi et al. 2014) is a similar competition that has been running since 1997. There, eleven agents per side play soccer. Key differences between Pommerman and RoboCup Soccer are:

1. Pommerman includes an explicit communication channel. This changes the dynamics of the game and adds new research avenues.

2. Pommerman strips away the sensor input, which means that the game is less apt for robotics but more apt for studying other aspects of AI, games, and strategy.

3. Pommerman uses low-dimensional, discrete control and input representations instead of continuous ones. We believe this makes it easier to focus on the high-level strategic aspects rather than low-level mechanics.

4. In team variants, the default Pommerman setup has only two agents per side, which makes it more amenable to burgeoning fields like emergent communication that encounter training difficulties with larger numbers of agents.

5. Pommerman's FFA variant promotes research that does not reduce to a 1v1 game, which means that a lot of the theory underlying such games (like RoboCup Soccer) is not applicable.

The second, third, and fourth differences above are a positive or negative trade-off depending on one's research goals.

Another, more recent, benchmark is Half-Field Offense (Hausknecht et al. 2016), a modification of RoboCup that reduces the complexity and focuses on decision-making in a simplified subtask. However, unlike the FFA scenario in Pommerman, Half-Field Offense is limited to being a zero-sum game between two teams.

In general, the communities that we want to attract to benchmark their algorithms have not gravitated towards RoboCup but have instead relied on a large number of one-off toy tasks. This is especially true for multi-agent deep RL. We think that the reasons for this could be among the five differences above. Consequently, Pommerman has the potential to unite these communities, especially when considering that future versions can be expanded to more than four agents.
High Quality Benchmark

There are attributes that are common to the best benchmarks beyond satisfying the community's research direction. These include having mechanics and gameplay that are intuitive for humans, being fun to play and watch, being easy to integrate into common research setups, and having a learning problem that is not too difficult for the current state of method development. Most games violate at least one of these (see Table 1). For example, the popular game Defense of the Ancients (OpenAI 2018) is intuitive and fun, but extremely difficult to integrate. On the other hand, the card game Bridge is easy to integrate, but it is not intuitive; the gameplay and mechanics are slow to learn, and there is a steep learning curve to understanding strategy.

Pommerman satisfies these requirements. People have no trouble understanding basic strategy and mechanics. It is fun to play and to watch, having been developed by Nintendo for two decades. Additionally, we have purposefully based the state input not on pixel observations but rather on a symbolic interpretation, so that building learning agents does not require large amounts of compute.

Research game competitions disappear for two reasons: either the administrators stop running them or participants stop submitting entrants. This can be due to the game being 'solved', but it can also be because the game was not enjoyable or accessible enough. We view Pommerman as having a long life ahead of it. Beyond surface hyperparameters like board size and number of walls, early forays suggest that there are many aspects of the game that can be modified to create a rich and long-lasting research challenge and competition venue. These include partial observability of the board, playing with random teammates, communication among the agents, adding power-ups, and learning to play with human players.

These potential extensions, and the fact that N-player learning by itself has few mathematical guarantees, suggest that Pommerman will be a challenging and fruitful testbed for years to come.

There are, however, limitations to this environment. One difficulty is that a local optimum arises where the agent avoids exploding itself by learning to never use the bomb action. In the long term, this is ineffective because the agent needs to use the bomb to destroy other agents. Players have successfully solved this challenge (Resnick et al. 2018a), but it is an aspect of basic gameplay that has to be handled in order for the multi-agent research benefits to become apparent.

Game                Intuitive?  Fun?  Integration?
Bridge              1           3     5
Civilization        2           3     1
Counterstrike       5           5     2
Coup                4           5     5
Diplomacy           1           4     3
DoTA                3           5     2
Hanabi              2           3     5
Hearthstone         1           4     1
Mario Maker         4           5     3
Pommerman           5           4     5
PUBG                5           5     1
Rocket League       5           4     1
Secret Hitler       4           4     3
Settlers of Catan   4           3     3
Starcraft 2         3           5     5
Super Smash         5           5     1

Table 1: Comparing multi-agent games along three important axes for uptake, beyond whether the game satisfies the community's intended research direction. Attributes are rated on a 1-5 scale, where 5 represents the highest value. Fun takes into account both watching and playing the game. The Intuitive and Fun qualities, while subjective, are noted because they have historically been factors in whether a game is used in research.
Description

In this section, we give details of the Pommerman environment. Note that all of the code to run the game and train agents can be found in our git repository (Resnick et al. 2018b), while our website (pommerman.com) contains further information on how to submit agents.

Game Information

[Figure 1: Pommerman start state. Each agent begins in one of four positions. Yellow squares are wood, brown are rigid, and the gray are passages.]

As previously mentioned, Pommerman is stylistically similar to Bomberman. Every battle starts on a randomly drawn symmetric 11x11 grid ('board') with four agents, one in each corner. Teammates start on opposite corners.

In team variants, the game ends when both players on one team have been destroyed. In FFA, it ends when at most one agent remains alive. The winning team is the one that has remaining members. Ties can happen when the game does not end before the maximum number of steps or when the last agents are destroyed on the same turn. If this happens in competitions, we will rerun the game. If it reoccurs, then we will rerun the game with collapsing walls until there is a winner. This is a variant where, after a fixed number of steps, the game board becomes smaller according to a specified cadence. We have a working example in the repository.

Besides the agents, the board consists of wooden and rigid walls. We guarantee that the agents will have an accessible path to each other. Initially, this path is occluded by wooden walls. See Figure 1 for a visual reference.

Rigid walls are indestructible and impassable. Wooden walls can be destroyed by bombs. Until they are destroyed, they are impassable. After they are destroyed, they become either a passage or a power-up.

On every turn, agents choose one of six actions:

1. Stop: This action is a pass.
2. Up: Move up on the board.
3. Left: Move left on the board.
4. Down: Move down on the board.
5. Right: Move right on the board.
6. Bomb: Lay a bomb.

Additionally, if this is a communicative scenario, then the agent emits a message every turn consisting of two words from a dictionary of size eight. These words are passed to its teammate in the next step as part of the observation. In total, the agent receives the following observation each turn:

• Board: 121 Ints. The flattened board. In partially observed variants, all squares outside of the 5x5 purview around the agent's position are covered with the value for fog (5).
• Position: 2 Ints, each in [0, 10]. The agent's (x, y) position in the grid.
• Ammo: 1 Int. The agent's current ammo.
• Blast Strength: 1 Int. The agent's current blast strength.
• Can Kick: 1 Int, 0 or 1. Whether the agent can kick bombs or not.
• Teammate: 1 Int in [-1, 3]. Which agent is this agent's teammate. In non-team variants, this is -1.
• Enemies: 3 Ints in [-1, 3]. Which agents are this agent's enemies. In team variants, the third int is -1.
• Bomb Blast Strength: List of Ints. The blast strength for each of the bombs in the agent's purview.
• Bomb Life: List of Ints. The remaining life for each of the bombs in the agent's purview.
• Message: 2 Ints in [0, 8]. The message being relayed from the teammate. Both Ints are zero only when the teammate is dead or it is the first step. This field is not included in variants without cheap talk.

The agent starts with one bomb ('ammo'). Every time it lays a bomb, its ammo decreases by one. After that bomb explodes, its ammo increases by one. The agent also has a blast strength, which starts at two. Every bomb it lays is imbued with the current blast strength, which determines how far in the vertical and horizontal directions that bomb's explosion reaches.

A bomb has a life of ten time steps. Upon expiration, the bomb explodes and any wooden walls, agents, power-ups, or other bombs within reach of its blast strength are destroyed. Bombs destroyed in this manner chain their explosions.

Power-Ups: Half of the wooden walls have hidden power-ups that are revealed when the wall is destroyed. These are:

• Extra Bomb: Picking this up increases the agent's ammo by one.
• Increase Range: Picking this up increases the agent's blast strength by one.
• Can Kick: Picking this up permanently allows an agent to kick bombs by moving into them. The bombs travel in the direction that the agent was moving, at one unit per time step, until they are impeded by a player, a bomb, or a wall.
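To make the interface concrete, the following is a minimal game-loop sketch modeled on the example script in our repository (Resnick et al. 2018b). The environment id 'PommeFFACompetition-v0' and the helper names mirror the repository's README at the time of writing but may differ across versions, so treat this as illustrative rather than canonical.

    import pommerman
    from pommerman import agents


    def main():
        # Four copies of the provided baseline agent in the FFA configuration.
        agent_list = [agents.SimpleAgent() for _ in range(4)]
        env = pommerman.make('PommeFFACompetition-v0', agent_list)

        state = env.reset()
        done = False
        while not done:
            env.render()
            # Each agent maps its observation (the fields listed above) to
            # one of the six discrete actions.
            actions = env.act(state)
            state, reward, done, info = env.step(actions)
        env.close()
        print(info)


    if __name__ == '__main__':
        main()

A learning agent slots into this loop by subclassing the base agent class and overriding its act method; the rest of the machinery stays the same.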
Early results

The environment has been public since late February, and the competitions were first announced in late March. In that time, we have seen a strong community gather around the game, with more than 500 people in the Discord server (https://discord.gg/mtW7kp) and more than half of the repository commits coming from open-source contributors.

There have also been multiple published papers using Pommerman (Resnick et al. 2018a; Zhou et al. 2018). These demonstrate that the environment is challenging and that we do not yet know what the optimal solutions are in any of the variants. In particular, the agents in (Resnick et al. 2018a) discover a novel way of playing where they treat the bombs as projectiles by laying them and then kicking them at opponents. This is a strategy that not even novice humans attempt, yet the agents use it to achieve a high success rate.

Preliminary analysis suggests that the game can be very challenging for reinforcement learning algorithms out of the box. Without a very large batch size and a shaped reward (Ng, Harada, and Russell 1999), neither Deep Q-Learning (Mnih et al. 2013) nor Proximal Policy Optimization (Schulman et al. 2017) learned to successfully play the game against the provided baseline agent ('SimpleAgent'). One reason for this is that the game has a (previously mentioned) unique feature in that the bomb action is highly correlated with losing but must be wielded effectively to win.

We also tested the effectiveness of DAgger (Daumé, Langford, and Marcu 2009) in bootstrapping agents to match the SimpleAgent. We found that, while somewhat sensitive to hyperparameter choices, it was nonetheless effective at yielding agents that could play at or above the FFA win rate of a single SimpleAgent (∼20%). This is below the 25% that chance would suggest because four SimpleAgents will draw a large percentage of the time.
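The 'shaped reward' above refers to potential-based shaping in the sense of Ng, Harada, and Russell (1999). The sketch below is a generic illustration of that idea applied to one agent's reward stream, not the specific shaping used in our experiments; the potential function and the lowercase observation keys ('ammo', 'blast_strength', 'can_kick') are assumptions for illustration.

    def potential(obs):
        # Hypothetical potential over one agent's observation dictionary.
        # Any function of the state alone preserves the optimal policy
        # under potential-based shaping; this one mildly rewards gathering
        # ammo, blast strength, and the kicking ability.
        return (0.1 * obs['ammo']
                + 0.1 * obs['blast_strength']
                + 0.2 * float(obs['can_kick']))


    def shape_reward(env_reward, prev_obs, obs, gamma=0.99):
        # Potential-based shaping (Ng, Harada, and Russell 1999):
        # r_shaped = r + gamma * phi(s') - phi(s)
        return env_reward + gamma * potential(obs) - potential(prev_obs)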
Competitions

In this section, we describe the Pommerman competitions. This includes both the upcoming NIPS 2018 event and the FFA competition that we have already run.

FFA competition

We ran a preliminary competition on June 3rd, 2018. We did not advertise this widely other than within our Discord social group (https://discord.gg/mtW7kp), nor did we have any prizes for it. Even so, we had a turnout of eight competitors who submitted working agents by the May 31st deadline.

The competition environment was the FFA variant (Resnick et al. 2018c), in which four agents enter and all are opponents. The top two agents were submitted by Görög Márton and a team led by Yichen Gong, with the latter being the strongest.

Görög's agent improved upon the repository's baseline agent through a number of edits. On the other hand, Yichen's agent was a redesign implementing a Finite State Machine Tree-Search approach (Zhou et al. 2018). They respectively won 8 and 22 of their 35 matches (with a number of the remaining matches being ties).

NIPS Competition

The NIPS competition will be held live at NIPS 2018, and competitors are required to submit a team of two agents by November 21st, 2018. The featured environment will be the partially observable team variant without communication. Otherwise, we will be reusing the machinery that we developed to run the FFA competition.

Submitting Agents

We run the competitions using Docker and expect submissions to be accompanied by a Docker file that we can build on the game servers. For FFA competitions, this entails submitting a (possibly private) repository containing one Docker file representing the agent. For team competitions, the submission should contain two Docker files representing the two agents. Instructions and an example for building Docker containers from trained agents can be found in our repository (Resnick et al. 2018b).

The agents should follow the convention specified in our example code and expose an 'act' endpoint that accepts the dictionary of observations. Because we are using Docker containers and HTTP requests, we do not have any requirements for programming language or framework.

The expected response from the agent is a single integer in [0, 5] representing which of the six actions the agent would like to take. In variants with messages, we also expect two more integers in [1, 8] representing the message. If an agent does not respond within the time limit for our competition constraints (100ms), then we will automatically issue it the Stop action and, if appropriate, have it send the message (0, 0). This timeout is an aspect of the competition and not native to the game itself.
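To illustrate the submission interface, here is a hedged sketch of an agent server that could sit behind a submitted Docker file. The '/action' route, the port, the JSON field names, and the use of Flask are assumptions made for illustration; the authoritative convention is the example code in our repository (Resnick et al. 2018b).

    import random

    from flask import Flask, jsonify, request

    app = Flask(__name__)


    @app.route('/action', methods=['POST'])
    def act():
        # The competition server POSTs the observation dictionary described
        # in the Game Information section (board, position, ammo, and so on).
        obs = request.get_json(force=True)  # ignored by this placeholder policy

        # Placeholder policy: pick uniformly among the six discrete actions.
        # A real submission would query its trained model here and must
        # answer within the 100ms competition time limit.
        action = random.randint(0, 5)

        # In message variants, two extra integers in [1, 8] would be
        # returned alongside the action.
        return jsonify({'action': action})


    if __name__ == '__main__':
        # The submission's Docker file would launch this server.
        app.run(host='0.0.0.0', port=10080)

Because the agent is just an HTTP service inside a container, any language or framework that can answer such requests within the time limit is acceptable.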
Conclusion

In this paper, we have introduced the Pommerman environment, detailed why it is a strong setup for multi-agent research, and described early results and competitions. All of the code is readily available at our git repository (github.com/MultiAgentLearning/playground), and further information about competitions, including NIPS 2018, is on our website (pommerman.com).

Acknowledgments

We are especially grateful to Roberta Raileanu, Sanyam Kapoor, Lucas Beyer, Stephan Uphoff, and the whole Pommerman Discord community for their contributions, as well as Jane Street, Facebook AI Research, Google Cloud, and NVidia Research for their sponsorship.

References

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res. 47(1):253–279.

Bomberman. 1983. Wikipedia: Bomberman. https://en.wikipedia.org/wiki/Bomberman.

Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2017. Heads-up limit hold'em poker is solved. Commun. ACM 60(11):81–88.

Brown, N., and Sandholm, T. 2017. Libratus: The superhuman AI for no-limit poker. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 5226–5228.

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In van den Herik, H. J.; Ciancarini, P.; and Donkers, H. H. L. M., eds., Computers and Games, volume 4630 of Lecture Notes in Computer Science, 72–83. Springer.

Daumé, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3):297–325.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009.

Evtimova, K.; Drozdov, A.; Kiela, D.; and Cho, K. 2017. Emergent language in a multi-modal, multi-step referential game. CoRR abs/1705.10369.

Fehervari, I., and Elmenreich, W. 2010. Evolving neural network controllers for a team of self-organizing robots. Journal of Robotics.

Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.

Foerster, J. N.; Nardelli, N.; Farquhar, G.; Torr, P. H. S.; Kohli, P.; and Whiteson, S. 2017. Stabilising experience replay for deep multi-agent reinforcement learning. CoRR abs/1702.08887.

Hausknecht, M.; Mupparaju, P.; Subramanian, S.; Kalyanakrishnan, S.; and Stone, P. 2016. Half field offense: An environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, 282–293. Berlin, Heidelberg: Springer-Verlag.

Lazaridou, A.; Hermann, K. M.; Tuyls, K.; and Clark, S. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations.

Levy, S. D., and Kirby, S. 2006. Evolving distributed representations for language with self-organizing maps. In Vogt, P.; Sugita, Y.; Tuci, E.; and Nehaniv, C. L., eds., EELC, volume 4211 of Lecture Notes in Computer Science, 57–71. Springer.

Lewis, M.; Yarats, D.; Dauphin, Y. N.; Parikh, D.; and Batra, D. 2017. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. NIPS Deep Learning Workshop 2013.

Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. H. 2017. DeepStack: Expert-level artificial intelligence in no-limit poker. CoRR abs/1701.01724.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.

Nardi, D.; Noda, I.; Ribeiro, F.; Stone, P.; von Stryk, O.; and Veloso, M. 2014. RoboCup soccer leagues. AI Magazine 35(3):77–85.

Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, 278–287.

OpenAI. 2018. Dota 2. https://blog.openai.com/dota-2/.

Resnick, C.; Raileanu, R.; Kapoor, S.; Peysakhovich, A.; Cho, K.; and Bruna, J. 2018a. Backplay: "Man muss immer umkehren". ArXiv e-prints.

Resnick, C.; Eldridge, W.; Britz, D.; and Ha, D. 2018b. Playground: AI research into multi-agent learning. https://github.com/MultiAgentLearning/playground.

Resnick, C.; Eldridge, W.; Britz, D.; and Ha, D. 2018c. Pommerman FFA competition environment. https://github.com/MultiAgentLearning/playground/blob/master/pommerman/configs.py#L20.

Resnick, C.; Kulikov, I.; Cho, K.; and Weston, J. 2018d. Vehicle community strategies. CoRR abs/1804.07178.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Steels, L. 1999. The Talking Heads Experiment. Volume 1. Words and Meanings. Antwerpen: Laboratorium.

Steels, L. 2003. Evolving grounded communication for robots. Trends in Cognitive Sciences 7:308–312.

Vodopivec, T.; Samothrakis, S.; and Šter, B. 2017. On Monte Carlo tree search and reinforcement learning. J. Artif. Int. Res. 60(1):881–936.

Zhou, H.; Gong, Y.; Mugrai, L.; Khalifa, A.; Nealen, A.; and Togelius, J. 2018. A hybrid search agent in Pommerman. In The International Conference on the Foundations of Digital Games (FDG).

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in Neural Information Processing Systems 20, 1729–1736. Curran Associates, Inc.