Pommerman: A Multi-Agent Playground

Cinjon Resnick* (NYU), Wes Eldridge (Rebellious Labs), David Ha (Google Brain), Denny Britz (Stanford University), Jakob Foerster (University of Oxford), Julian Togelius (NYU), and Kyunghyun Cho and Joan Bruna (NYU, FAIR)

* Correspondence: cinjon@nyu.edu

Abstract

We present Pommerman, a multi-agent environment based on the classic console game Bomberman. Pommerman consists of a set of scenarios, each having at least four players and containing both cooperative and competitive aspects. We believe that success in Pommerman will require a diverse set of tools and methods, including planning, opponent/teammate modeling, game theory, and communication, and consequently that it can serve well as a multi-agent benchmark. To date, we have already hosted one competition, and our next one will be featured in the NIPS 2018 competition track.

Why Pommerman

In this section, we provide our motivation and goals for both the Pommerman benchmark and the NIPS 2018 competition. Currently, there is no consensus benchmark involving either general-sum game settings or settings with at least three players. Instead, recent progress has focused on two-player zero-sum games such as Go and Chess. We believe that the Pommerman environment can assume this role for multi-agent learning. Additionally, we are organizing competitions for Pommerman because we believe that they are a strong way to push forward the state of the art and can contribute to lasting results for years to come.

Multi-Agent Learning

Historically, a majority of multi-agent research has focused on zero-sum two-player games. For example, computer competitions for Poker and Go over the past fifteen years have been vital for developing methods culminating in recent superhuman performance (Moravčík et al. 2017; Brown and Sandholm 2017; Bowling et al. 2017; Silver et al. 2016). These benchmarks have also led to the discovery of new algorithms and approaches like Monte Carlo Tree Search (Vodopivec, Samothrakis, and Šter 2017; Browne et al. 2012; Kocsis and Szepesvári 2006; Coulom 2006) and Counterfactual Regret Minimization (Zinkevich et al. 2008).

We believe that an aspect restraining the field from progressing towards general-sum research and scenarios with more than two players is the lack of suitable environments. We propose Pommerman as a solution.

Pommerman is stylistically similar to Bomberman (Bomberman 1983), the famous game from Nintendo. At a high level, there are at least four agents, all traversing a grid world. Each agent's goal is to have their team be the last remaining. They can plant bombs that, upon expiration, destroy anything (but rigid walls) in their vicinity. The game contains both adversarial and cooperative elements. The Free-For-All (FFA) variant has at most one winner and, because there are four players, encourages research directions that can handle situations where the Nash payoffs are not all equivalent. The team variants encourage research with and without explicit communication channels, including scenarios where the agent has to cooperate with previously unseen teammates. The former is a recently burgeoning subfield of multi-agent learning (Foerster et al. 2016; Resnick et al. 2018d; Evtimova et al. 2017; Foerster et al. 2017; Lewis et al. 2017; Mordatch and Abbeel 2017; Lazaridou et al. 2018) with established prior work as well (Steels 1999; 2003; Levy and Kirby 2006; Fehervari and Elmenreich 2010), while the latter has been underexplored.

We aim for the Pommerman benchmark to provide for multi-agent learning what the Arcade Learning Environment (Bellemare et al. 2013) provided for single-agent reinforcement learning and ImageNet (Deng et al. 2009) for image recognition. Beyond game theory and communication, Pommerman can also serve as a testbed for research into reinforcement learning, planning, and opponent/teammate modeling.
RoboCup Soccer (Nardi et al. 2014) is a similar competition that has been running since 1997. There, eleven agents per side play soccer. Key differences between Pommerman and RoboCup Soccer are:

1. Pommerman includes an explicit communication channel. This changes the dynamics of the game and adds new research avenues.

2. Pommerman strips away the sensor input, which means that the game is less apt for robotics but more apt for studying other aspects of AI, games, and strategy.

3. Pommerman uses low-dimensional, discrete control and input representations instead of continuous ones. We believe this makes it easier to focus on the high-level strategic aspects rather than low-level mechanics.

4. In team variants, the default Pommerman setup has only two agents per side, which makes it more amenable to burgeoning fields like emergent communication that encounter training difficulties with larger numbers of agents.

5. Pommerman's FFA variant promotes research that does not reduce to a 1v1 game, which means that a lot of the theory underlying such games (like RoboCup Soccer) is not applicable.

The second, third, and fourth differences above are a positive or negative trade-off depending on one's research goals.

Another, more recent, benchmark is Half-Field Offense (Hausknecht et al. 2016), a modification of RoboCup that reduces the complexity and focuses on decision-making in a simplified subtask. However, unlike the FFA scenario in Pommerman, Half-Field Offense is limited to being a zero-sum game between two teams.

In general, the communities that we want to attract to benchmark their algorithms have not gravitated towards RoboCup but have instead relied on a large number of one-off toy tasks. This is especially true for multi-agent deep RL. We think that the reasons for this could be among the five differences above. Consequently, Pommerman has the potential to unite these communities, especially when considering that future versions can be expanded to more than four agents.
High Quality Benchmark

There are attributes that are common to the best benchmarks beyond satisfying the community's research direction. These include having mechanics and gameplay that are intuitive for humans, being fun to play and watch, being easy to integrate into common research setups, and having a learning problem that is not too difficult for the current state of method development. Most games violate at least one of these (see Table 1). For example, the popular game Defense of the Ancients (OpenAI 2018) is intuitive and fun, but extremely difficult to integrate. On the other hand, the card game Bridge is easy to integrate, but it is not intuitive; the gameplay and mechanics are slow to learn, and there is a steep learning curve to understanding strategy.

Pommerman satisfies these requirements. People have no trouble understanding basic strategy and mechanics. It is fun to play and to watch, having been developed by Nintendo for two decades. Additionally, we have purposefully based the state input not on pixel observations but rather on a symbolic interpretation, so that building learning agents does not require large amounts of compute.

Research game competitions disappear for two reasons: either the administrators stop running them or participants stop submitting entrants. This can be due to the game being 'solved', but it can also be because the game was not enjoyable or accessible enough. We view Pommerman as having a long life ahead of it. Beyond surface hyperparameters like board size and number of walls, early forays suggest that there are many aspects of the game that can be modified to create a rich and long-lasting research challenge and competition venue. These include partial observability of the board, playing with random teammates, communication among the agents, adding power-ups, and learning to play with human players.

These potential extensions, and the fact that N-player learning by itself has few mathematical guarantees, suggest that Pommerman will be a challenging and fruitful testbed for years to come.

There are, however, limitations to this environment. One difficulty is that a local optimum arises where the agent avoids exploding itself by learning to never use the bomb action. In the long term, this is ineffective because the agent needs to use the bomb to destroy other agents. Players have successfully solved this challenge (Resnick et al. 2018a), but it is an aspect of basic gameplay that has to be handled in order for the multi-agent research benefits to become apparent.

Game                Intuitive?  Fun?  Integration?
Bridge              1           3     5
Civilization        2           3     1
Counterstrike       5           5     2
Coup                4           5     5
Diplomacy           1           4     3
DoTA                3           5     2
Hanabi              2           3     5
Hearthstone         1           4     1
Mario Maker         4           5     3
Pommerman           5           4     5
PUBG                5           5     1
Rocket League       5           4     1
Secret Hitler       4           4     3
Settlers of Catan   4           3     3
Starcraft 2         3           5     5
Super Smash         5           5     1

Table 1: Comparing multi-agent games along three important axes for uptake, beyond whether the game satisfies the community's intended research direction. Attributes are rated on a 1-5 scale, where 5 represents the highest value. Fun takes into account both watching and playing the game. The Intuitive and Fun qualities, while subjective, are noted because they have historically been factors in whether a game is used in research.
Description

In this section, we give details of the Pommerman environment. Note that all of the code to run the game and train agents can be found in our git repository (Resnick et al. 2018b), while our website (pommerman.com) contains further information on how to submit agents.

Game Information

[Figure 1: Pommerman start state. Each agent begins in one of four positions. Yellow squares are wood, brown are rigid, and the gray are passages.]

As previously mentioned, Pommerman is stylistically similar to Bomberman. Every battle starts on a randomly drawn symmetric 11x11 grid ('board') with four agents, one in each corner. Teammates start on opposite corners.

In team variants, the game ends when both players on one team have been destroyed. In FFA, it ends when at most one agent remains alive. The winning team is the one that has remaining members. Ties can happen when the game does not end before the maximum number of steps or when the last agents are destroyed on the same turn. If this happens in competitions, we will rerun the game. If it reoccurs, then we will rerun the game with collapsing walls until there is a winner. This is a variant where, after a fixed number of steps, the game board becomes smaller according to a specified cadence. We have a working example in the repository.

Besides the agents, the board consists of wooden and rigid walls. We guarantee that the agents will have an accessible path to each other. Initially, this path is occluded by wooden walls. See Figure 1 for a visual reference.

Rigid walls are indestructible and impassable. Wooden walls can be destroyed by bombs. Until they are destroyed, they are impassable. After they are destroyed, they become either a passage or a power-up.

On every turn, agents choose one of six actions:

1. Stop: This action is a pass.
2. Up: Move up on the board.
3. Left: Move left on the board.
4. Down: Move down on the board.
5. Right: Move right on the board.
6. Bomb: Lay a bomb.

Additionally, if this is a communicative scenario, then the agent emits a message every turn consisting of two words from a dictionary of size eight. These words are passed to its teammate in the next step as part of the observation. In total, the agent receives the following observation each turn:

• Board: 121 Ints. The flattened board. In partially observed variants, all squares outside of the 5x5 purview around the agent's position are covered with the value for fog (5).
• Position: 2 Ints, each in [0, 10]. The agent's (x, y) position in the grid.
• Ammo: 1 Int. The agent's current ammo.
• Blast Strength: 1 Int. The agent's current blast strength.
• Can Kick: 1 Int, 0 or 1. Whether the agent can kick bombs or not.
• Teammate: 1 Int in [-1, 3]. Which agent is this agent's teammate. In non-team variants, this is -1.
• Enemies: 3 Ints in [-1, 3]. Which agents are this agent's enemies. In team variants, the third int is -1.
• Bomb Blast Strength: List of Ints. The blast strength for each of the bombs in the agent's purview.
• Bomb Life: List of Ints. The remaining life for each of the bombs in the agent's purview.
• Message: 2 Ints in [0, 8]. The message being relayed from the teammate. Both Ints are zero only when the teammate is dead or it is the first step. This field is not included in variants without cheap talk.

The agent starts with one bomb ('ammo'). Every time it lays a bomb, its ammo decreases by one. After that bomb explodes, its ammo increases by one. The agent also has a blast strength, which starts at two. Every bomb it lays is imbued with the current blast strength, which determines how far in the vertical and horizontal directions that bomb's explosion reaches.

A bomb has a life of ten time steps. Upon expiration, the bomb explodes and any wooden walls, agents, power-ups, or other bombs within reach of its blast strength are destroyed. Bombs destroyed in this manner chain their explosions.

Power-Ups: Half of the wooden walls have hidden power-ups that are revealed when the wall is destroyed. These are:

• Extra Bomb: Picking this up increases the agent's ammo by one.
• Increase Range: Picking this up increases the agent's blast strength by one.
• Can Kick: Picking this up permanently allows an agent to kick bombs by moving into them. The bombs travel in the direction that the agent was moving, at one unit per time step, until they are impeded by a player, a bomb, or a wall.
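To make the interface concrete, the following is a minimal game-loop sketch modeled on the example script in our repository (Resnick et al. 2018b). The environment id 'PommeFFACompetition-v0' and the helper names mirror the repository's README at the time of writing but may differ across versions, so treat this as illustrative rather than canonical.

    import pommerman
    from pommerman import agents


    def main():
        # Four copies of the provided baseline agent in the FFA configuration.
        agent_list = [agents.SimpleAgent() for _ in range(4)]
        env = pommerman.make('PommeFFACompetition-v0', agent_list)

        state = env.reset()
        done = False
        while not done:
            env.render()
            # Each agent maps its observation (the fields listed above) to
            # one of the six discrete actions.
            actions = env.act(state)
            state, reward, done, info = env.step(actions)
        env.close()
        print(info)


    if __name__ == '__main__':
        main()

A learning agent slots into this loop by subclassing the base agent class and overriding its act method; the rest of the machinery stays the same.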
Early results

The environment has been public since late February, and the competitions were first announced in late March. In that time, we have seen a strong community gather around the game, with more than 500 people in the Discord server (https://discord.gg/mtW7kp) and more than half of the repository commits coming from open-source contributors.

There have also been multiple published papers using Pommerman (Resnick et al. 2018a; Zhou et al. 2018). These demonstrate that the environment is challenging and that we do not yet know what the optimal solutions are in any of the variants. In particular, the agents in (Resnick et al. 2018a) discover a novel way of playing where they treat the bombs as projectiles by laying them and then kicking them at opponents. This is a strategy that not even novice humans attempt, yet the agents use it to achieve a high success rate.

Preliminary analysis suggests that the game can be very challenging for reinforcement learning algorithms out of the box. Without a very large batch size and a shaped reward (Ng, Harada, and Russell 1999), neither Deep Q-Learning (Mnih et al. 2013) nor Proximal Policy Optimization (Schulman et al. 2017) learned to successfully play the game against the provided baseline agent ('SimpleAgent'). One reason for this is that the game has a (previously mentioned) unique feature in that the bomb action is highly correlated with losing but must be wielded effectively to win.

We also tested the effectiveness of DAgger (Daumé, Langford, and Marcu 2009) in bootstrapping agents to match the SimpleAgent. We found that, while somewhat sensitive to hyperparameter choices, it was nonetheless effective at yielding agents that could play at or above the FFA win rate of a single SimpleAgent (∼20%). This is below the 25% that chance would suggest because four SimpleAgents will draw a large percentage of the time.
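The 'shaped reward' above refers to potential-based shaping in the sense of Ng, Harada, and Russell (1999). The sketch below is a generic illustration of that idea applied to one agent's reward stream, not the specific shaping used in our experiments; the potential function and the lowercase observation keys ('ammo', 'blast_strength', 'can_kick') are assumptions for illustration.

    def potential(obs):
        # Hypothetical potential over one agent's observation dictionary.
        # Any function of the state alone preserves the optimal policy
        # under potential-based shaping; this one mildly rewards gathering
        # ammo, blast strength, and the kicking ability.
        return (0.1 * obs['ammo']
                + 0.1 * obs['blast_strength']
                + 0.2 * float(obs['can_kick']))


    def shape_reward(env_reward, prev_obs, obs, gamma=0.99):
        # Potential-based shaping (Ng, Harada, and Russell 1999):
        # r_shaped = r + gamma * phi(s') - phi(s)
        return env_reward + gamma * potential(obs) - potential(prev_obs)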
Competitions

In this section, we describe the Pommerman competitions. This includes both the upcoming NIPS 2018 event and the FFA competition that we have already run.

FFA competition

We ran a preliminary competition on June 3rd, 2018. We did not advertise this widely other than within our Discord social group (https://discord.gg/mtW7kp), nor did we have any prizes for it. Even so, we had a turnout of eight competitors who submitted working agents by the May 31st deadline.

The competition environment was the FFA variant (Resnick et al. 2018c), in which four agents enter and all are opponents. The top two agents were submitted by Görög Márton and a team led by Yichen Gong, with the latter being the strongest.

Görög's agent improved upon the repository's baseline agent through a number of edits. On the other hand, Yichen's agent was a redesign implementing a Finite State Machine Tree-Search approach (Zhou et al. 2018). They respectively won 8 and 22 of their 35 matches (with a number of the remaining matches being ties).

NIPS Competition

The NIPS competition will be held live at NIPS 2018, and competitors are required to submit a team of two agents by November 21st, 2018. The featured environment will be the partially observable team variant without communication. Otherwise, we will be reusing the machinery that we developed to run the FFA competition.

Submitting Agents

We run the competitions using Docker and expect submissions to be accompanied by a Docker file that we can build on the game servers. For FFA competitions, this entails submitting a (possibly private) repository containing one Docker file representing the agent. For team competitions, the submission should contain two Docker files representing the two agents. Instructions and an example for building Docker containers from trained agents can be found in our repository (Resnick et al. 2018b).

The agents should follow the convention specified in our example code and expose an 'act' endpoint that accepts the dictionary of observations. Because we are using Docker containers and HTTP requests, we do not have any requirements for programming language or framework.

The expected response from the agent is a single integer in [0, 5] representing which of the six actions the agent would like to take. In variants with messages, we also expect two more integers in [1, 8] representing the message. If an agent does not respond within the time limit for our competition constraints (100ms), then we will automatically issue it the Stop action and, if appropriate, have it send the message (0, 0). This timeout is an aspect of the competition and not native to the game itself.
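To illustrate the submission interface, here is a hedged sketch of an agent server that could sit behind a submitted Docker file. The '/action' route, the port, the JSON field names, and the use of Flask are assumptions made for illustration; the authoritative convention is the example code in our repository (Resnick et al. 2018b).

    import random

    from flask import Flask, jsonify, request

    app = Flask(__name__)


    @app.route('/action', methods=['POST'])
    def act():
        # The competition server POSTs the observation dictionary described
        # in the Game Information section (board, position, ammo, and so on).
        obs = request.get_json(force=True)  # ignored by this placeholder policy

        # Placeholder policy: pick uniformly among the six discrete actions.
        # A real submission would query its trained model here and must
        # answer within the 100ms competition time limit.
        action = random.randint(0, 5)

        # In message variants, two extra integers in [1, 8] would be
        # returned alongside the action.
        return jsonify({'action': action})


    if __name__ == '__main__':
        # The submission's Docker file would launch this server.
        app.run(host='0.0.0.0', port=10080)

Because the agent is just an HTTP service inside a container, any language or framework that can answer such requests within the time limit is acceptable.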
Conclusion

In this paper, we have introduced the Pommerman environment, detailed why it is a strong setup for multi-agent research, and described early results and competitions. All of the code is readily available at our git repository (github.com/MultiAgentLearning/playground), and further information about competitions, including NIPS 2018, is on our website (pommerman.com).

Acknowledgments

We are especially grateful to Roberta Raileanu, Sanyam Kapoor, Lucas Beyer, Stephan Uphoff, and the whole Pommerman Discord community for their contributions, as well as Jane Street, Facebook AI Research, Google Cloud, and NVidia Research for their sponsorship.

References

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res. 47(1):253–279.

Bomberman. 1983. Wikipedia: Bomberman. https://en.wikipedia.org/wiki/Bomberman.

Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2017. Heads-up limit hold'em poker is solved. Commun. ACM 60(11):81–88.

Brown, N., and Sandholm, T. 2017. Libratus: The superhuman AI for no-limit poker. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 5226–5228.

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In van den Herik, H. J.; Ciancarini, P.; and Donkers, H. H. L. M., eds., Computers and Games, volume 4630 of Lecture Notes in Computer Science, 72–83. Springer.

Daumé, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3):297–325.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009.

Evtimova, K.; Drozdov, A.; Kiela, D.; and Cho, K. 2017. Emergent language in a multi-modal, multi-step referential game. CoRR abs/1705.10369.

Fehervari, I., and Elmenreich, W. 2010. Evolving neural network controllers for a team of self-organizing robots. Journal of Robotics.

Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.

Foerster, J. N.; Nardelli, N.; Farquhar, G.; Torr, P. H. S.; Kohli, P.; and Whiteson, S. 2017. Stabilising experience replay for deep multi-agent reinforcement learning. CoRR abs/1702.08887.

Hausknecht, M.; Mupparaju, P.; Subramanian, S.; Kalyanakrishnan, S.; and Stone, P. 2016. Half field offense: An environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, 282–293. Berlin, Heidelberg: Springer-Verlag.

Lazaridou, A.; Hermann, K. M.; Tuyls, K.; and Clark, S. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations.

Levy, S. D., and Kirby, S. 2006. Evolving distributed representations for language with self-organizing maps. In Vogt, P.; Sugita, Y.; Tuci, E.; and Nehaniv, C. L., eds., EELC, volume 4211 of Lecture Notes in Computer Science, 57–71. Springer.

Lewis, M.; Yarats, D.; Dauphin, Y. N.; Parikh, D.; and Batra, D. 2017. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. NIPS Deep Learning Workshop 2013.

Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. H. 2017. DeepStack: Expert-level artificial intelligence in no-limit poker. CoRR abs/1701.01724.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.

Nardi, D.; Noda, I.; Ribeiro, F.; Stone, P.; von Stryk, O.; and Veloso, M. 2014. RoboCup soccer leagues. AI Magazine 35(3):77–85.

Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, 278–287.

OpenAI. 2018. Dota 2. https://blog.openai.com/dota-2/.

Resnick, C.; Raileanu, R.; Kapoor, S.; Peysakhovich, A.; Cho, K.; and Bruna, J. 2018a. Backplay: "Man muss immer umkehren". ArXiv e-prints.

Resnick, C.; Eldridge, W.; Britz, D.; and Ha, D. 2018b. Playground: AI research into multi-agent learning. https://github.com/MultiAgentLearning/playground.

Resnick, C.; Eldridge, W.; Britz, D.; and Ha, D. 2018c. Pommerman FFA competition environment. https://github.com/MultiAgentLearning/playground/blob/master/pommerman/configs.py#L20.

Resnick, C.; Kulikov, I.; Cho, K.; and Weston, J. 2018d. Vehicle community strategies. CoRR abs/1804.07178.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Steels, L. 1999. The Talking Heads Experiment. Volume 1. Words and Meanings. Antwerpen: Laboratorium.

Steels, L. 2003. Evolving grounded communication for robots. Trends in Cognitive Sciences 7:308–312.

Vodopivec, T.; Samothrakis, S.; and Šter, B. 2017. On Monte Carlo tree search and reinforcement learning. J. Artif. Int. Res. 60(1):881–936.

Zhou, H.; Gong, Y.; Mugrai, L.; Khalifa, A.; Nealen, A.; and Togelius, J. 2018. A hybrid search agent in Pommerman. In The International Conference on the Foundations of Digital Games (FDG).

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in Neural Information Processing Systems 20, 1729–1736. Curran Associates, Inc.