Pommerman & NeurIPS 2018
Pommerman is an exciting new environment for multi-agent research based on the classic game Bomberman. This publication covers its inaugural NeurIPS competition (the second Pommerman competition overall), held at NeurIPS 2018 and featuring the 2v2 team environment.
In the first chapter, the first section familiarizes the audience with the game and its nuances, and the second section describes the competition and the results. In the remaining chapters, we then move on to the competitors’ descriptions in order of competition result.
Chapters two and four describe two agents made by colleagues at IBM. Chapter four’s dynamic Pommerman (dypm) agent is a particular implementation of real-time tree search with pessimistic scenarios, where standard tree search is limited to a specified depth, but each leaf is evaluated under a deterministic and pessimistic scenario. Unlike standard tree search, evaluation under the deterministic scenario involves no branching, so it can efficiently take into account significant events that the agent may encounter far in the future. The pessimistic scenario is generated by assuming very strong enemies, and the level of pessimism is tuned via self-play. Using these techniques, the dypm agent can meet the real-time constraint even when implemented in Python. Chapter two’s agent is similar, but uses a real-time search tree to evaluate moves, followed by self-play for tuning.
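The idea of combining a shallow branching search with a non-branching pessimistic leaf evaluation can be illustrated with a toy sketch. This is not the dypm implementation: the one-dimensional corridor, the function names, the scoring rule (counting cells the agent could still safely occupy), and the way pessimism is modelled (the enemy is assumed to occupy every cell it could reach for `pessimism` steps) are all assumptions made for illustration. For brevity the branching part holds the enemy still.

```python
WIDTH = 9             # hypothetical corridor with cells 0..8
ACTIONS = (-1, 0, 1)  # move left, stay, move right


def expand(cells):
    """Grow a set of cells by one step in each direction, staying on the board."""
    grown = set()
    for c in cells:
        grown.update(n for n in (c - 1, c, c + 1) if 0 <= n < WIDTH)
    return grown


def pessimistic_value(agent, enemy, horizon, pessimism):
    """Deterministic leaf evaluation with no branching.

    The threat zone grows as if the enemy were everywhere it could reach
    (for `pessimism` steps); the score sums, over the horizon, the number
    of cells the agent could still occupy outside that zone.
    """
    safe, threat, score = {agent}, {enemy}, 0
    for t in range(horizon):
        if t < pessimism:
            threat = expand(threat)
        safe = expand(safe) - threat
        score += len(safe)
    return score


def search(agent, enemy, depth, horizon, pessimism):
    """Depth-limited search over the agent's moves; returns (value, action)."""
    if depth == 0:
        return pessimistic_value(agent, enemy, horizon, pessimism), 0
    best = (-1, 0)
    for action in ACTIONS:
        nxt = agent + action
        if not 0 <= nxt < WIDTH or nxt == enemy:
            continue  # illegal move: off the board or into the enemy
        value, _ = search(nxt, enemy, depth - 1, horizon, pessimism)
        if value > best[0]:
            best = (value, action)
    return best


if __name__ == "__main__":
    print(search(agent=4, enemy=6, depth=2, horizon=4, pessimism=2))
```

In this toy setting the leaf evaluation costs time linear in the horizon, so a long horizon can be covered without enlarging the branching part of the tree. The `pessimism` parameter plays the role of a tunable level of pessimism.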
Chapter three’s Eisenach agent placed second in the Pommerman Team Competition, matching the performance of its predecessor in the earlier free-for-all competition. The chosen framework was online minimax tree search with a fast C++ simulator, which enabled deeper search within the allowed 0.1 s. Several tactics were successfully applied to reduce the number of ties and avoid repeating situations. These helped make games denser and more exciting while increasing the measured difference between agents. Bayes-based cost optimization was also applied, but it did not prove useful. The resulting agent passed the first three rounds of the competition without a tie or defeat and even won some of its matches against the overall winner.
Chapter five features the Navocado agent. It was trained using the Advantage Actor-Critic (A2C) algorithm and guided by the Continual Match Based Training (COMBAT) framework. The agent first transformed the original continuous state representations into discrete ones, which made it easier for the deep model to learn. Then, a new action space was proposed that allowed the agent to use its proposed destination as an action, enabling longer-term planning. Finally, the COMBAT framework allowed it to define adaptive rewards for different game stages. The Navocado agent was the top learning agent in the competition.
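One way a destination-style action could be turned into a primitive move is a shortest-path lookup on the board. The sketch below is only an illustration of that idea, not the Navocado agent's method: the grid encoding (`"#"` for impassable cells), the move names, and the function `first_move_toward` are hypothetical.

```python
from collections import deque

# Hypothetical primitive moves as (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def first_move_toward(grid, start, goal):
    """Return the first move on a shortest path from start to goal.

    Uses breadth-first search over passable cells. Returns "stop" when the
    agent is already at the goal or the goal cannot be reached.
    """
    if start == goal:
        return "stop"
    rows, cols = len(grid), len(grid[0])
    first = {start: None}  # cell -> first move of the path that reached it
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for name, (dr, dc) in MOVES.items():
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == "#" or nxt in first:
                continue
            # Inherit the first move from the parent, or start a new path.
            first[nxt] = first[(r, c)] or name
            if nxt == goal:
                return first[nxt]
            queue.append(nxt)
    return "stop"


if __name__ == "__main__":
    board = ["..#",
             ".#.",
             "..."]
    print(first_move_toward(board, (0, 0), (2, 2)))
```

With such a mapping, a policy can choose among board cells while a fixed routine handles the step-by-step movement, which is one plausible reading of how a destination action supports longer-term planning.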
Finally, chapter six features the nn_team_skynet955_skynet955 agent, which ranked second among learning agents and fifth overall. Equipped with an automatic module for action pruning, this agent was trained directly by end-to-end deep reinforcement learning in the partially observable team environment, against a curriculum of opponents and with reward shaping. A single trained neural network model was selected to form a team for the competition. This chapter discusses the difficulty of Pommerman as a benchmark for model-free reinforcement learning and describes the core elements upon which the agent was built.
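As a rough illustration of what an action-pruning module might do, the sketch below filters out moves that leave the board, walk into a wall, or enter a cell marked as dangerous. It is a minimal sketch under stated assumptions, not the agent's actual module: the grid encoding, the `danger` set, the move names, and the fallback to `"stop"` are all hypothetical.

```python
# Hypothetical primitive moves as (row delta, column delta).
MOVES = {
    "stop": (0, 0),
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def prune_actions(grid, position, danger):
    """Return the moves that stay on the board and avoid walls and danger cells.

    `grid` is a list of strings where "#" marks an impassable cell, and
    `danger` is a set of (row, col) cells assumed unsafe on the next step.
    Falls back to ["stop"] so the policy always has at least one action.
    """
    rows, cols = len(grid), len(grid[0])
    kept = []
    for name, (dr, dc) in MOVES.items():
        r, c = position[0] + dr, position[1] + dc
        if not (0 <= r < rows and 0 <= c < cols):
            continue  # off the board
        if grid[r][c] == "#":
            continue  # blocked by a wall
        if (r, c) in danger:
            continue  # would enter an unsafe cell
        kept.append(name)
    return kept or ["stop"]


if __name__ == "__main__":
    board = [".#.",
             "...",
             "..."]
    print(prune_actions(board, (1, 1), danger={(1, 1), (1, 2)}))
```

The learned policy would then choose only among the surviving moves, so that exploration is not spent on actions that are immediately fatal or have no effect.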