Multi-agent deep reinforcement learning with type-based hierarchical group communication

Abstract

Real-world multi-agent tasks often involve agents of varying types and numbers. The complex interaction relationships among these agents make policy learning difficult, because agents must learn many interaction types to complete a given task. Simplifying the learning process is therefore an important issue. In multi-agent systems, agents of a similar type tend to interact more with each other and to exhibit more similar behaviors, which means that collaboration among them is stronger. Most existing multi-agent reinforcement learning (MARL) algorithms try to learn the collaborative strategies of all agents directly in order to maximize the common reward, so the difficulty of policy learning grows exponentially with the number and types of agents. To address this problem, we propose a type-based hierarchical group communication (THGC) model. THGC uses prior domain knowledge or predefined rules to group agents and maintains each group's cognitive consistency through knowledge sharing. We then introduce a group communication and value decomposition method to ensure cooperation among the groups. Experiments demonstrate that our model outperforms state-of-the-art MARL methods on the widely adopted StarCraft II benchmarks across different scenarios, and also shows potential value for large-scale real-world applications.



References

  1. Bear A, Kagan A, Rand DG (2017) Co-evolution of cooperation and cognition: the impact of imperfect deliberation and context-sensitive intuition. Proc Royal Soc B Biol Sci 284(1851):20162326

  2. Bresciani P, Perini A, Giorgini P, Giunchiglia F, Mylopoulos J (2004) Tropos: an agent-oriented software development methodology. Auton Agents Multi-Agent Syst 8(3):203–236

  3. Butler E (2012) The condensed wealth of nations. Centre for Independent Studies

  4. Carion N, Usunier N, Synnaeve G, Lazaric A (2019) A structured prediction approach for generalization in cooperative multi-agent reinforcement learning. In: Advances in neural information processing systems, pp 8130–8140

  5. Chen Y, Zhou M, Wen Y, Yang Y, Su Y, Zhang W, Zhang D, Wang J, Liu H (2018) Factorized Q-learning for large-scale multi-agent systems. arXiv:1809.03738

  6. Chuang L, Chao X, Jie H, Wenzhuo L, et al (2017) Hierarchical architecture design of computer system. Chinese J Comput 40(09):1996–2017

  7. Clevert DA, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289

  8. Cossentino M, Gaglio S, Sabatucci L, Seidita V (2005) The PASSI and agile PASSI MAS meta-models compared with a unifying proposal. In: International central and eastern European conference on multi-agent systems. Springer, pp 183–192

  9. Cossentino M, Hilaire V, Molesini A, Seidita V (2014) Handbook on agent-oriented design processes. Springer, Berlin

  10. Das A, Gervet T, Romoff J, Batra D, Parikh D, Rabbat M, Pineau J (2018) TarMAC: targeted multi-agent communication. arXiv:1810.11187

  11. Dugas C, Bengio Y, Bélisle F, Nadeau C, Garcia R (2009) Incorporating functional knowledge in neural networks. J Mach Learn Res 10(Jun):1239–1262

  12. Foerster JN, Farquhar G, Afouras T, Nardelli N, Whiteson S (2018) Counterfactual multi-agent policy gradients. In: Thirty-second AAAI conference on artificial intelligence

  13. Gordon DM (1996) The organization of work in social insect colonies. Nature 380(6570):121–124

  14. Ha D, Dai A, Le QV (2016) Hypernetworks. arXiv:1609.09106

  15. Henriques R, Madeira SC (2016) BicNET: flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 11(1):14

  16. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  17. Iqbal S, Sha F (2018) Actor-attention-critic for multi-agent reinforcement learning. arXiv:1810.02912

  18. Jeanson R, Kukuk PF, Fewell JH (2005) Emergence of division of labour in halictine bees: contributions of social interactions and behavioural variance. Anim Behav 70(5):1183–1193

  19. Jiang J, Dun C, Lu Z (2018) Graph convolutional reinforcement learning for multi-agent cooperation. arXiv:1810.09202

  20. Jiang J, Lu Z (2018) Learning attentional communication for multi-agent cooperation. In: Advances in neural information processing systems, pp 7254–7264

  21. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980

  22. Liu Y, Hu Y, Gao Y, Chen Y, Fan C (2019) Value function transfer for deep multi-agent reinforcement learning based on N-step returns. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, pp 457–463

  23. Liu Y, Wang W, Hu Y, Hao J, Chen X, Gao Y (2019) Multi-agent game abstraction via graph attention neural network. arXiv:1911.10715

  24. Long Q, Zhou Z, Gupta A, Fang F, Wu Y, Wang X (2020) Evolutionary population curriculum for scaling multi-agent reinforcement learning. arXiv:2003.10423

  25. Lowe R, Wu YI, Tamar A, Harb J, Abbeel OP, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in neural information processing systems, pp 6379–6390

  26. Mao H, Liu W, Hao J, Luo J, Li D, Zhang Z, Wang J, Xiao Z (2019) Neighborhood cognition consistent multi-agent reinforcement learning. arXiv:1912.01160

  27. Melo FS, Veloso M (2011) Decentralized MDPs with sparse interactions. Artif Intell 175(11):1757–1789

  28. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533

  29. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: ICML

  30. Oliehoek FA, Amato C (2016) A concise introduction to decentralized POMDPs, vol 1. Springer, Berlin

  31. Oroojlooyjadid A, Hajinezhad D (2019) A review of cooperative multi-agent deep reinforcement learning. arXiv:1908.03963

  32. Pal SK, Mitra S (1992) Multilayer perceptron, fuzzy sets, and classification. IEEE Trans Neural Netw 3(5):683–697

  33. Ryu H, Shin H, Park J (2020) Multi-agent actor-critic with hierarchical graph attention network. In: AAAI, pp 7236–7243

  34. Samvelyan M, Rashid T, de Witt CS, Farquhar G, Nardelli N, Rudner TG, Hung CM, Torr PH, Foerster J, Whiteson S (2019) The StarCraft multi-agent challenge. arXiv:1902.04043

  35. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv:1707.06347

  36. Singh A, Jain T, Sukhbaatar S (2018) Learning when to communicate at scale in multiagent cooperative and competitive tasks. arXiv:1812.09755

  37. Son K, Kim D, Kang WJ, Hostallero DE, Yi Y (2019) QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv:1905.05408

  38. Stone P, Veloso M (2000) Multiagent systems: a survey from a machine learning perspective. Auton Robot 8(3):345–383

  39. Sukhbaatar S, Fergus R, et al (2016) Learning multiagent communication with backpropagation. In: Advances in neural information processing systems, pp 2244–2252

  40. Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi V, Jaderberg M, Lanctot M, Sonnerat N, Leibo JZ, Tuyls K, et al (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv:1706.05296

  41. Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems, pp 1057–1063

  42. Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2017) Graph attention networks. arXiv:1710.10903

  43. Wang W, Yang T, Liu Y, Hao J, Hao X, Hu Y, Chen Y, Fan C, Gao Y (2020) From few to more: large-scale dynamic multiagent curriculum learning. In: AAAI, pp 7293–7300

  44. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560

  45. Rashid T, Samvelyan M, de Witt CS, Farquhar G, Foerster J, Whiteson S (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In: ICML

  46. Wooldridge M, Jennings NR, Kinny D (2000) The Gaia methodology for agent-oriented analysis and design. Auton Agents Multi-Agent Syst 3(3):285–312

  47. Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J (2018) Mean field multi-agent reinforcement learning. arXiv:1802.05438

  48. Yu C, Zhang M, Ren F, Tan G (2015) Multiagent learning of coordination in loosely coupled multiagent systems. IEEE Trans Cybern 45(12):2853–2867

  49. Zhang Z, Yang J, Zha H (2019) Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization. arXiv:1909.10651


Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant No.2017YFB1001901, in part by the Key Program of Tianjin Science and Technology Development Plan under Grant No.18ZXZNGX00120 and in part by the China Postdoctoral Science Foundation under Grant No.2018M643900.

Author information


Corresponding author

Correspondence to Dianxi Shi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Environment details

We follow the settings of SMAC [34], which are described in full in the SMAC paper. For clarity and completeness, we restate the relevant environment details here.

A.1 States and observations

At each time step, agents receive local observations within their field of view. Each observation covers a circular area around the unit with a radius equal to the sight range, which is set to 9. The limited sight range makes the environment partially observable for each agent. An agent can observe other units only if they are alive and located within its sight range; hence, agents cannot distinguish between teammates that are far away and teammates that are dead. If a unit (ally or enemy) is dead or outside an agent's sight range, its unit feature vector is reset to all zeros. The feature vector observed by each agent contains the following attributes for both allied and enemy units within the sight range: distance, relative x, relative y, health, shield, and unit type. If the agents are homogeneous, the unit type feature is omitted. All Protoss units have shields, which serve as a source of protection that offsets damage and regenerates if no new damage is received. Lastly, agents can observe the terrain features surrounding them, in particular the height and walkability values of eight points at a fixed radius.

The global state is composed of the joint unit features of both allied and enemy soldiers. Specifically, the state vector includes the coordinates of all agents relative to the center of the map, together with the unit features present in the observations. Additionally, the state stores the energy/cooldown of the allied units, depending on the unit type, which represents the minimum delay between attacks/healing actions. All features, both in the global state and in the individual observations, are normalized by their maximum values.
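As an illustration, the per-unit observation slot described above can be sketched as follows. This is a minimal sketch, not the SMAC source: the function name, the dict-based unit representation, and the exact normalization constants are our own assumptions.

```python
import numpy as np

SIGHT_RANGE = 9.0
N_FEATURES = 6  # distance, rel_x, rel_y, health, shield, unit_type

def unit_features(agent_pos, unit, max_health, max_shield, n_types):
    """Return the 6-dim feature slot for one observed unit, or all zeros
    if the unit is dead or outside the observer's sight range."""
    dx = unit["x"] - agent_pos[0]
    dy = unit["y"] - agent_pos[1]
    dist = np.hypot(dx, dy)
    if unit["health"] <= 0 or dist > SIGHT_RANGE:
        return np.zeros(N_FEATURES)          # dead or unseen -> zeroed slot
    return np.array([
        dist / SIGHT_RANGE,                  # distance, normalized by sight range
        dx / SIGHT_RANGE,                    # relative x
        dy / SIGHT_RANGE,                    # relative y
        unit["health"] / max_health,         # health, normalized by maximum
        unit["shield"] / max_shield,         # shield (Protoss units)
        unit["type"] / max(n_types - 1, 1),  # unit type (omitted if homogeneous)
    ])
```

Note that a dead unit and an unseen unit produce the same all-zero slot, which is exactly why agents cannot tell distant teammates from dead ones.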

A.2 Action space

The discrete set of actions agents are allowed to take consists of move[direction], attack[enemy id], stop, and no-op. Dead agents can take only the no-op action, while living agents cannot take it. Agents can only move with a fixed movement amount of 2 in four directions: north, south, east, or west. To ensure decentralization of the task, agents are restricted to using the attack[enemy id] action only towards enemies within their shooting range. This additionally constrains the units' ability to use the built-in attack-move micro-actions on enemies that are far away. The shooting range is set to 6 for all agents. Having a sight range larger than the shooting range allows agents to make use of move commands before starting to fire. The built-in behavior of automatically responding to enemy fire without being explicitly ordered is also disabled. As healer units, Medivacs use heal[agent id] actions instead of attack[enemy id].
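The availability rules above can be sketched as an action mask. The action ordering, helper name, and unit representation here are assumptions for illustration; SMAC's actual implementation differs in detail.

```python
import numpy as np

SHOOT_RANGE = 6.0

def available_actions(agent, enemies):
    """Boolean mask over [no-op, stop, move N/S/E/W, attack enemy_0..n-1]."""
    n_actions = 6 + len(enemies)
    mask = np.zeros(n_actions, dtype=bool)
    if agent["health"] <= 0:
        mask[0] = True          # dead agents may only take no-op
        return mask
    mask[1:6] = True            # stop and the four move directions
    for i, e in enumerate(enemies):
        dist = np.hypot(e["x"] - agent["x"], e["y"] - agent["y"])
        if e["health"] > 0 and dist <= SHOOT_RANGE:
            mask[6 + i] = True  # attack only living enemies in shooting range
    return mask
```

During action selection, invalid actions are typically masked out of the Q-values before the argmax, so the policy never emits an unavailable action.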

A.3 Rewards

At each time step, the agents receive a joint reward equal to the total damage dealt to enemy units. In addition, the agents receive a bonus of 10 points for killing each opponent and 200 points for killing all opponents and winning the battle. The rewards are scaled so that the maximum cumulative reward achievable in each scenario is around 20.
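The reward shaping above can be sketched as follows. `shaped_reward` and `max_reward` are hypothetical names, not SMAC's actual code; `max_reward` stands for the best achievable raw return in a scenario (total enemy health plus kill and win bonuses).

```python
def shaped_reward(damage_dealt, kills, won, max_reward):
    """Joint reward: damage dealt + 10 per kill + 200 for winning,
    scaled so the best achievable episode return is 20."""
    raw = damage_dealt + 10 * kills + (200 if won else 0)
    return 20 * raw / max_reward
```

For example, in a scenario with 3 enemies of 100 health each, `max_reward = 300 + 10*3 + 200 = 530`, and fully clearing the map yields the scaled maximum of 20.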

Appendix B: Training configurations

Training takes about 14 to 24 hours per map (Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, 32 GB RAM, Nvidia GTX 1050 GPU), varying with the number of agents and the features of each map. The total number of training steps is about 2 million, and we evaluate the model every 10 thousand steps. During training, a batch of 32 episodes is sampled from a replay buffer containing the most recent 1000 episodes. We use an ε-greedy policy for exploration: the exploration rate starts at 1, ends at 0.05, and decays linearly over the first 50 thousand steps. We keep the default configurations of the environment parameters. Hyperparameters follow the PyMARL [34] implementation of QMIX and are listed in Table 3. All hyperparameters are the same across all StarCraft II scenarios.
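The linear ε-annealing schedule described above could be implemented as in this sketch (the function name and signature are our own):

```python
def epsilon(step, start=1.0, end=0.05, anneal_steps=50_000):
    """Linearly anneal the exploration rate from `start` to `end`
    over the first `anneal_steps` environment steps, then hold."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```

At step 0 the agent acts fully randomly (ε = 1.0); by step 50,000 and beyond it explores with probability 0.05, matching the configuration stated above.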

Table 3 Hyperparameter settings across all runs and algorithms/baselines


About this article


Cite this article

Jiang, H., Shi, D., Xue, C. et al. Multi-agent deep reinforcement learning with type-based hierarchical group communication. Appl Intell (2021). https://doi.org/10.1007/s10489-020-02065-9


Keywords

  • Multi-agent reinforcement learning
  • Group cognitive consistency
  • Group communication
  • Value decomposition