Autonomous Agents and Multi-Agent Systems

, Volume 33, Issue 4, pp 403–429 | Cite as

SA-IGA: a multiagent reinforcement learning method towards socially optimal outcomes

  • Chengwei Zhang
  • Xiaohong Li
  • Jianye HaoEmail author
  • Siqi Chen
  • Karl Tuyls
  • Wanli Xue
  • Zhiyong Feng


In multiagent environments, the capability of learning is important for an agent to behave appropriately in face of unknown opponents and dynamic environment. From the system designer’s perspective, it is desirable if the agents can learn to coordinate towards socially optimal outcomes, while also avoiding being exploited by selfish opponents. To this end, we propose a novel gradient ascent based algorithm (SA-IGA) which augments the basic gradient-ascent algorithm by incorporating social awareness into the policy update process. We theoretically analyze the learning dynamics of SA-IGA using dynamical system theory and SA-IGA is shown to have linear dynamics for a wide range of games including symmetric games. The learning dynamics of two representative games (the prisoner’s dilemma game and the coordination game) are analyzed in detail. Based on the idea of SA-IGA, we further propose a practical multiagent learning algorithm, called SA-PGA, based on Q-learning update rule. Simulation results show that SA-PGA agent can achieve higher social welfare than previous social-optimality oriented Conditional Joint Action Learner (CJAL) and also is robust against individually rational opponents by reaching Nash equilibrium solutions.


Multiagent reinforcement learning Social welfare Gradient ascent Nonlinear analysis 



The work is supported by the National Natural Science Foundation of China (Grant Nos. 61702362, U1836214, 61572349, 61872262, 61602391), Special Program of Artificial Intelligence, Tianjin Research Program of Application Foundation and Advanced Technology (No. 16JCQNJC00100), and Special Program of Artificial Intelligence of Tianjin Municipal Science and Technology Commission (No. 569 17ZXRGGX00150), and the Fundamental Research Funds for the Central Universities (No. 3132019207).


  1. 1.
    Abdallah, S., & Lesser, V. (2008). A multiagent reinforcement learning algorithm with non-linear dynamics. Journal of Artificial Intelligence Research, 33(1), 521–549.MathSciNetCrossRefzbMATHGoogle Scholar
  2. 2.
    Alvard, M. S. (2004) The ultimatum game, fairness, and cooperation among big game hunters. In Foundations of human sociality (pp. 413–435).Google Scholar
  3. 3.
    Andreoni, J., & Croson, R. (1998). Partners versus strangers: Random rematching in public goods experiments. Amsterdam: Elsevier.Google Scholar
  4. 4.
    Banerjee, B., & Peng, J. (2003). Adaptive policy gradient in multiagent learning. In International joint conference on autonomous agents and multiagent systems (pp. 686–692).Google Scholar
  5. 5.
    Banerjee, B., & Peng, J. (2004). The role of reactivity in multiagent learning. In International joint conference on autonomous agents and multiagent systems (pp. 538–545).Google Scholar
  6. 6.
    Banerjee, B., & Peng, J. (2005). Efficient learning of multi-step best response. In Proceedings of the fourth international joint conference on autonomous agents and multiagent systems (pp. 60–66).Google Scholar
  7. 7.
    Banerjee, D., & Sen, S. (2007). Reaching pareto optimality in Prisoner’s Dilemma using conditional joint action learning. In AAMAS’07 (pp. 91–108).Google Scholar
  8. 8.
    Bloembergen, D., Tuyls, K., Hennes, D., & Kaisers, M. (2015). Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53, 659–697.MathSciNetCrossRefzbMATHGoogle Scholar
  9. 9.
    Bowling, M. (2004). Convergence and no-regret in multiagent learning. In International conference on neural information processing systems (pp. 209–216).Google Scholar
  10. 10.
    Bowling, M. H., & Veloso, M. M. (2003). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, 215–250.MathSciNetCrossRefzbMATHGoogle Scholar
  11. 11.
    Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 156–172.CrossRefGoogle Scholar
  12. 12.
    Chakraborty, D., & Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-agent Systems, 28(2), 182–213.CrossRefGoogle Scholar
  13. 13.
    Coddington, E. A., & Levinson, N. (1955). Theory of ordinary differential equations. New York: McGraw-Hill.zbMATHGoogle Scholar
  14. 14.
    Conitzer, V., & Sandholm, T. (2007). Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.CrossRefGoogle Scholar
  15. 15.
    Crandall, J. W. (2013). Just add pepper: Extending learning algorithms for repeated matrix games to repeated Markov games. In International conference on autonomous agents and multiagent systems (pp. 399–406).Google Scholar
  16. 16.
    Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2017). Learning with opponent-learning awareness. CoRR arXiv:1709.04326.
  17. 17.
    Hauert, C., & Szab, G. (2003). Prisoner’s Dilemma and public goods games in different geometries: Compulsory versus voluntary interactions. Complexity, 8(4), 31–38.MathSciNetCrossRefGoogle Scholar
  18. 18.
    Hu, J., & Wellman, M. P. (2003). Nash q-learning for general-sum stochastic games. The Journal of Machine Learning Research, 4, 1039–1069.MathSciNetzbMATHGoogle Scholar
  19. 19.
    Hughes, E., Leibo, J. Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A.G., et al. (2018). Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in neural information processing systems (pp. 3330–3340).Google Scholar
  20. 20.
    Lauer, M., & Rienmiller, M. (2000). An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In ICML’00 (pp. 535–542).Google Scholar
  21. 21.
    Littman, M. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th international conference on machine learning, (pp. 322–328).Google Scholar
  22. 22.
    Littman, M. L. (2001). Friend-or-foe q-learning in general-sum games. In ICML (Vol. 1, pp. 322–328).Google Scholar
  23. 23.
    Matignon, L., Laurent, G. J., & Le Fort-Piat, N. (2012). Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. The Knowledge Engineering Review, 27(01), 1–31.CrossRefGoogle Scholar
  24. 24.
    Peysakhovich, A., & Lerer, A. (2017). Prosocial learning agents solve generalized stag hunts better than selfish ones. CoRR arXiv:1709.02865.
  25. 25.
    Powers, R., & Shoham, Y. (2005). Learning against opponents with bounded memory. In IJCAI (Vol. 5, pp. 817–822).Google Scholar
  26. 26.
    Rodrigues Gomes, E., & Kowalczyk, R. (2009). Dynamic analysis of multiagent q-learning with \(\varepsilon \)-greedy exploration. In Proceedings of the 26th annual international conference on machine learning (pp. 369–376). ACM.Google Scholar
  27. 27.
    Shilnikov, L. P., Shilnikov, A. L., Turaev, D. V., & Chua, L. O. (2001). Methods of qualitative theory in nonlinear dynamics (Vol. 5). Singapore: World Scientific.CrossRefzbMATHGoogle Scholar
  28. 28.
    Shivshankar, S., & Jamalipour, A. (2015). An evolutionary game theory-based approach to cooperation in vanets under different network conditions. IEEE Transactions on Vehicular Technology, 64(5), 2015–2022.CrossRefGoogle Scholar
  29. 29.
    Singh, S., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 541–548). Morgan Kaufmann.Google Scholar
  30. 30.
    Tuyls, K., Hoen, P. J., & Vanschoenwinkel, B. (2006). An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-agent Systems, 12(1), 115–153.CrossRefGoogle Scholar
  31. 31.
    Tuyls, K., Verbeeck, K., & Lenaerts, T. (2003). A selection-mutation model for q-learning in multi-agent systems. In Proceedings of the second international joint conference on autonomous agents and multiagent systems (pp. 693–700). ACM.Google Scholar
  32. 32.
    Vohra, R. V., & Wellman, M. P. (2007). Foundations of multi-agent learning. Artificial Intelligence, 171(7), 363–452.MathSciNetCrossRefzbMATHGoogle Scholar
  33. 33.
    Watkins, C. J. C. H. (1989). Learning from delayed rewards. Robotics & Autonomous Systems, 15(4), 233–235.Google Scholar
  34. 34.
    Watkins, C. J. C. H., & Dayan, P. D. (1992). Q-learning. Machine Learning, 8, 279–292.zbMATHGoogle Scholar
  35. 35.
    Wei, G., Zhu, P., Vasilakos, A. V., & Mao, Y. (2013). Cooperation dynamics on collaborative social networks of heterogeneous population. IEEE Journal on Selected Areas in Communications, 31(6), 1135–1146.CrossRefGoogle Scholar
  36. 36.
    Zhang, C., & Lesser, V. R. (2010). Multi-agent learning with policy prediction. In Proceedings of the twenty-fourth AAAI conference on artificial intelligence (pp. 927–934).Google Scholar
  37. 37.
    Zhang, Z., Zhao, D., Gao, J., Wang, D., & Dai, Y. (2017). Fmrq—A multiagent reinforcement learning algorithm for fully cooperative tasks. IEEE Transactions on Cybernetics, 47(6), 1367–1379.CrossRefGoogle Scholar
  38. 38.
    Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In ICML (pp. 928–936).Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.College of Intelligence and ComputingTianjin UniversityTianjinChina
  2. 2.College of Information Science and TechnologyDalian Maritime UniversityDalianChina
  3. 3.University of LiverpoolLiverpoolUK
  4. 4.School of Computer Science and EngineeringTianjin University of TechnologyTianjinChina

Personalised recommendations