Off-Policy Integral Reinforcement Learning Method for Multi-player Non-zero-Sum Games

  • Ruizhuo Song
  • Qinglai Wei
  • Qing Li
Part of the Studies in Systems, Decision and Control book series (SSDC, volume 166)


This chapter establishes an off-policy integral reinforcement learning (IRL) method for solving nonlinear continuous-time non-zero-sum (NZS) games with unknown system dynamics. The IRL algorithm is used to obtain the iterative control, and off-policy learning allows the system dynamics to be completely unknown. Off-policy IRL carries out both the policy evaluation and the policy improvement steps of the policy iteration (PI) algorithm. Critic and action networks approximate the performance index and the control of each player, and a gradient descent algorithm updates the critic and action weights simultaneously. A convergence analysis of the weights is given, and the asymptotic stability of the closed-loop system and the existence of a Nash equilibrium are proven. A simulation study demonstrates the effectiveness of the developed method for nonlinear continuous-time NZS games with unknown system dynamics.
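
For orientation, the quantities named in the abstract can be written in the form commonly used for N-player NZS differential games. This is a sketch under assumed notation, not the chapter's own equations: the symbols f, g_j, Q_i, R_{ij}, V_i, the interval length T and the iteration index k are all assumptions introduced here for illustration.

```latex
% Assumed N-player dynamics (f and g_j are the unknown quantities)
\dot{x} = f(x) + \sum_{j=1}^{N} g_j(x)\, u_j

% Assumed performance index of player i
J_i(x_0, u_1, \dots, u_N)
  = \int_{0}^{\infty} \Big( Q_i(x) + \sum_{j=1}^{N} u_j^{\top} R_{ij}\, u_j \Big)\, dt

% Nash equilibrium: no player can lower its own cost by deviating alone
J_i(u_1^{*}, \dots, u_i^{*}, \dots, u_N^{*})
  \le J_i(u_1^{*}, \dots, u_i, \dots, u_N^{*})
  \quad \text{for all admissible } u_i,\ i = 1, \dots, N

% Policy evaluation in integral (IRL) form over an interval of length T
V_i^{(k)}\big(x(t)\big)
  = \int_{t}^{t+T} \Big( Q_i(x) + \sum_{j=1}^{N} u_j^{(k)\top} R_{ij}\, u_j^{(k)} \Big)\, d\tau
  + V_i^{(k)}\big(x(t+T)\big)

% Standard (model-based) policy improvement for player i
u_i^{(k+1)} = -\tfrac{1}{2}\, R_{ii}^{-1}\, g_i^{\top}(x)\, \nabla V_i^{(k)}(x)
```

The integral form of policy evaluation does not involve the drift f, while the standard policy improvement step still involves g_i. According to the abstract, off-policy learning is what allows the dynamics to be completely unknown; how the chapter achieves this is not reproduced in this sketch.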


Copyright information

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. University of Science and Technology Beijing, Beijing, China
  2. Institute of Automation, Chinese Academy of Sciences, Beijing, China
