Reachability and Differential Based Heuristics for Solving Markov Decision Processes

  • Shoubhik Debnath
  • Lantao Liu
  • Gaurav Sukhatme
Conference paper
Part of the Springer Proceedings in Advanced Robotics book series (SPAR, volume 10)


The solution convergence of Markov Decision Processes (MDPs) can be accelerated by prioritized sweeping, in which states are ranked by their potential impact on other states. In this paper, we present new heuristics to speed up the solution convergence of MDPs. First, we quantify the level of reachability of every state using the Mean First Passage Time (MFPT) and show that this reachability characterization assesses the importance of states well, which makes it an effective basis for state prioritization. Then, we introduce the notion of backup differentials as an extension to the prioritized sweeping mechanism, in order to evaluate the impact of states at an even finer scale. Finally, we extend the state prioritization to the temporal process, where only partial sweeping can be performed during certain intermediate value iteration stages. To validate our design, we performed numerical evaluations comparing the proposed heuristics with the corresponding classic baseline mechanisms. The results show that our reachability-based framework and its differential variants outperform the state-of-the-art solutions in terms of both practical runtime and number of iterations.
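To make the reachability idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small Markov chain induced by some fixed policy, with a single absorbing goal state, and computes the MFPT to that goal by fixed-point iteration on m_i = 1 + Σ_{j ≠ goal} P_ij m_j. The function names `mean_first_passage_time` and `sweep_order`, the toy transition matrix, and the choice to sweep states in ascending MFPT order are all illustrative assumptions.

```python
def mean_first_passage_time(P, goal, tol=1e-12, max_iters=100_000):
    """Expected number of steps to first reach `goal` from each state.

    P is a row-stochastic transition matrix (list of lists) for a fixed
    policy. Assumes the goal is reachable from every state, so the
    iteration converges.
    """
    n = len(P)
    m = [0.0] * n  # m[goal] stays 0 by definition
    for _ in range(max_iters):
        delta = 0.0
        for i in range(n):
            if i == goal:
                continue
            # One step is always taken; then continue from the successor,
            # unless the successor is the goal (contributes 0 further steps).
            new = 1.0 + sum(P[i][j] * m[j] for j in range(n) if j != goal)
            delta = max(delta, abs(new - m[i]))
            m[i] = new  # in-place (Gauss-Seidel style) update
        if delta < tol:
            break
    return m


def sweep_order(mfpt):
    """Hypothetical prioritization: visit states with smaller MFPT first."""
    return sorted(range(len(mfpt)), key=lambda s: mfpt[s])


# Toy 4-state random walk on a line; state 3 is the absorbing goal.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
mfpt = mean_first_passage_time(P, goal=3)
order = sweep_order(mfpt)
print([round(x, 6) for x in mfpt])  # approximately [12.0, 10.0, 6.0, 0.0]
print(order)                        # [3, 2, 1, 0]
```

In this toy chain, states closer to the goal have a smaller MFPT and would be backed up first. How the MFPT ranking is combined with backup differentials and partial sweeping is specific to the paper and is not modeled here.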



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. NVIDIA Corporation, Santa Clara, USA
  2. Intelligent Systems Engineering Department, Indiana University, Bloomington, USA
