
Monte Carlo Tree Search for Verifying Reachability in Markov Decision Processes

  • Pranav Ashok
  • Tomáš Brázdil
  • Jan KřetínskýEmail author
  • Ondřej Slámečka
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11245)

Abstract

The maximum reachability probabilities in a Markov decision process can be computed using value iteration (VI). Recently, simulation-based heuristic extensions of VI have been introduced, such as bounded real-time dynamic programming (BRTDP), which often avoid explicit analysis of the whole state space while preserving guarantees on the computed result. In this paper, we introduce a new class of such heuristics based on Monte Carlo tree search (MCTS), a technique celebrated in various machine-learning settings. We provide a spectrum of algorithms ranging from MCTS to BRTDP. We evaluate these techniques and show that for larger examples, where VI is no longer applicable, our techniques are more broadly applicable than BRTDP, at the cost of only a minor additional overhead.
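
To make the baseline concrete, the sketch below shows plain value iteration for maximum reachability in a small MDP. It is an illustration only, not code from the paper: the dictionary encoding of the MDP, the toy model, and the stopping threshold are assumptions made for the example.

```python
# A minimal sketch of classic value iteration for maximum reachability,
# NOT the paper's implementation. The MDP encoding, the toy model, and
# the epsilon threshold are illustrative assumptions.

def value_iteration(mdp, targets, epsilon=1e-6):
    """Iterate V(s) := max_a sum_{s'} P(s, a, s') * V(s') until the
    largest per-sweep change drops below epsilon; target states stay 1.
    Note: this naive stopping criterion gives no formal error guarantee
    in general, which is what bound-maintaining methods such as BRTDP
    and interval iteration address."""
    values = {s: (1.0 if s in targets else 0.0) for s in mdp}
    while True:
        delta = 0.0
        for state, actions in mdp.items():
            if state in targets:
                continue
            # Bellman update: best action under the current estimate.
            best = max(
                sum(p * values[succ] for p, succ in transitions)
                for transitions in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < epsilon:
            return values

# Toy MDP: from s0, action 'a' reaches the goal with probability 0.5 and
# otherwise loops; 'b' stays in s0. Max reachability from s0 is 1.
mdp = {
    "s0": {"a": [(0.5, "goal"), (0.5, "s0")], "b": [(1.0, "s0")]},
    "goal": {"stay": [(1.0, "goal")]},
}
print(value_iteration(mdp, {"goal"})["s0"])  # approximately 1.0
```

The heuristics studied in the paper avoid such exhaustive sweeps over the state space: MCTS grows a search tree along simulated runs, typically selecting actions by a UCT-style bandit rule, while BRTDP maintains upper and lower bounds on the reachability value, so the computed result carries a guarantee that the naive stopping criterion above lacks.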


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Pranav Ashok (1)
  • Tomáš Brázdil (2)
  • Jan Křetínský (1)
  • Ondřej Slámečka (2)
  1. Technical University of Munich, Munich, Germany
  2. Masaryk University, Brno, Czech Republic