Planning Under Uncertainty Through Goal-Driven Action Selection

  • Juan Carlos Saborío
  • Joachim Hertzberg
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11352)


Online planning in domains with uncertainty and partial observability poses a series of performance challenges: agents must obtain information about the environment, quickly select actions with high reward prospects, and avoid very expensive mistakes, all while interleaving planning and execution in highly variable and uncertain domains. To reduce the number of mistakes and help an agent focus on directly relevant actions, we propose a goal-driven action selection method for planning in (PO)MDPs. The method introduces a reward bonus and a rollout policy for MCTS planners, both of which depend almost exclusively on a clear specification of the goal, and it produced promising results when planning in large domains of interest to cognitive and mobile robotics.
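To make the idea concrete, the following is a minimal, hypothetical sketch of how a goal-derived reward bonus and a biased rollout policy could plug into an MCTS simulator. All names (`Action`, `goal_relevance`, `biased_rollout_action`, `shaped_reward`) and the simple feature-counting scores are illustrative assumptions, not the authors' actual formulation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A planning action with the state features it may affect (assumed model)."""
    name: str
    effects: frozenset


def goal_relevance(action, goal_features):
    """Score an action by how many goal features it can affect (+1 floor
    so irrelevant actions keep nonzero sampling probability)."""
    return 1 + len(action.effects & goal_features)


def biased_rollout_action(actions, goal_features, rng):
    """Rollout policy: sample an action with probability proportional
    to its goal relevance, instead of uniformly at random."""
    weights = [goal_relevance(a, goal_features) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def shaped_reward(base_reward, state_features, goal_features, bonus=1.0):
    """Reward bonus: augment the simulator's reward in proportion to
    the number of goal features the current state already satisfies."""
    return base_reward + bonus * len(state_features & goal_features)
```

Both functions depend only on the goal specification (`goal_features`) and a coarse action model, which mirrors the paper's claim that the bias requires almost no other domain knowledge.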



We would like to thank our colleagues Sebastian Pütz and Felix Igelbrink for their suggested reward distribution in the Cellar domain, and the DAAD for supporting this work with a research grant.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Institute of Computer Science, University of Osnabrück, Osnabrück, Germany
  2. DFKI Robotics Innovation Center (Osnabrück), Osnabrück, Germany
