HI-VAL: Iterative Learning of Hierarchical Value Functions for Policy Generation

  • Roberto Capobianco
  • Francesco Riccio
  • Daniele Nardi
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 867)


Task decomposition is effective in various applications where the global complexity of a problem makes planning and decision-making too demanding. This is true, for example, in high-dimensional robotics domains, where (1) unpredictability and modeling limitations typically prevent the manual specification of robust behaviors, and (2) learning an action policy is challenging due to the curse of dimensionality. In this work, we borrow the concept of Hierarchical Task Networks (HTNs) to decompose the learning procedure, and we exploit Upper Confidence Tree (UCT) search to introduce Hi-Val, a novel iterative algorithm for hierarchical optimistic planning with learned value functions. To obtain better generalization and to generate policies, Hi-Val simultaneously learns and uses action values, which formalize constraints within the search space and reduce the dimensionality of the problem. We evaluate our algorithm both on a fetching task using a simulated 7-DOF KUKA lightweight arm and on a pick-and-delivery task with a Pioneer robot.
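The abstract's reliance on UCT search can be illustrated with the UCB1 child-selection rule at its core, which balances a node's estimated value against an optimism bonus for under-explored actions. The sketch below is a minimal illustration of that rule only, not Hi-Val itself; the function name, the constant `c`, and the example numbers are assumptions for demonstration.

```python
import math

def ucb1(total_value, visits, parent_visits, c=1.41):
    """UCB1 score used by UCT to pick the next child to expand:
    mean value (exploitation) plus an exploration bonus that shrinks
    as the child accumulates visits."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Two children of a node visited 100 times: with similar mean values,
# UCT prefers the less-visited child because its bonus is larger.
a = ucb1(total_value=60.0, visits=90, parent_visits=100)
b = ucb1(total_value=6.0, visits=10, parent_visits=100)
```

In Hi-Val, as the abstract describes, learned action values additionally constrain which children are considered at all, pruning the search space before this kind of selection rule is applied.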



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Roberto Capobianco¹
  • Francesco Riccio¹
  • Daniele Nardi¹
  1. Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Rome, Italy
