# Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning

## Abstract

We consider scenarios where a swarm of unmanned vehicles (UxVs) seeks to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on the use of state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, *Dynamic Domain Reduction for Multi-Agent Planning*, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environment, termed ‘sub-environments’, and characterizes the algorithm's performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulated results show significant improvement over using a standard *Monte Carlo tree search* in an environment with large state and action spaces.
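The baseline that the abstract compares against, a standard Monte Carlo tree search with UCB1 action selection (UCT), can be sketched as follows. This is an illustrative implementation on a hypothetical toy MDP (a one-dimensional corridor rewarding the rightmost cell), not the paper's environment or algorithm; all names and parameters here are assumptions for the sake of the example.

```python
import math
import random

# Toy MDP (illustrative only): a 1-D corridor of N_STATES cells.
# The agent moves left (-1) or right (+1) and earns reward 1.0
# whenever it ends a step in the rightmost cell.
N_STATES = 10
ACTIONS = (-1, +1)

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

class Node:
    """Per-state statistics kept by the search tree."""
    def __init__(self):
        self.visits = 0
        self.counts = {a: 0 for a in ACTIONS}     # times each action was tried
        self.value = {a: 0.0 for a in ACTIONS}    # running mean return per action

def uct_action(node, c=1.4):
    # UCB1 rule: exploit high-value actions, explore rarely tried ones.
    def score(a):
        if node.counts[a] == 0:
            return float('inf')
        return node.value[a] + c * math.sqrt(math.log(node.visits) / node.counts[a])
    return max(ACTIONS, key=score)

def rollout(state, depth, gamma=0.95):
    # Random-policy rollout to estimate the value of a leaf state.
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        state, r = step(state, random.choice(ACTIONS))
        ret += discount * r
        discount *= gamma
    return ret

def mcts(root_state, n_sims=3000, depth=20, gamma=0.95):
    tree = {}
    for _ in range(n_sims):
        state, path = root_state, []
        # Selection/expansion: descend while states are already in the tree;
        # add one new node per simulation.
        for _ in range(depth):
            if state not in tree:
                tree[state] = Node()
                break
            a = uct_action(tree[state])
            nxt, r = step(state, a)
            path.append((state, a, r))
            state = nxt
        ret = rollout(state, depth, gamma)
        # Backup: propagate the discounted return along the visited path.
        for s, a, r in reversed(path):
            ret = r + gamma * ret
            node = tree[s]
            node.visits += 1
            node.counts[a] += 1
            node.value[a] += (ret - node.value[a]) / node.counts[a]
    # Recommend the most-visited root action.
    return max(ACTIONS, key=lambda a: tree[root_state].counts[a])

random.seed(0)
print(mcts(root_state=0))  # should print 1: the search learns to move right
```

In environments with very large state and action spaces, the number of simulations this flat search needs to find rewarding regions grows quickly, which is the scalability issue the abstract's hierarchical abstractions are designed to address.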

## Keywords

Reinforcement learning · Multi-agent planning · Distributed robotics · Semi-Markov decision processes · Markov decision processes · Upper confidence bound tree search · Hierarchical planning · Hierarchical Markov decision processes · Model-based reinforcement learning · Swarm robotics · Dynamic domain reduction · Submodularity

## Notes

### Acknowledgements

This work was supported by ONR Award N00014-16-1-2836. The authors would like to thank the reviewers, as well as the organizers of the International Symposium on Multi-Robot and Multi-Agent Systems (MRS 2017), which provided the opportunity to obtain valuable feedback on this research.
