
Learning the optimal state-feedback via supervised imitation learning

  • Research Article
  • Published in: Astrodynamics

A Correction to this article was published on 11 February 2022


Abstract

Imitation learning is a control design paradigm that seeks to learn a control policy reproducing demonstrations from expert agents. By replacing expert demonstrations with optimal behaviours, the same paradigm leads to the design of control policies that closely approximate the optimal state-feedback. This approach requires training a machine learning algorithm (in our case, deep neural networks) directly on state-control pairs originating from optimal trajectories. We have shown in previous work that, when restricted to low-dimensional state and control spaces, this approach is very successful in several deterministic, non-linear problems in continuous time. In this work, we refine our previous studies using as a test case a simple quadcopter model with quadratic and time-optimal objective functions. We describe in detail the best learning pipeline we have developed, which is able to approximate the state-feedback map to very high accuracy via deep neural networks. We introduce the use of the softplus activation function in the hidden units of the neural networks and show that it results in a smoother control profile whilst retaining the benefits of rectifiers. We show how to evaluate the optimality of the trained state-feedback, and find that already with two layers the objective function reached and its optimal value differ by less than one percent. We later also consider an additional metric linked to the system's asymptotic behaviour: the time taken to converge to the policy's fixed point. With respect to these metrics, we show that improvements in the mean absolute error do not necessarily correspond to better policies.
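The following sketch makes the training step described above concrete: a feed-forward network with softplus hidden units is fitted, by mean-squared-error regression, to state-control pairs sampled from precomputed optimal trajectories, i.e. supervised imitation of the optimal state-feedback. It is a minimal illustration only, assuming PyTorch; the network width, optimiser settings, and all variable names are placeholders rather than the authors' actual pipeline.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def make_policy(state_dim, control_dim, width=64, hidden_layers=2):
    """Feed-forward policy with softplus activations in the hidden units."""
    layers, in_dim = [], state_dim
    for _ in range(hidden_layers):
        layers += [nn.Linear(in_dim, width), nn.Softplus()]
        in_dim = width
    layers.append(nn.Linear(in_dim, control_dim))  # linear output layer
    return nn.Sequential(*layers)


def train_policy(states, controls, epochs=200, lr=1e-3, batch_size=256):
    """Regress optimal controls on states (mean squared error, Adam)."""
    policy = make_policy(states.shape[1], controls.shape[1])
    optimiser = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(states, controls),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, u in loader:
            optimiser.zero_grad()
            loss_fn(policy(x), u).backward()
            optimiser.step()
    return policy


# Usage (illustrative): `states` and `controls` are float tensors stacking the
# samples of many optimal trajectories, e.g. produced offline by a direct
# transcription solver for the quadcopter model.
# policy = train_policy(states, controls)
# u = policy(x)  # approximate optimal state-feedback at state x
```

The optimality of a policy trained this way can then be judged, as described above, by rolling out the closed-loop system from sampled initial states and comparing the objective it accumulates with the value of the corresponding optimal trajectory.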



Author information

Corresponding author

Correspondence to Dario Izzo.

Additional information

Dharmesh Tailor holds a bachelor's degree in mathematics and computer science from Imperial College London (United Kingdom) and a master's degree in artificial intelligence from the University of Edinburgh (United Kingdom). Following his studies, he joined the European Space Agency as a Young Graduate Trainee in the Advanced Concepts Team, where his research looked at machine learning techniques for optimal control. He currently works at the RIKEN Center for AI Project (Japan) in the Approximate Bayesian Inference Team, researching reinforcement learning and probabilistic inference.

Dario Izzo graduated as a doctor of aeronautical engineering from the University Sapienza of Rome (Italy). He then took a second master's degree in satellite platforms at the University of Cranfield in the United Kingdom and completed his Ph.D. in mathematical modelling at the University Sapienza of Rome, where he lectured on classical mechanics and space flight mechanics. Dario Izzo later joined the European Space Agency and became the scientific coordinator of its Advanced Concepts Team. He devised and managed the Global Trajectory Optimization Competition events, ESA's Summer of Code in Space, and the Kelvins innovation and competition platform for space problems. He has published more than 170 papers in international journals and conferences, making key contributions to the understanding of flight mechanics and spacecraft control and pioneering techniques based on evolutionary and machine learning approaches. Dario Izzo received the Humies Gold Medal and led the team that won the 8th edition of the Global Trajectory Optimization Competition.


About this article


Cite this article

Tailor, D., Izzo, D. Learning the optimal state-feedback via supervised imitation learning. Astrodyn 3, 361–374 (2019). https://doi.org/10.1007/s42064-019-0054-0

