Active deep Q-learning with demonstration

  • Si-An Chen
  • Voot Tangkaratt
  • Hsuan-Tien Lin
  • Masashi Sugiyama
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2020 Journal Track


Reinforcement learning (RL) is a machine learning technique that aims to learn how to take actions in an environment so as to maximize a reward signal. Recent research has shown that expert demonstration can improve the learning efficiency of RL, but obtaining enough demonstration usually takes considerable effort, which in practice prevents training decent RL agents with expert demonstration. In this work, we propose Active Reinforcement Learning with Demonstration, a new framework that reduces the demonstration effort by allowing the RL agent to actively query for demonstration during training. Under this framework, we propose Active deep Q-Network, a novel query strategy based on a classical RL algorithm called deep Q-network (DQN). The proposed algorithm dynamically estimates the uncertainty of recent states and utilizes the queried demonstration data by optimizing a supervised loss in addition to the usual DQN loss. We propose two methods of estimating the uncertainty, based on two state-of-the-art DQN models: the divergence of bootstrapped DQN and the variance of noisy DQN. The empirical results validate that both methods not only learn faster than passive expert demonstration methods given the same amount of demonstration, but also reach super-expert performance across four different tasks.
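The query strategy and the combined objective described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes that the divergence among bootstrapped heads is measured as the Jensen-Shannon divergence of softmax policies, that the noisy-network uncertainty is the mean per-action variance of sampled Q-values, that "recent states" means a sliding window whose quantile serves as the query threshold, and that the supervised term is a large-margin classification loss of the kind used in deep Q-learning from demonstrations. The abstract specifies none of these details.

```python
import math
from collections import deque


def softmax(q_values, temperature=1.0):
    """Turn a list of Q-values into a probability distribution over actions."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def head_divergence(q_heads):
    """Jensen-Shannon divergence among the policies of K bootstrapped heads.

    q_heads: list of K lists, each holding one head's Q-values for a state.
    (Assumed measure; the abstract only says "divergence".)
    """
    policies = [softmax(q) for q in q_heads]
    n_actions = len(policies[0])
    mean_policy = [sum(p[a] for p in policies) / len(policies)
                   for a in range(n_actions)]
    return entropy(mean_policy) - sum(entropy(p) for p in policies) / len(policies)


def noisy_variance(q_samples):
    """Mean per-action variance of Q-values over S noise samples (assumed measure)."""
    n_samples, n_actions = len(q_samples), len(q_samples[0])
    total = 0.0
    for a in range(n_actions):
        column = [q[a] for q in q_samples]
        mean = sum(column) / n_samples
        total += sum((x - mean) ** 2 for x in column) / n_samples
    return total / n_actions


class RecentUncertaintyQuery:
    """Hypothetical rule: query when uncertainty is high relative to recent states."""

    def __init__(self, window=100, quantile=0.9):
        self.recent = deque(maxlen=window)  # sliding window of past uncertainties
        self.quantile = quantile

    def should_query(self, uncertainty):
        history = sorted(self.recent)
        self.recent.append(uncertainty)
        if not history:
            return False  # nothing to compare against yet
        index = min(int(self.quantile * len(history)), len(history) - 1)
        return uncertainty > history[index]


def margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin loss: zero only if the expert action wins by at least `margin`."""
    augmented = [q + (0.0 if a == expert_action else margin)
                 for a, q in enumerate(q_values)]
    return max(augmented) - q_values[expert_action]


def combined_loss(td_error, q_values, expert_action=None, weight=1.0):
    """Squared TD loss, plus the weighted supervised term on demonstrated states."""
    loss = 0.5 * td_error ** 2
    if expert_action is not None:
        loss += weight * margin_loss(q_values, expert_action)
    return loss
```

In a training loop built on this sketch, the agent would compute `head_divergence` or `noisy_variance` for each visited state, pass the result to `should_query`, and store any returned expert action so that `combined_loss` applies the supervised term to that transition. The window size, quantile, margin, and weight are illustrative placeholders rather than values taken from the paper.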


Active learning, Reinforcement learning, Learning from demonstration



MS was supported by KAKENHI 17H00757. SC and HL were partially supported by MOST 107-2628-E-002-008-MY3 of Taiwan.



Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. CSIE Department, National Taiwan University, Taipei, Taiwan
  2. RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
  3. Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
