
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation

  • Xin Wang
  • Wenhan Xiong
  • Hongmin Wang
  • William Yang Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)

Abstract

Existing research on vision-and-language grounding for robot navigation focuses on improving model-free deep reinforcement learning (DRL) models in synthetic environments. However, model-free DRL models do not consider the dynamics of real-world environments and often fail to generalize to new scenes. In this paper, we take a radical approach to bridging the gap between synthetic studies and real-world practices: we propose a novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-and-language navigation task. Our look-ahead module tightly integrates a look-ahead policy model with an environment model that predicts the next state and the reward. Experimental results suggest that our proposed method significantly outperforms the baselines and achieves the best performance on the real-world Room-to-Room dataset. Moreover, our scalable method is more generalizable when transferring to unseen environments.
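As a rough illustration of the hybrid scheme the abstract describes, the sketch below shows how an environment model that predicts the next state and reward can be rolled forward by a look-ahead policy to score candidate actions before the agent commits. This is a minimal sketch under stated assumptions: the module names, dimensionalities, greedy rollout, and summed-reward scoring are illustrative choices, not the authors' implementation.

```python
# Illustrative look-ahead planning sketch (not the paper's code): an
# environment model predicts (next_state, reward) for a state-action pair,
# and a short imagined rollout scores each candidate action.
import torch
import torch.nn as nn


class EnvModel(nn.Module):
    """Predicts (next_state, reward) from a state and an action embedding."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, state, action):
        h = self.trunk(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)


class LookAheadScorer(nn.Module):
    """Scores each candidate action by its imagined cumulative reward."""

    def __init__(self, state_dim: int, action_dim: int, horizon: int = 3):
        super().__init__()
        self.env_model = EnvModel(state_dim, action_dim)
        # A simple look-ahead policy used only inside the imagined rollout.
        self.rollout_policy = nn.Linear(state_dim, action_dim)
        self.horizon = horizon

    def forward(self, state, candidate_actions):
        # candidate_actions: (num_actions, action_dim) one-hot action encodings.
        scores = []
        for a in candidate_actions:
            s, act, total = state, a.unsqueeze(0), 0.0
            for _ in range(self.horizon):
                s, r = self.env_model(s, act)       # imagine one step ahead
                total = total + r                    # accumulate predicted reward
                idx = self.rollout_policy(s).argmax(dim=-1)
                act = torch.zeros_like(act)          # greedily pick the next
                act[0, idx] = 1.0                    # imagined action
            scores.append(total)
        return torch.stack(scores).squeeze(-1)       # one score per candidate


if __name__ == "__main__":
    state_dim, action_dim = 32, 6
    scorer = LookAheadScorer(state_dim, action_dim)
    state = torch.randn(1, state_dim)
    candidates = torch.eye(action_dim)               # six discrete actions
    print(scorer(state, candidates))                 # imagined-return scores
```

In practice these imagined-return scores would be combined with the model-free policy's action probabilities; the aggregation used here is left out because it is specific to the paper's full model.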

Keywords

Vision-and-language navigation · First-person view video · Model-based reinforcement learning

Supplementary material

Supplementary material 1: 474218_1_En_3_MOESM1_ESM.pdf (PDF, 1.8 MB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Xin Wang (1)
  • Wenhan Xiong (1)
  • Hongmin Wang (1)
  • William Yang Wang (1)

  1. University of California, Santa Barbara, USA
