Model-Based Robot Imitation with Future Image Similarity

  • A. Wu
  • A. J. Piergiovanni
  • M. S. Ryoo
Part of the following topical collections:
  1. Special Issue: Deep Learning for Robotic Vision


We present a visual imitation learning framework that learns robot action policies solely from expert samples, without any robot trials. Robot exploration and on-policy trials in a real-world environment can often be expensive or dangerous. We address this problem by learning a future scene prediction model solely from a collection of expert trajectories consisting of unlabeled example videos and actions, and by enabling action selection using future image similarity. In this approach, the robot learns to visually imagine the consequences of taking an action, and obtains the policy by evaluating how similar the predicted future image is to an expert sample. We develop an action-conditioned convolutional autoencoder, and present how we take advantage of future images for zero-online-trial imitation learning. We conduct experiments in simulated and real-life environments using a ground mobility robot that must reach target objects, with and without obstacles. We explicitly compare our models to multiple baseline methods requiring only offline samples. The results confirm that our proposed methods outperform previous methods, including 1.5\(\times\) and 2.5\(\times\) higher success rates than behavioral cloning in two different tasks.
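The action-selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`select_action`, `toy_predictor`), the discrete candidate action set, and the use of mean squared pixel error as the similarity measure are all assumptions made for illustration, and the toy predictor merely stands in for a learned action-conditioned prediction model.

```python
import numpy as np


def select_action(current_image, expert_future_image, candidate_actions, predict_future):
    """Pick the candidate action whose predicted future image is closest to the expert's.

    `predict_future(image, action)` is a hypothetical callable standing in for a
    learned action-conditioned future image predictor.
    """
    best_action, best_distance = None, np.inf
    for action in candidate_actions:
        predicted = predict_future(current_image, action)
        # Assumption: mean squared pixel error as the (dis)similarity measure.
        distance = float(np.mean((predicted - expert_future_image) ** 2))
        if distance < best_distance:
            best_action, best_distance = action, distance
    return best_action, best_distance


def toy_predictor(image, action):
    # Hypothetical stand-in model: the action shifts the scene horizontally
    # by `action` pixels (with wrap-around).
    return np.roll(image, shift=action, axis=1)


# Toy scene: a single bright vertical stripe in column 2.
image = np.zeros((8, 8))
image[:, 2] = 1.0

# Pretend the expert's next frame shows the stripe moved 3 pixels to the right.
expert_future = np.roll(image, shift=3, axis=1)

action, distance = select_action(image, expert_future, [-3, -1, 0, 1, 3], toy_predictor)
print(action, distance)
```

In this toy setup no robot trial is needed to choose the action: every candidate is evaluated purely by imagining its outcome and comparing that outcome with the expert frame.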


Robot action policy learning · Behavioral cloning · Model-based RL




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Indiana University, Bloomington, USA
  2. Stony Brook University, Stony Brook, USA
