Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion

  • Dinesh Jayaraman
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


Visual recognition systems mounted on autonomous moving agents face the challenge of unconstrained data, but simultaneously have the opportunity to improve their performance by moving to acquire new views of test data. In this work, we first show how a recurrent neural network-based system may be trained to perform end-to-end learning of motion policies suited for this “active recognition” setting. Further, we hypothesize that active vision requires an agent to have the capacity to reason about the effects of its motions on its view of the world. To verify this hypothesis, we attempt to induce this capacity in our active recognition pipeline, by simultaneously learning to forecast the effects of the agent’s motions on its internal representation of the environment conditional on all past views. Results across two challenging datasets confirm both that our end-to-end system successfully learns meaningful policies for active category recognition, and that “learning to look ahead” further boosts recognition performance.


Keywords: Active recognition, Camera motion, Active vision, Partially observable Markov decision process, Object instance



This research is supported in part by ONR PECASE N00014-15-1-2291. We also thank the Texas Advanced Computing Center for their generous support, and Mohsen Malmir and Jianxiong Xiao for their assistance in sharing the GERMS and SUN360 data, respectively.




Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. The University of Texas at Austin, Austin, USA
