Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions

  • Misha Wagner
  • Hector Basevi (corresponding author)
  • Rakshith Shetty
  • Wenbin Li
  • Mateusz Malinowski
  • Mario Fritz
  • Aleš Leonardis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11129)

Abstract

In-depth scene descriptions and question answering tasks have greatly increased the scope of today’s definition of scene understanding. While such tasks are in principle open-ended, current formulations primarily focus on describing only the current state of the scenes under consideration. In contrast, in this paper we focus on future scene states that are additionally conditioned on actions. We pose this as a question answering task, in which an answer about a future scene state has to be given from observations of the current scene and a question that includes a hypothetical action. Our solution is a hybrid model that integrates a physics engine into a question answering architecture in order to anticipate the future scene states resulting from object-object interactions caused by an action. We demonstrate first results on this challenging new problem and compare against baselines, outperforming fully data-driven end-to-end learning approaches.
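To make the simulate-then-answer idea behind the hybrid model concrete, the following is a minimal illustrative sketch, not the authors' implementation: object states estimated from the observed scene are instantiated in a physics engine (here pybullet, one possible engine), the question's hypothetical action is applied as a force, the simulation is rolled forward, and the answer is read off the predicted end state. The toy scene, the parsed push action, and the fall-off threshold are assumptions made for illustration; the perception and language components are omitted.

# Minimal sketch of the simulate-then-answer idea behind the hybrid model.
# Assumptions: object states are already estimated from the observed scene,
# and the question's hypothetical action has been parsed into a push force.
# pybullet is one possible physics engine; the scene below is a toy example.
import pybullet as p

def predict_future_state(objects, action, steps=240):
    """Roll the estimated scene forward under a hypothetical action."""
    p.connect(p.DIRECT)                      # headless physics simulation
    p.setGravity(0, 0, -9.8)

    uids = {}
    for name, (half_extents, mass, position) in objects.items():
        shape = p.createCollisionShape(p.GEOM_BOX, halfExtents=half_extents)
        uids[name] = p.createMultiBody(baseMass=mass,
                                       baseCollisionShapeIndex=shape,
                                       basePosition=position)

    # Apply the action (e.g. "push the red box") as a force on the target;
    # pybullet applies it only for the next simulation step (impulse-like).
    target, force = action
    pos, _ = p.getBasePositionAndOrientation(uids[target])
    p.applyExternalForce(uids[target], -1, force, pos, p.WORLD_FRAME)

    for _ in range(steps):                   # 240 steps ~ 1 s at 240 Hz
        p.stepSimulation()

    future = {name: p.getBasePositionAndOrientation(uid)[0]
              for name, uid in uids.items()}
    p.disconnect()
    return future

# Toy scene: a static table top (mass 0) with a small box resting on it.
scene = {
    "table":   ([0.5, 0.5, 0.02], 0.0, [0, 0, 0.50]),
    "red_box": ([0.05, 0.05, 0.05], 0.2, [0, 0, 0.57]),
}

# What-if question: "If the red box is pushed sideways, does it fall off?"
future = predict_future_state(scene, action=("red_box", [300, 0, 0]))
fell_off = future["red_box"][2] < 0.3        # well below the table surface
print("Predicted answer:", "yes" if fell_off else "no")

In the paper's setting the predicted future state would then be rendered back into a scene description from which the answer is generated; the hand-coded threshold above simply stands in for that answering step.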

Keywords

Scene understanding · Visual Turing test · Visual question answering · Intuitive physics

Notes

Acknowledgements

We acknowledge MoD/Dstl and EPSRC for providing the grant to support the UK academics' involvement in a Department of Defense-funded MURI project through EPSRC grant EP/N019415/1.

Supplementary material

Supplementary material 1: 478770_1_En_32_MOESM1_ESM.pdf (PDF, 1,413 KB)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Misha Wagner (1)
  • Hector Basevi (1) (corresponding author)
  • Rakshith Shetty (2)
  • Wenbin Li (2)
  • Mateusz Malinowski (2)
  • Mario Fritz (3)
  • Aleš Leonardis (1)

  1. University of Birmingham, Birmingham, UK
  2. Max Planck Institute for Informatics, Saarbrücken, Germany
  3. CISPA Helmholtz Center i.G., Saarbrücken, Germany
