A Dataset and Architecture for Visual Reasoning with a Working Memory

  • Guangyu Robert Yang
  • Igor Ganichev
  • Xiao-Jing Wang
  • Jonathon Shlens
  • David Sussillo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)


Abstract

A vexing problem in artificial intelligence is reasoning about events that occur in complex, changing visual stimuli, such as in video analysis or game play. Inspired by a rich tradition of visual reasoning and memory in cognitive psychology and neuroscience, we developed an artificial, configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory – problems that remain challenging for modern deep learning architectures. We additionally propose a deep learning architecture that performs competitively on other diagnostic VQA datasets (e.g. CLEVR) as well as on easy settings of the COG dataset. However, several settings of COG result in datasets that are progressively more challenging to learn. After training, the network can zero-shot generalize to many new tasks. Preliminary analyses of the network architectures trained on COG demonstrate that the network accomplishes the task in a manner interpretable to humans.
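The COG tasks pair short, video-like sequences of simple objects with questions whose answers require remembering earlier frames. A minimal sketch of what one such trial might look like is given below; this is not the authors' generator, and the shape/color vocabularies, frame structure, and the "latest circle" question are illustrative assumptions only:

```python
import random

# Hypothetical vocabularies for illustration; the real COG task grammar is much larger.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]

def make_trial(n_frames=4, objects_per_frame=2, seed=0):
    """Generate one toy COG-style trial: a sequence of frames, each a list of
    (shape, color, (x, y)) objects, plus a question whose answer requires
    remembering earlier frames ("what was the color of the latest circle?")."""
    rng = random.Random(seed)
    frames = []
    for _ in range(n_frames):
        frame = [
            (rng.choice(SHAPES), rng.choice(COLORS), (rng.random(), rng.random()))
            for _ in range(objects_per_frame)
        ]
        frames.append(frame)

    # The answer is the color of the most recently shown circle, so a model
    # must carry information across frames in a working memory.
    answer = None
    for frame in frames:
        for shape, color, _ in frame:
            if shape == "circle":
                answer = color

    question = "color of latest circle"
    return frames, question, answer
```

Because the number of frames, objects per frame, and the temporal depth of the question are all free parameters, difficulty can be dialed up or down, which is the sense in which such a dataset is "configurable".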


Keywords: Visual reasoning · Visual question answering · Recurrent network · Working memory

Supplementary material

Supplementary material 1: 474197_1_En_44_MOESM1_ESM.pdf (PDF, 2.1 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Guangyu Robert Yang (1, 3)
  • Igor Ganichev (2)
  • Xiao-Jing Wang (1)
  • Jonathon Shlens (2)
  • David Sussillo (2)

  1. Center for Neural Science, New York University, New York, USA
  2. Google Brain, Mountain View, USA
  3. Department of Neuroscience, Columbia University, New York, USA