Towards a Visual Story Network Using Multiple Views for Object Recognition at Different Levels of Spatiotemporal Context

The Physics of the Mind and Brain Disorders

Part of the book series: Springer Series in Cognitive and Neural Systems (SSCNS, volume 11)

Abstract

We present a general computational model for multi-class visual recognition, which we term the Visual Story Network (VSN). The model aims to generalize and integrate ideas from successful hierarchical approaches to automated recognition, relating models from computer vision and brain science: today’s deep neural networks on one side and more classical ideas for visual learning and recognition from neuroscience, such as the well-known Adaptive Resonance Theory, on the other. Our recursive graph-based model has the advantage of enabling rich interactions between classes and features at different levels of interpretation and abstraction. The Visual Story Network offers multiple views of a visual concept. The basic, bottom-up view is based on the object’s current local appearance. The higher-level view is based on the larger spatiotemporal context, such as the role played by that concept in the overall story. This story includes spatial relations and interactions with other objects, as well as events and global information from the scene. The structure of the VSN can be constructed efficiently through step-by-step updates, during which new features or complex classifiers are added one by one. Given a certain VSN structure, its weights can also be fully learned or fine-tuned, end-to-end, by efficient methods such as backpropagation with stochastic gradient descent. In its general form, the VSN is a graph of nonlinear classifiers or feature nodes that are automatically selected from a large pool and combined to form new nodes. Each newly learned node then becomes a potential new feature. The feature pool can contain both manually designed features and more complex classifiers pre-learned in previous steps, each copied many times at different scales and locations. In this manner we can learn and grow both a deep, complex graph of classifiers and a rich pool of features at different levels of abstraction and interpretation. At every stage the VSN can be fully trained, end-to-end, either in a supervised way or in a novel, naturally self-supervised way, which we discuss in detail. The proposed graph of classifiers becomes a multi-class system with a recursive structure, suitable for deep detection and recognition of several classes simultaneously.
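The incremental construction described in the abstract (select features from a pool, combine them into a new classifier node, return that node's output to the pool) can be illustrated with a small sketch. This is not the authors' algorithm: the logistic-regression node, the correlation-based selection rule, and the names `train_node` and `grow_pool` are all assumptions chosen only to make the idea concrete.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_node(X, y, steps=300, lr=0.5):
    """Fit one logistic-regression node by plain gradient descent (assumed node type)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad = p - y                          # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def grow_pool(pool, y, n_new_nodes=3, k=2):
    """Add nodes one by one; each new node's output is appended to the feature pool."""
    pool = pool.copy()
    nodes = []
    for _ in range(n_new_nodes):
        # Hypothetical selection rule: keep the k pool features
        # with the largest absolute correlation with the labels.
        yc = y - y.mean()
        pc = pool - pool.mean(axis=0)
        denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(yc) + 1e-12
        score = np.abs(pc.T @ yc) / denom
        chosen = np.argsort(score)[-k:]
        # Combine the selected features into a new nonlinear node.
        w, b = train_node(pool[:, chosen], y)
        out = sigmoid(pool[:, chosen] @ w + b)
        nodes.append((chosen, w, b))
        # The newly learned node becomes a usable feature for later nodes.
        pool = np.column_stack([pool, out])
    return pool, nodes
```

Because later nodes may select the outputs of earlier ones, the list returned in `nodes` implicitly encodes a small directed graph of classifiers. A fuller treatment would also fine-tune all weights jointly, for example by backpropagation, which this sketch omits.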

Acknowledgments

The authors thank artist Cristina Lazar for “Her Eyes”—the original drawing of the face in Figs. 26.1, 26.2, and 26.3, which has been reproduced here with the artist’s permission. The authors also thank Shumeet Baluja and Jay Yagnik for interesting discussions and valuable feedback on these ideas. M. Leordeanu was supported by CNCS-UEFISCDI, under PNIII-P4-ID-ERC-2016-0007.

Author information

Corresponding author

Correspondence to Marius Leordeanu.

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Leordeanu, M., Sukthankar, R. (2017). Towards a Visual Story Network Using Multiple Views for Object Recognition at Different Levels of Spatiotemporal Context. In: Opris, I., Casanova, M.F. (eds) The Physics of the Mind and Brain Disorders. Springer Series in Cognitive and Neural Systems, vol 11. Springer, Cham. https://doi.org/10.1007/978-3-319-29674-6_26
