Towards a Visual Story Network Using Multiple Views for Object Recognition at Different Levels of Spatiotemporal Context

Leordeanu, Marius; Sukthankar, Rahul

doi:10.1007/978-3-319-29674-6_26

Part of the book series: Springer Series in Cognitive and Neural Systems (SSCNS, volume 11)

Abstract

We present a general computational model for multi-class visual recognition, which we term the Visual Story Network (VSN). The model aims to generalize and integrate ideas from successful hierarchical recognition approaches in computer vision and brain science, relating today’s deep neural networks to more classical ideas for visual learning and recognition from neuroscience, such as the well-known Adaptive Resonance Theory. Our recursive graph-based model has the advantage of enabling rich interactions between classes and features at different levels of interpretation and abstraction. The Visual Story Network offers multiple views of a visual concept. The basic, bottom-up view is based on the object’s current local appearance. The higher-level view is based on the larger spatiotemporal context, such as the role played by that concept in the overall story. This story includes spatial relations and interactions with other objects, as well as events and global information from the scene. The structure of the VSN can be constructed efficiently through step-by-step updates, during which new features or complex classifiers are added one by one. Given a certain VSN structure, its weights could also be fully learned or fine-tuned, end-to-end, by efficient methods such as backpropagation with stochastic gradient descent. In its general form, the VSN is a graph of nonlinear classifiers or feature nodes that are automatically selected from a large pool and combined to form new nodes. Each newly learned node then becomes a potential new feature. The feature pool can contain both manually designed features and more complex classifiers pre-learned in previous steps, each copied many times at different scales and locations. In this manner we can learn and grow both a deep, complex graph of classifiers and a rich pool of features at different levels of abstraction and interpretation.
At every stage the VSN can be fully trained, end-to-end, either in a supervised way or in a novel, naturally self-supervised way, which we will discuss in detail. Our proposed graph of classifiers becomes a multi-class system with a recursive structure, suitable for deep detection and recognition of several classes simultaneously.
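
The step-by-step growth described in the abstract (select features from a pool, combine them into a new nonlinear node, then return that node to the pool as a candidate feature) can be illustrated with a toy sketch. This is not the authors' implementation: the correlation-based selection rule, the logistic node, the plain gradient-descent fit, and the names `select_features`, `fit_node`, and `grow_pool` are all assumptions chosen to keep the example short and self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_features(pool, y, k):
    """Pick the k pool columns most correlated (in absolute value) with y.
    Assumed selection criterion, for illustration only."""
    centered = pool - pool.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(centered, axis=0) * np.linalg.norm(yc) + 1e-12
    corr = np.abs(centered.T @ yc) / denom
    return np.argsort(-corr)[:k]

def fit_node(x, y, steps=500, lr=0.5):
    """Fit one logistic node on the selected features by gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(x @ w + b) - y      # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def grow_pool(pool, y, rounds=3, k=2):
    """Each round learns one node and appends its output to the pool,
    so later nodes can build on earlier ones."""
    nodes = []
    for _ in range(rounds):
        idx = select_features(pool, y, k)
        w, b = fit_node(pool[:, idx], y)
        out = sigmoid(pool[:, idx] @ w + b)
        pool = np.column_stack([pool, out])  # the new node becomes a candidate feature
        nodes.append((idx, w, b))
    return pool, nodes

# Synthetic data (hypothetical): five raw features, label depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

pool, nodes = grow_pool(X, y)
print("pool width before/after:", X.shape[1], pool.shape[1])
```

In this sketch every node sees the whole pool, including outputs of earlier nodes, which is what makes the resulting structure a graph rather than a fixed stack of layers. End-to-end fine-tuning of all weights, which the abstract also mentions, is not shown.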

Acknowledgments

The authors thank artist Cristina Lazar for “Her Eyes”—the original drawing of the face in Figs. 26.1, 26.2, and 26.3, which has been reproduced here with the artist’s permission. The authors also thank Shumeet Baluja and Jay Yagnik for interesting discussions and valuable feedback on these ideas. M. Leordeanu was supported by CNCS-UEFISCDI, under PNIII-P4-ID-ERC-2016-0007.

Author information

Authors and Affiliations

Institute of Mathematics of the Romanian Academy, Bucharest, Romania
Marius Leordeanu
Google, Mountain View, CA, USA
Rahul Sukthankar

Corresponding author

Correspondence to Marius Leordeanu.

Editor information

Editors and Affiliations

Miami Project to Cure Paralysis, Department of Neurological Surgery, Miller School of Medicine, University of Miami, Miami, FL, USA
Ioan Opris
Greenville Health System, University of South Carolina, School of Medicine Greenville, Greenville, SC, USA
Manuel F. Casanova

About this chapter

Cite this chapter

Leordeanu, M., Sukthankar, R. (2017). Towards a Visual Story Network Using Multiple Views for Object Recognition at Different Levels of Spatiotemporal Context. In: Opris, I., Casanova, M.F. (eds) The Physics of the Mind and Brain Disorders. Springer Series in Cognitive and Neural Systems, vol 11. Springer, Cham. https://doi.org/10.1007/978-3-319-29674-6_26

DOI: https://doi.org/10.1007/978-3-319-29674-6_26
Published: 02 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29672-2
Online ISBN: 978-3-319-29674-6
eBook Packages: Biomedical and Life Sciences (R0)
