Abstract
We present a general computational model for multi-class visual recognition, which we term the Visual Story Network (VSN). The model aims to generalize and integrate ideas from several successful hierarchical approaches to automated recognition, relating models from computer vision and brain science: today’s deep neural networks on one side, and more classical ideas for visual learning and recognition from neuroscience, such as the well-known Adaptive Resonance Theory, on the other. Our recursive graph-based model has the advantage of enabling rich interactions between classes and features from different levels of interpretation and abstraction. The Visual Story Network offers multiple views of a visual concept. The basic, bottom-up view is based on the object’s current local appearance. The higher-level view is based on the larger spatiotemporal context, such as the role played by that concept in the overall story. This story includes the spatial relations and interactions with other objects, as well as events and global information from the scene. The structure of the VSN can be constructed efficiently through step-by-step updates, during which new features or complex classifiers are added one by one. Given a certain VSN structure, its weights can also be fully learned or fine-tuned, end-to-end, by efficient methods such as backpropagation with stochastic gradient descent. In its general form, the VSN is a graph of nonlinear classifiers or feature nodes that are automatically selected from a large pool and combined to form new nodes. Each newly learned node then becomes a potential new feature. Our feature pool can contain both manually designed features and more complex classifiers pre-learned in previous steps, each copied many times at different scales and locations. In this manner we can learn and grow both a deep, complex graph of classifiers and a rich pool of features at different levels of abstraction and interpretation.
At every stage the VSN can be fully trained, end-to-end, either in a supervised way or in a novel, naturally self-supervised way, which we will discuss in detail. Our proposed graph of classifiers becomes a multi-class system with a recursive structure, suitable for deep detection and recognition of several classes simultaneously.
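The incremental construction described in the abstract (select features from a pool, combine them into a new classifier node, then return that node to the pool as a feature) can be illustrated with a small sketch. This is only a toy under stated assumptions, not the chapter's actual algorithm: the correlation-based selection rule, the logistic-regression node, the synthetic data, and every name in the code (`grow`, `train_logistic`, `abs_correlation`, and so on) are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(columns, labels, epochs=200, lr=0.5, seed=0):
    """Fit one logistic node on the chosen feature columns with plain SGD."""
    rng = random.Random(seed)
    w, b = [0.0] * len(columns), 0.0
    order = list(range(len(labels)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = [col[i] for col in columns]
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - labels[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b


def node_output(columns, w, b):
    """Evaluate a trained node on every sample; the result is a new feature column."""
    n = len(columns[0])
    return [sigmoid(sum(wj * col[i] for wj, col in zip(w, columns)) + b)
            for i in range(n)]


def abs_correlation(col, labels):
    """Assumed selection score: absolute Pearson correlation with the labels."""
    n = len(labels)
    mx, my = sum(col) / n, sum(labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(col, labels))
    vx = sum((x - mx) ** 2 for x in col)
    vy = sum((y - my) ** 2 for y in labels)
    return 0.0 if vx == 0 or vy == 0 else abs(cov) / math.sqrt(vx * vy)


def accuracy(col, labels):
    return sum((p > 0.5) == (y == 1) for p, y in zip(col, labels)) / len(labels)


def grow(pool, labels, steps=2, k=2):
    """Add one node per step; each new node joins the pool as a usable feature."""
    for step in range(steps):
        ranked = sorted(pool, key=lambda name: abs_correlation(pool[name], labels),
                        reverse=True)
        chosen = [pool[name] for name in ranked[:k]]
        w, b = train_logistic(chosen, labels, seed=step)
        pool[f"node_{step}"] = node_output(chosen, w, b)
    return pool


# Toy data (an assumption): the label depends on x0 and x1; "noise" is irrelevant.
rng = random.Random(0)
n = 200
x0 = [rng.random() for _ in range(n)]
x1 = [rng.random() for _ in range(n)]
noise = [rng.random() for _ in range(n)]
labels = [1 if a + b > 1.0 else 0 for a, b in zip(x0, x1)]

pool = grow({"x0": x0, "x1": x1, "noise": noise}, labels, steps=2, k=2)
print(sorted(pool), accuracy(pool["node_0"], labels))
```

In this sketch the second node can take the first node's output as one of its inputs, which is the sense in which a learned classifier becomes a feature for later classifiers. A fuller version would also replicate pool entries across scales and locations and fine-tune all node weights jointly, neither of which is attempted here.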
Acknowledgments
The authors thank artist Cristina Lazar for “Her Eyes”—the original drawing of the face in Figs. 26.1, 26.2, and 26.3, which has been reproduced here with the artist’s permission. The authors also thank Shumeet Baluja and Jay Yagnik for interesting discussions and valuable feedback on these ideas. M. Leordeanu was supported by CNCS-UEFISCDI, under PNIII-P4-ID-ERC-2016-0007.
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Leordeanu, M., Sukthankar, R. (2017). Towards a Visual Story Network Using Multiple Views for Object Recognition at Different Levels of Spatiotemporal Context. In: Opris, I., Casanova, M.F. (eds) The Physics of the Mind and Brain Disorders. Springer Series in Cognitive and Neural Systems, vol 11. Springer, Cham. https://doi.org/10.1007/978-3-319-29674-6_26
DOI: https://doi.org/10.1007/978-3-319-29674-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29672-2
Online ISBN: 978-3-319-29674-6
eBook Packages: Biomedical and Life Sciences (R0)