Language Label Learning for Visual Concepts Discovered from Video Sequences

  • Prithwijit Guha
  • Amitabha Mukerjee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4840)


Computational models of grounded language learning have been based on the premise that words and concepts are learned simultaneously. Given the mounting cognitive evidence for concept formation in infants, we argue that the availability of pre-lexical concepts (learned from image sequences) leads to considerable computational efficiency in word acquisition. Key to the process is a model of bottom-up visual attention in dynamic scenes. Background learning and foreground segmentation are used to generate robust tracking and detect occlusion events. Trajectories are clustered to obtain motion event concepts. The object concepts (image schemas) are abstracted from the combined appearance and motion data. The set of acquired concepts under visual attentive focus is then correlated with contemporaneous commentary to learn the grounded semantics of words and multi-word phrasal concatenations from the narrative. We demonstrate that even from a mere half hour of video (of a scene involving many objects and activities), a number of rudimentary concepts can be discovered. When these concepts are associated with unedited English commentary, we find that several words emerge: approximately half the concepts identified from the video are associated with the correct words. Thus, the computational model reflects the beginning of language comprehension, based on attentional parsing of the visual data. Finally, multi-word phrasal concatenations, a precursor to syntax, are observed to emerge where they are more salient referents than single words.
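The step of correlating attended concepts with contemporaneous commentary can be illustrated with a small sketch. The abstract does not state which association measure is used, so the code below assumes a mutual-information-style co-occurrence score; the function name `associate_words`, the episode format, and the toy data are all hypothetical and serve only to show the idea.

```python
from collections import Counter
from math import log


def associate_words(episodes):
    """Pick, for each concept, the word that co-occurs with it most informatively.

    `episodes` is a list of (concept_label, words) pairs: the concept under
    attentional focus and the words of the commentary uttered at that time.
    The scoring rule is an assumption, not the method from the paper.
    """
    n = len(episodes)
    concept_counts = Counter()
    word_counts = Counter()
    joint_counts = Counter()
    for concept, words in episodes:
        concept_counts[concept] += 1
        for word in set(words):  # count each word once per episode
            word_counts[word] += 1
            joint_counts[(concept, word)] += 1

    best = {}
    for (concept, word), joint in joint_counts.items():
        p_joint = joint / n
        p_concept = concept_counts[concept] / n
        p_word = word_counts[word] / n
        # Contribution of this (concept, word) pair to mutual information.
        score = p_joint * log(p_joint / (p_concept * p_word))
        if concept not in best or score > best[concept][1]:
            best[concept] = (word, score)
    return {concept: word for concept, (word, _) in best.items()}


# Toy commentary aligned with the attended concept (invented data).
episodes = [
    ("car", ["the", "car", "moves", "left"]),
    ("car", ["a", "car", "goes", "by"]),
    ("person", ["the", "man", "walks"]),
    ("person", ["a", "man", "crosses", "the", "road"]),
    ("car", ["the", "car", "turns"]),
]
print(associate_words(episodes))
```

Frequent function words such as "the" co-occur with every concept and therefore score low, while words that appear mostly alongside a single concept score high. That is the intuition behind using attention to narrow the candidate referents before association.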


Visual Attention, Single Word, Visual Saliency, Dynamic Scene, Textual Narrative
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Prithwijit Guha (1)
  • Amitabha Mukerjee (2)
  1. Department of Electrical Engineering, Indian Institute of Technology, Kanpur, Kanpur - 208016, Uttar Pradesh, India
  2. Department of Computer Science & Engineering, Indian Institute of Technology, Kanpur, Kanpur - 208016, Uttar Pradesh, India
