Embodied Active Vision in Language Learning and Grounding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4840)


Most cognitive studies of language acquisition, in both natural and artificial systems, have focused on purely linguistic information as the central constraint. However, we argue that non-linguistic information, such as vision and the talker’s attention, also plays a major role in language acquisition. To support this argument, this chapter reports two studies of embodied language learning – one on natural intelligence and one on artificial intelligence. First, we developed a novel method that describes the visual learning environment from a young child’s point of view. We built a multi-camera sensing environment consisting of two head-mounted mini cameras, one placed on the child’s forehead and one on the parent’s. The major result is that the child uses their body to constrain the visual information they perceive, and by doing so arrives at an embodied solution to the reference-uncertainty problem in language learning. In our second study, we developed a learning system trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. The system collects acoustic signals in concert with user-centric multisensory information from non-speech modalities, such as the user’s perspective video, gaze positions, head directions, and hand movements. A multimodal learning algorithm uses these data first to spot words in continuous speech and then to associate action verbs and object names with their perceptually grounded meanings. As with human learners, the central ideas of our computational system are to use non-speech contextual information to facilitate word spotting, and to use body movements as deictic references that associate temporally co-occurring data from different modalities and build a visually grounded lexicon.
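The association step described above can be illustrated with a minimal cross-situational learning sketch: each learning episode pairs the words spotted in an utterance with the referents (objects or actions) that body cues such as gaze and head direction mark as attended in the same time window, and each word is mapped to the referent with which it most reliably co-occurs. This is not the authors’ actual algorithm, only a simplified co-occurrence-counting stand-in; the `associate` function and the toy episodes are hypothetical.

```python
from collections import defaultdict

def associate(episodes):
    """Cross-situational word-referent association by co-occurrence counting.

    episodes: list of (words, referents) pairs, where words is a list of
    spotted word tokens and referents is a list of visually attended
    objects or actions present in the same time window.
    Returns a dict mapping each word to its most strongly associated referent.
    """
    counts = defaultdict(lambda: defaultdict(int))
    word_totals = defaultdict(int)
    for words, referents in episodes:
        for w in words:
            word_totals[w] += len(referents)
            for r in referents:
                counts[w][r] += 1
    lexicon = {}
    for w, refs in counts.items():
        # pick the referent maximizing the conditional probability P(referent | word)
        lexicon[w] = max(refs, key=lambda r: refs[r] / word_totals[w])
    return lexicon

# Toy episodes: ambiguous scenes are disambiguated across situations,
# mirroring how repeated embodied cues narrow down reference uncertainty.
episodes = [
    (["grab", "cup"], ["cup", "table"]),
    (["see", "dog"], ["dog"]),
    (["cup", "here"], ["cup"]),
    (["dog", "runs"], ["dog", "ball"]),
]
print(associate(episodes))
```

In a single episode, “cup” is as consistent with the table as with the cup; only the statistics across episodes resolve the ambiguity, which is why the embodied attentional cues that select the candidate referents matter so much.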


Keywords: Language Learning · Lexical Items · Active Vision · Head Direction · Word Learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Chen Yu
  1. Indiana University, Bloomington, IN 47401, USA
