Embodied Active Vision in Language Learning and Grounding
Most cognitive studies of language acquisition, in both natural and artificial systems, have focused on purely linguistic information as the central constraint. We argue, however, that non-linguistic information, such as vision and talkers' attention, also plays a major role in language acquisition. To support this argument, this chapter reports two studies of embodied language learning, one on natural intelligence and one on artificial intelligence. In the first study, we developed a novel method for describing the visual learning environment from a young child's point of view: a multi-camera sensing environment consisting of two head-mounted mini cameras, one placed on the child's forehead and one on the parent's. The major result is that children use their bodies to constrain the visual information they perceive, and in doing so arrive at an embodied solution to the reference-uncertainty problem in language learning. In the second study, we developed a learning system trained in an unsupervised mode, in which users perform everyday tasks while providing natural-language descriptions of their behaviors. The system collects acoustic signals together with user-centric multisensory information from non-speech modalities, such as first-person video, gaze positions, head directions, and hand movements. A multimodal learning algorithm uses these data first to spot words in continuous speech and then to associate action verbs and object names with their perceptually grounded meanings. As with human learners, the central ideas of our computational system are to use non-speech contextual information to facilitate word spotting, and to exploit body movements as deictic references that bind temporally co-occurring data from different modalities into a visually grounded lexicon.
Keywords: Language Learning · Lexical Item · Active Vision · Head Direction · Word Learning
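The association step described in the abstract, binding temporally co-occurring words and percepts into a lexicon, can be illustrated with a minimal cross-situational learning sketch. This is an illustrative simplification, not the chapter's actual system (which operates on continuous speech and rich multimodal features); the function name, the episode format, and the toy data below are assumptions made for the example. Each episode pairs the words heard in a time window with the objects or actions attended to in that same window, and a word is linked to the referent it co-occurs with most reliably:

```python
from collections import defaultdict

def cross_situational_learner(episodes):
    """Associate words with referents by co-occurrence counting.

    episodes: list of (words, referents) pairs, where each pair holds
    the set of words heard and the set of percepts attended to in the
    same time window (the deictic temporal binding).
    Returns a dict mapping each word to its best-supported referent.
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[word][referent]
    for words, referents in episodes:
        for w in words:
            for r in referents:
                counts[w][r] += 1
    # For each word, pick the referent it co-occurred with most often.
    return {w: max(refs, key=refs.get) for w, refs in counts.items()}

# Toy data (hypothetical): three episodes of paired speech and percepts.
episodes = [
    ({"look", "ball"}, {"BALL", "TABLE"}),
    ({"get", "ball"}, {"BALL"}),
    ({"look", "cup"}, {"CUP", "TABLE"}),
]
lexicon = cross_situational_learner(episodes)
```

Here "ball" maps to BALL because the ambiguity of any single episode is resolved across episodes: BALL is the only percept present every time "ball" is heard. The chapter's system follows the same principle while additionally using non-speech context to segment the word candidates in the first place.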