Building a Talking Baby Robot: A Contribution to the Study of Speech Acquisition and Evolution

  • Jihène E. Serkhane
  • Jean-Luc Schwartz
  • Pierre Bessière
Part of the Springer Tracts in Advanced Robotics book series (STAR, volume 46)


Speech is a perceptuo-motor system. A natural computational modelling framework is provided by cognitive robotics, or more precisely speech robotics, which is also based on embodiment, multimodality, development, and interaction. This chapter describes the bases of a virtual baby robot that consists of an articulatory model integrating the non-uniform growth of the vocal tract, a set of sensors, and a learning model. The articulatory model delivers the sagittal contour, lip shape and acoustic formants from seven input parameters that characterize the configurations of the jaw, the tongue, the lips and the larynx. To simulate the growth of the vocal tract from birth to adulthood, a process modifies the longitudinal dimension of the vocal tract shape as a function of age. The auditory system of the robot comprises a “phasic” system for event detection over time, and a “tonic” system to track formants. The model of visual perception specifies the basic lip characteristics: height, width, area and protrusion. The orosensorial channel, which provides tactile sensations on the lips, the tongue and the palate, is elaborated as a model for the prediction of tongue–palatal contacts from articulatory commands. Learning relies on Bayesian programming, which proceeds in two phases: (i) specification of the variables, decomposition of the joint distribution and identification of the free parameters through exploration of a learning set; and (ii) utilization, which relies on questions about the joint distribution.
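The two phases of Bayesian programming named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's model: the two discrete variables (a motor variable M and a sensory variable S), the decomposition P(M, S) = P(M) P(S | M), the smoothed-count estimator, the toy learning set and all function names are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy variables: one motor variable M and one sensory variable S.
MOTOR_VALUES = ["jaw_open", "jaw_closed"]
SENSOR_VALUES = ["high_F1", "low_F1"]


def learn(samples, motor_values, sensor_values, prior_count=1.0):
    """Phase (i): identify the free parameters of P(M) and P(S | M).

    Assumed decomposition: P(M, S) = P(M) * P(S | M).
    Parameters are estimated from (motor, sensor) pairs with add-one
    (Laplace) smoothing, an illustrative choice.
    """
    m_counts = Counter(m for m, _ in samples)
    ms_counts = Counter(samples)
    total = len(samples) + prior_count * len(motor_values)
    p_m = {m: (m_counts[m] + prior_count) / total for m in motor_values}
    p_s_given_m = {
        m: {
            s: (ms_counts[(m, s)] + prior_count)
            / (m_counts[m] + prior_count * len(sensor_values))
            for s in sensor_values
        }
        for m in motor_values
    }
    return p_m, p_s_given_m


def ask_motor_given_sensor(p_m, p_s_given_m, s):
    """Phase (ii): answer the question P(M | S = s) by Bayesian inversion."""
    unnormalised = {m: p_m[m] * p_s_given_m[m][s] for m in p_m}
    z = sum(unnormalised.values())
    return {m: v / z for m, v in unnormalised.items()}


# Invented learning set standing in for an exploration of the motor space.
learning_set = (
    [("jaw_open", "high_F1")] * 8
    + [("jaw_open", "low_F1")] * 2
    + [("jaw_closed", "low_F1")] * 9
    + [("jaw_closed", "high_F1")] * 1
)

p_m, p_s_given_m = learn(learning_set, MOTOR_VALUES, SENSOR_VALUES)
posterior = ask_motor_given_sensor(p_m, p_s_given_m, "high_F1")
print(posterior)
```

In this toy setting the question asked of the joint distribution is an inversion: given a perceived sound category, which motor configuration most probably produced it. Other questions, such as predicting S from M, would reuse the same learned terms.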


Keywords (machine-generated, not supplied by the authors): Speech Production; Vocal Tract; Speech Sound; Speech Development; Audiovisual Speech





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jihène E. Serkhane (1)
  • Jean-Luc Schwartz (1)
  • Pierre Bessière (2)

  1. CNRS - Institut de la Communication Parlée (ICP)
  2. CNRS - Grenoble Université
