A multimodal approach of generating 3D human-like talking agent


This paper introduces a multimodal framework for generating a 3D human-like talking agent that can communicate with users through speech, lip movement, head motion, facial expression, and body animation. In this framework, lip movements are obtained by searching and matching acoustic features, represented by Mel-frequency cepstral coefficients (MFCCs), in an audio-visual bimodal database. Head motion is synthesized by visual prosody, which maps textual prosodic features to rotational and translational parameters. Facial expression and body animation are generated by transferring motion data to the skeleton. A simplified high-level Multimodal Marker Language (MML), in which only a few fields are needed to coordinate the agent's channels, is introduced to drive the agent. Experiments validate the effectiveness of the proposed multimodal framework.
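The lip-synchronization step described in the abstract can be illustrated with a minimal sketch: given an MFCC frame extracted from input speech, the closest stored frame in the bimodal database is found and its associated lip shape (viseme) retrieved. The abstract does not specify the paper's actual database size, feature dimensionality, or search strategy, so everything below (the random stand-in data, the Euclidean nearest-neighbor match, the 13-coefficient frames) is a hypothetical illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical database: each row is a stored MFCC frame (13 coefficients,
# a common choice), paired with a lip-shape (viseme) label for that frame.
rng = np.random.default_rng(0)
db_features = rng.normal(size=(100, 13))    # stand-in for stored MFCC frames
db_visemes = rng.integers(0, 16, size=100)  # viseme label per stored frame

def match_lip_shape(mfcc_frame, features, visemes):
    """Return the viseme whose stored MFCC frame is nearest in Euclidean distance."""
    dists = np.linalg.norm(features - mfcc_frame, axis=1)
    return visemes[int(np.argmin(dists))]

# Usage: a slightly perturbed copy of a stored frame should match back to it.
query = db_features[42] + 0.01 * rng.normal(size=13)
print("matched viseme:", match_lip_shape(query, db_features, db_visemes))
```

In a real system the query frames would come from an MFCC front end (e.g., a speech toolkit) and the search would typically be smoothed across consecutive frames to avoid lip-shape jitter; the single-frame nearest-neighbor lookup here only shows the matching idea.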




Author information

Correspondence to Minghao Yang.

Additional information

This work is supported in part by the National Science Foundation of China (Nos. 60873160 and 90820303) and the China-Singapore Institute of Digital Media (CSIDM).

Electronic Supplementary Material

Below are the links to the electronic supplementary material.

(AVI 1.50 MB)

(AVI 667 kB)

(AVI 1.63 MB)


(AVI 2.74 MB)


About this article

Cite this article

Yang, M., Tao, J., Mu, K. et al. A multimodal approach of generating 3D human-like talking agent. J Multimodal User Interfaces 5, 61–68 (2012).



Keywords

  • Multimodal 3D talking agent
  • Lip movement
  • Head motion
  • MFCC
  • Facial expression
  • Gesture animation