Multimodal Information Processing for Affective Computing

Jianhua Tao


Affective computing is computing that relates to, arises from, or deliberately influences emotions [1]; it aims to give computers the human-like capabilities of observing, interpreting, and generating affective features. It is an important topic in human–computer interaction (HCI) because it helps improve the quality of communication between humans and computers.
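Many of the multimodal recognition systems surveyed in the references (e.g., the classifier-combination work of Ho et al. and Kittler et al.) combine modalities at the decision level: each modality produces a posterior over emotion labels, and the posteriors are merged with a weighted rule. The sketch below illustrates that idea only; the label set, weights, and scores are illustrative assumptions, not values taken from this chapter.

```python
# Minimal sketch of decision-level (late) fusion for multimodal emotion
# recognition. The emotion labels, weights, and example scores are
# hypothetical placeholders for illustration.

EMOTIONS = ["neutral", "happy", "angry", "sad"]

def fuse_posteriors(modality_scores, weights):
    """Weighted-sum fusion of per-modality posteriors.

    modality_scores: dict mapping modality name -> {emotion: probability}
    weights: dict mapping modality name -> float (should sum to 1.0)
    """
    fused = {e: 0.0 for e in EMOTIONS}
    for modality, scores in modality_scores.items():
        w = weights[modality]
        for emotion in EMOTIONS:
            fused[emotion] += w * scores.get(emotion, 0.0)
    return fused

def classify(modality_scores, weights):
    # Pick the emotion with the highest fused score.
    fused = fuse_posteriors(modality_scores, weights)
    return max(fused, key=fused.get)

# Example: the speech channel is ambiguous, but the facial channel
# strongly suggests "happy"; equal weights let the face dominate.
speech = {"neutral": 0.4, "happy": 0.3, "angry": 0.2, "sad": 0.1}
face = {"neutral": 0.1, "happy": 0.8, "angry": 0.05, "sad": 0.05}
label = classify({"speech": speech, "face": face},
                 {"speech": 0.5, "face": 0.5})
print(label)  # -> happy
```

A product rule or per-class learned weights are common alternatives to the weighted sum shown here; the choice depends on how correlated the modality classifiers' errors are.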


Keywords: Facial Expression · Emotion Recognition · Speech Synthesis · Emotional Speech · Facial Animation



This work was supported in part by the National Natural Science Foundation of China under Grant 60575032 and the 863 program under Grant 2006AA01Z138.


References

1. Picard, R. W. (1997). Affective Computing. MIT Press, Cambridge, MA.
2. James, W. (1884). What is emotion? Mind, 9(34), 188–205.
3. Oatley, K. (1987). Cognitive science and the understanding of emotions. Cogn. Emotion, 3(1), 209–216.
4. Bigun, E. S., Bigun, J., Duc, B., Fischer, S. (1997). Expert conciliation for multimodal person authentication systems using Bayesian statistics. In: Int. Conf. on Audio- and Video-Based Biometric Person Authentication (AVBPA), Crans-Montana, Switzerland, 291–300.
5. Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychol. Bull., 99(2), 143–165.
6. Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Commun., 40, 227–256.
7. Scherer, K. R., Banse, R., Wallbott, H. G. (2001). Emotion inferences from vocal expression correlate across languages and cultures. J. Cross-Cultural Psychol., 32(1), 76–92.
8. Johnstone, T., van Reekum, C. M., Scherer, K. R. (2001). Vocal correlates of appraisal processes. In: Scherer, K. R., Schorr, A., Johnstone, T. (eds) Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, New York and Oxford, 271–284.
9. Petrushin, V. A. (2000). Emotion recognition in speech signal: Experimental study, development and application. In: 6th Int. Conf. on Spoken Language Processing, ICSLP 2000, Beijing, 222–225.
10. Gobl, C., Chasaide, A. N. (2003). The role of voice quality in communicating emotion, mood and attitude. Speech Commun., 40(1–2), 189–212.
11. Tato, R., Santos, R., Kompe, R., Pardo, J. M. (2002). Emotional space improves emotion recognition. In: ICSLP 2002, Denver, CO, 2029–2032.
12. Dellaert, F., Polzin, T., Waibel, A. (1996). Recognizing emotion in speech. In: ICSLP 1996, Philadelphia, PA, 1970–1973.
13. Lee, C. M., Narayanan, S., Pieraccini, R. (2001). Recognition of negative emotion in the human speech signals. In: Workshop on Automatic Speech Recognition and Understanding.
14. Yu, F., Chang, E., Xu, Y. Q., Shum, H. Y. (2001). Emotion detection from speech to enrich multimedia content. In: 2nd IEEE Pacific-Rim Conf. on Multimedia, Beijing, China, 550–557.
15. Campbell, N. (2004). Perception of affect in speech – towards an automatic processing of paralinguistic information in spoken conversation. In: ICSLP 2004, Jeju, 881–884.
16. Cahn, J. E. (1990). The generation of affect in synthesized speech. J. Am. Voice I/O Soc., 8, 1–19.
17. Schroder, M. (2001). Emotional speech synthesis: A review. In: Eurospeech 2001, Aalborg, Denmark, 561–564.
18. Campbell, N. (2004). Synthesis units for conversational speech – using phrasal segments. Autumn Meet. Acoust. Soc. Jpn., 2005, 337–338.
19. Schroder, M., Breuer, S. (2004). XML representation languages as a way of interconnecting TTS modules. In: 8th Int. Conf. on Spoken Language Processing, ICSLP'04, Jeju, Korea.
20. Eide, E., Aaron, A., Bakis, R., Hamza, W., Picheny, M., Pitrelli, J. (2002). A corpus-based approach to <ahem/> expressive speech synthesis. In: IEEE Speech Synthesis Workshop, Santa Monica, 79–84.
21. Chuang, Z. J., Wu, C. H. (2002). Emotion recognition from textual input using an emotional semantic network. In: Int. Conf. on Spoken Language Processing, ICSLP 2002, Denver, 177–180.
22. Tao, J. (2003). Emotion control of Chinese speech synthesis in natural environment. In: Eurospeech 2003, Geneva.
23. Moriyama, T., Ozawa, S. (1999). Emotion recognition and synthesis system on speech. In: IEEE Int. Conf. on Multimedia Computing and Systems, Florence, Italy, 840–844.
24. Massaro, D. W., Beskow, J., Cohen, M. M., Fry, C. L., Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In: AVSP'99, Santa Cruz, CA, 133–138.
25. Darwin, C. (1872). The Expression of the Emotions in Man and Animals. University of Chicago Press, Chicago.
26. Etcoff, N. L., Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.
27. Ekman, P., Friesen, W. V. (1997). Manual for the Facial Action Coding System. Consulting Psychologists Press, Palo Alto, CA.
28. Yamamoto, E., Nakamura, S., Shikano, K. (1998). Lip movement synthesis from speech based on Hidden Markov Models. Speech Commun., 26, 105–115.
29. Tekalp, A. M., Ostermann, J. (2000). Face and 2-D mesh animation in MPEG-4. Signal Process.: Image Commun., 15, 387–421.
30. Lyons, M. J., Akamatsu, S., Kamachi, M., Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In: 3rd IEEE Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 200–205.
31. Calder, A. J., Burton, A. M., Miller, P., Young, A. W., Akamatsu, S. (2001). A principal component analysis of facial expression. Vis. Res., 41, 1179–1208.
32. Kobayashi, H., Hara, F. (1992). Recognition of six basic facial expressions and their strength by neural network. In: Int. Workshop on Robotics and Human Communications, New York, 381–386.
33. Bregler, C., Covell, M., Slaney, M. (1997). Video rewrite: Driving visual speech with audio. In: ACM SIGGRAPH'97, Los Angeles, CA, 353–360.
34. Cosatto, E., Potamianos, G., Graf, H. P. (2000). Audio-visual unit selection for the synthesis of photo-realistic talking-heads. In: IEEE Int. Conf. on Multimedia and Expo, New York, 619–622.
35. Ezzat, T., Poggio, T. (1998). MikeTalk: A talking facial display based on morphing visemes. In: Computer Animation Conf., Philadelphia, PA, 456–459.
36. Gutierrez-Osuna, R., Rundomin, J. L. (2005). Speech-driven facial animation with realistic dynamics. IEEE Trans. Multimedia, 7, 33–42.
37. Hong, P. Y., Wen, Z., Huang, T. S. (2002). Real-time speech-driven face animation with expressions using neural networks. IEEE Trans. Neural Netw., 13, 916–927.
38. Verma, A., Subramaniam, L. V., Rajput, N., Neti, C., Faruquie, T. A. (2004). Animating expressive faces across languages. IEEE Trans. Multimedia, 6, 791–800.
39. Collier, G. (1985). Emotional Expression. Lawrence Erlbaum Associates.
40. Argyle, M. (1988). Bodily Communication. Methuen & Co, New York, NY.
41. Siegman, A. W., Feldstein, S. (1985). Multichannel Integrations of Nonverbal Behavior. Lawrence Erlbaum Associates, Hillsdale, NJ.
42. Feldman, R. S., Philippot, P., Custrini, R. J. (1991). Social competence and nonverbal behavior. In: Feldman, R. S., Rimé, B. (eds) Fundamentals of Nonverbal Behavior. Cambridge University Press, Cambridge, 329–350.
43. Knapp, M. L., Hall, J. A. (2006). Nonverbal Communication in Human Interaction, 6th edn. Thomson Wadsworth, Belmont, CA.
44. Go, H. J., Kwak, K. C., Lee, D. J., Chun, M. G. (2003). Emotion recognition from facial image and speech signal. In: Int. Conf. Society of Instrument and Control Engineers, Fukui, Japan, 2890–2895.
45. Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C. M. et al. (2004). Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Int. Conf. on Multimodal Interfaces, State College, PA, 205–211.
46. Song, M., Bu, J., Chen, C., Li, N. (2004). Audio-visual based emotion recognition – A new approach. In: Int. Conf. on Computer Vision and Pattern Recognition, Washington, DC, 1020–1025.
47. Zeng, Z., Tu, J., Liu, M., Zhang, T., Rizzolo, N., Zhang, Z., Huang, T. S., Roth, D., Levinson, S. (2004). Bimodal HCI-related emotion recognition. In: Int. Conf. on Multimodal Interfaces, State College, PA, 137–143.
48. Zeng, Z., Tu, J., Pianfetti, B., Huang, T. S. Audio-visual affective expression recognition through multi-stream fused HMM. IEEE Trans. Multimedia, 10(4), 570–577.
49. Zeng, Z., Tu, J., Liu, M., Huang, T. S., Pianfetti, B., Roth, D., Levinson, S. (2007). Audio-visual affect recognition. IEEE Trans. Multimedia, 9(2), 424–428.
50. Wang, Y., Guan, L. (2005). Recognizing human emotion from audiovisual information. In: ICASSP, Philadelphia, PA, vol. II, 1125–1128.
51. Hoch, S., Althoff, F., McGlaun, G., Rigoll, G. (2005). Bimodal fusion of emotional data in an automotive environment. In: ICASSP, Philadelphia, PA, vol. II, 1085–1088.
52. Fragopanagos, F., Taylor, J. G. (2005). Emotion recognition in human–computer interaction. Neural Netw., 18, 389–405.
53. Pal, P., Iyer, A. N., Yantorno, R. E. (2006). Emotion detection from infant facial expressions and cries. In: Int. Conf. on Acoustics, Speech and Signal Processing, Philadelphia, PA, 2, 721–724.
54. Caridakis, G., Malatesta, L., Kessous, L., Amir, N., Paouzaiou, A., Karpouzis, K. (2006). Modeling naturalistic affective states via facial and vocal expression recognition. In: Int. Conf. on Multimodal Interfaces, Banff, Alberta, Canada, 146–154.
55. Karpouzis, K., Caridakis, G., Kessous, L., Amir, N., Raouzaiou, A., Malatesta, L., Kollias, S. (2007). Modeling naturalistic affective states via facial, vocal, and bodily expression recognition. In: Lecture Notes in Artificial Intelligence, vol. 4451, 91–112.
56. Chen, C. Y., Huang, Y. K., Cook, P. (2005). Visual/acoustic emotion recognition. In: Int. Conf. on Multimedia and Expo, Amsterdam, Netherlands, 1468–1471.
57. Picard, R. W. (2003). Affective computing: Challenges. Int. J. Hum. Comput. Studies, 59, 55–64.
58. Ortony, A., Clore, G. L., Collins, A. (1990). The Cognitive Structure of Emotions. Cambridge University Press, Cambridge.
59. Carberry, S., de Rosis, F. (2008). Introduction to the special issue of UMUAI on 'Affective Modeling and Adaptation'. User Modeling and User-Adapted Interaction, 18, 1–9.
60. Esposito, A., Balodis, G., Ferreira, A., Cristea, G. (2006). Cross-Modal Analysis of Verbal and Non-verbal Communication. Proposal for a COST Action.
61. Yin, P. R., Tao, J. H. (2005). Dynamic mapping method based speech driven face animation system. In: 1st Int. Conf. on Affective Computing and Intelligent Interaction (ACII 2005), Beijing, 755–763.
62. O'Brien, J. F., Bodenheimer, B., Brostow, G., Hodgins, J. (2000). Automatic joint parameter estimation from magnetic motion capture data. In: Graphics Interface 2000, Montreal, Canada, 53–60.
63. Aggarwal, J. K., Cai, Q. (1999). Human motion analysis: A review. Comput. Vision Image Understand., 73(3), 428–440.
64. Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Comput. Vision Image Understand., 73(1), 82–98.
65. Azarbayejani, A., Wren, C., Pentland, A. (1996). Real-time 3-D tracking of the human body. In: IMAGE'COM 96, Bordeaux, France.
66. Camurri, A., Poli, G. D., Leman, M., Volpe, G. (2001). A multi-layered conceptual framework for expressive gesture applications. In: Int. EU-TMR MOSART Workshop, Barcelona.
67. Cowie, R. (2001). Emotion recognition in human–computer interaction. IEEE Signal Process. Mag., 18(1), 32–80.
68. Brunelli, R., Falavigna, D. (1995). Person identification using multiple cues. IEEE Trans. Pattern Anal. Mach. Intell., 17(10), 955–966.
69. Kumar, A., Wong, D. C., Shen, H. C., Jain, A. K. (2003). Personal verification using palmprint and hand geometry biometric. In: 4th Int. Conf. on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, 668–678.
70. Frischholz, R. W., Dieckmann, U. (2000). BioID: A multimodal biometric identification system. IEEE Comput., 33(2), 64–68.
71. Jain, A. K., Ross, A. (2002). Learning user-specific parameters in a multibiometric system. In: Int. Conf. on Image Processing (ICIP), Rochester, New York, 57–60.
72. Ho, T. K., Hull, J. J., Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell., 16(1), 66–75.
73. Kittler, J., Hatef, M., Duin, R. P. W., Matas, J. (1998). On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell., 20(3), 226–239.
74. Dieckmann, U., Plankensteiner, P., Wagner, T. (1997). SESAM: A biometric person identification system using sensor fusion. Pattern Recognit. Lett., 18, 827–833.
75. Silva, D., Miyasato, T., Nakatsu, R. (1997). Facial emotion recognition using multi-modal information. In: Int. Conf. on Information, Communications and Signal Processing, Singapore, 397–401.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

Chinese Academy of Sciences, Beijing, China
