Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence

  • Samer Al Moubayed
  • Jonas Beskow
  • Björn Granström
  • David House
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6456)


In this chapter, we investigate the effects of facial prominence cues, in the form of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted in which acoustically degraded speech is presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that visual prominence gestures synchronized with auditory prominence significantly increase speech intelligibility compared to when these gestures are added to the speech at random.

We also present a study examining how the behavior of the talking head is perceived when gestures are added at pitch movements. Eye-gaze tracking and questionnaires were used with 10 moderately hearing-impaired subjects. The gaze data show that, when gestures are coupled with pitch movements, users look at the face in a way similar to how they look at a natural face, as opposed to when the face carries no gestures. The questionnaire results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.


Keywords: visual prosody, prominence, stress, multimodal, gaze, head-nod, eyebrows, visual synthesis, talking heads





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Samer Al Moubayed (1)
  • Jonas Beskow (1)
  • Björn Granström (1)
  • David House (1)
  1. Center for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden
