Prosodic Features for Speaker Recognition

  • Leena Mary


This chapter discusses the effectiveness of syllable-based prosodic features for speaker recognition. The term prosody refers to a collection of characteristics such as intonation, stress and timing, expressed primarily through variations in pitch, energy and duration at various levels of speech. Prosody reflects the learned or acquired speaking habits of a person and hence contributes to speaker recognition. Because prosodic features are less affected by channel mismatch and noise, they are particularly well suited for speaker forensics, a field that demands accurate identification of suspects with as few mitigating conditions as possible. In this chapter, the author describes a method for extracting prosodic features directly from the speech signal. In this method, speech is segmented into syllable-like regions using vowel onset points (VOPs). The locations of the VOPs serve as reference points for the extraction and representation of prosodic features. The effectiveness of the prosodic features for speaker recognition is demonstrated on the extended task of the NIST Speaker Recognition Evaluation 2003. Combining evidence from spectral features with that of the proposed prosodic features helps to improve overall speaker recognition accuracy.
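As a rough illustration of how VOP locations can anchor prosodic measurements, the sketch below treats the span between two successive VOPs as one syllable-like region and summarises its pitch, energy and duration. It is a minimal sketch under stated assumptions, not the chapter's actual feature set: the function name `prosodic_features`, the `SyllableFeatures` fields, the 10 ms frame shift, and the convention that an F0 value of 0 marks an unvoiced frame are all hypothetical choices made for this example. VOP detection and pitch tracking are assumed to have been done beforehand.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SyllableFeatures:
    """Hypothetical per-region summary; field names are illustrative only."""
    duration_s: float    # time between two successive VOPs
    mean_f0_hz: float    # mean pitch over voiced frames in the region
    f0_range_hz: float   # max minus min pitch over voiced frames
    mean_energy: float   # mean frame energy over the whole region


def prosodic_features(
    f0: Sequence[float],
    energy: Sequence[float],
    vops: Sequence[int],
    frame_shift_s: float = 0.01,  # assumed 10 ms frame shift
) -> List[SyllableFeatures]:
    """Summarise pitch, energy and duration between successive VOPs.

    f0 and energy are frame-level contours of equal length; f0 == 0 is
    assumed to mean "unvoiced". vops holds frame indices in increasing order.
    """
    if len(f0) != len(energy):
        raise ValueError("f0 and energy must have the same number of frames")

    features: List[SyllableFeatures] = []
    # Pair each VOP with the next one to delimit a syllable-like region.
    for start, end in zip(vops, vops[1:]):
        region_energy = energy[start:end]
        voiced = [value for value in f0[start:end] if value > 0]
        if not region_energy or not voiced:
            continue  # skip empty or fully unvoiced regions
        features.append(
            SyllableFeatures(
                duration_s=(end - start) * frame_shift_s,
                mean_f0_hz=sum(voiced) / len(voiced),
                f0_range_hz=max(voiced) - min(voiced),
                mean_energy=sum(region_energy) / len(region_energy),
            )
        )
    return features


# Toy contours: two voiced stretches, each starting at a VOP.
toy_f0 = [0, 100, 110, 120, 0, 0, 200, 210, 0, 0]
toy_energy = [1.0] * 10
toy_vops = [1, 6, 10]  # last index closes the final region
toy_features = prosodic_features(toy_f0, toy_energy, toy_vops)
```

Each resulting record could then be turned into a feature vector for a speaker model. Which pitch, energy and duration parameters are actually used, and how they are normalised, is left open here.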





The author would like to thank Prof. B. Yegnanarayana and the members of the Speech and Vision Laboratory of IIT Madras, India, during 2002–2006, for their support in carrying out the study described in this chapter.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Rajiv Gandhi Institute of Technology, Kottayam, India
