Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features

We present a survey of the state of the art in voice-based speaker identification research. We describe the general framework of a text-independent speaker verification system and, as an example, SRI's voice-based speaker recognition system. This system ranked among the best-performing systems in the NIST text-independent speaker recognition evaluations in 2004 and 2005. It consists of six subsystems and a neural network combiner. The subsystems fall into two groups: acoustics-based (low level) and stylistic (high level). Acoustic subsystems extract short-term spectral features that implicitly capture the anatomy of the vocal apparatus, such as the shape of the vocal tract and its variations. These features are known to be sensitive to microphone and channel variations, and various techniques are used to compensate for these variations. High-level subsystems, on the other hand, capture the stylistic aspects of a person's voice, such as the speaking rate for particular words, rhythmic and intonation patterns, and idiosyncratic word usage. These features represent behavioral aspects of the person's identity and are shown to be complementary to spectral acoustic features. By combining all information sources we achieve an equal error rate of around 3% on the NIST speaker recognition evaluation, using two minutes of enrollment data and two minutes of test data.
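The equal error rate quoted above is the operating point at which the miss rate (target speakers rejected) equals the false-alarm rate (impostors accepted). The following is a minimal sketch of how such a figure can be estimated from verification scores; it is not the chapter's evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "more likely the target speaker", and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the equal error rate (EER) from two sets of trial scores.

    Assumes a higher score means the system is more confident that the
    test speech comes from the enrolled speaker. Returns (eer, threshold).
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([target, impostor]))

    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of impostor trials at or above the threshold.
    false_alarm = np.array([(impostor >= t).mean() for t in thresholds])

    # With finitely many trials the two curves rarely cross exactly, so take
    # the threshold where they are closest and average the two rates.
    i = int(np.argmin(np.abs(miss - false_alarm)))
    return (miss[i] + false_alarm[i]) / 2.0, thresholds[i]

# Hypothetical scores for four target trials and four impostor trials.
eer, threshold = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

In this toy case one of four target trials is missed and one of four impostor trials is accepted at the chosen threshold, so the estimate is 25%. An equal error rate of around 3% corresponds to roughly three errors of each kind per hundred trials.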

Copyright information

© 2008 Springer-Verlag London Limited

About this chapter

Cite this chapter

Kajarekar, S.S., Ferrer, L., Stolcke, A., Shriberg, E. (2008). Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features. In: Ratha, N.K., Govindaraju, V. (eds) Advances in Biometrics. Springer, London. https://doi.org/10.1007/978-1-84628-921-7_10

  • DOI: https://doi.org/10.1007/978-1-84628-921-7_10

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84628-920-0

  • Online ISBN: 978-1-84628-921-7

  • eBook Packages: Computer Science, Computer Science (R0)
