We present a survey of the state of the art in voice-based speaker identification research. We describe the general framework of a text-independent speaker verification system and, as an example, SRI’s voice-based speaker recognition system. This system ranked among the best-performing systems in the NIST text-independent speaker recognition evaluations of 2004 and 2005. It consists of six subsystems and a neural network combiner. The subsystems fall into two groups: acoustics-based (low-level) and stylistic (high-level). Acoustic subsystems extract short-term spectral features that implicitly capture the anatomy of the vocal apparatus, such as the shape of the vocal tract and its variations. These features are known to be sensitive to microphone and channel variations, and various techniques are used to compensate for them. High-level subsystems, on the other hand, capture the stylistic aspects of a person’s voice, such as the speaking rate for particular words, rhythmic and intonation patterns, and idiosyncratic word usage. These features represent behavioral aspects of the person’s identity and are shown to be complementary to spectral acoustic features. By combining all information sources, we achieve an equal error rate of around 3% on the NIST speaker recognition evaluation with two minutes of enrollment data and two minutes of test data.
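The equal error rate reported in the abstract is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how such a figure can be estimated from verification trial scores. It assumes a simple threshold sweep over hypothetical score lists; the function name `equal_error_rate` and the example scores are illustrative assumptions, not the authors' scoring procedure or NIST's official tooling.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Illustrative sketch only; not an official scoring tool.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: target (same-speaker) trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, chosen only to exercise the function.
targets = [2.0, 1.5, 0.4, 1.0]
impostors = [0.1, 0.5, -1.0, -0.3]
print(equal_error_rate(targets, impostors))
```

With only a handful of trials the two error rates may never be exactly equal, which is why the sketch averages them at the closest threshold. Evaluations with many trials typically interpolate along the error curve instead.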
Copyright information
© 2008 Springer-Verlag London Limited
About this chapter
Cite this chapter
Kajarekar, S.S., Ferrer, L., Stolcke, A., Shriberg, E. (2008). Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features. In: Ratha, N.K., Govindaraju, V. (eds) Advances in Biometrics. Springer, London. https://doi.org/10.1007/978-1-84628-921-7_10
DOI: https://doi.org/10.1007/978-1-84628-921-7_10
Publisher Name: Springer, London
Print ISBN: 978-1-84628-920-0
Online ISBN: 978-1-84628-921-7
eBook Packages: Computer Science (R0)