Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features

Kajarekar, Sachin S.; Ferrer, Luciana; Stolcke, Andreas; Shriberg, Elizabeth

doi:10.1007/978-1-84628-921-7_10

We present a survey of the state of the art in voice-based speaker identification research. We describe the general framework of a text-independent speaker verification system and, as an example, SRI’s voice-based speaker recognition system. This system was ranked among the best-performing systems in the NIST text-independent speaker recognition evaluations of 2004 and 2005. It consists of six subsystems and a neural network combiner. The subsystems fall into two groups: acoustics-based (low-level) and stylistic (high-level). Acoustic subsystems extract short-term spectral features that implicitly capture the anatomy of the vocal apparatus, such as the shape of the vocal tract and its variations. These features are known to be sensitive to microphone and channel variations, and various techniques are used to compensate for these variations. High-level subsystems, on the other hand, capture the stylistic aspects of a person’s voice, such as the speaking rate for particular words, rhythmic and intonation patterns, and idiosyncratic word usage. These features represent behavioral aspects of the person’s identity and are shown to be complementary to spectral acoustic features. By combining all information sources, we achieve an equal error rate of around 3% on the NIST speaker recognition evaluation with two minutes of enrollment data and two minutes of test data.
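
For readers unfamiliar with the metric quoted above, the equal error rate (EER) is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how it can be estimated from verification scores. It is not the evaluation code used by the authors or by NIST; the function name `equal_error_rate` and the toy scores are assumptions made purely for illustration.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER from two lists of verification scores.

    target_scores:   scores of trials where the claimed speaker is correct
    impostor_scores: scores of trials where the claimed speaker is wrong
    A trial is accepted when its score is >= the decision threshold.
    Returns (eer, threshold) at the candidate threshold where the
    false-accept and false-reject rates are closest.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False rejection: a true-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the mean of the two rates at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from any evaluation).
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, impostors)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With only a handful of trials the two error rates rarely cross exactly, so this sketch reports their average at the closest threshold; evaluations with many trials typically interpolate along the error curve instead.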


Author information

Authors and Affiliations

SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
Sachin S. Kajarekar & Andreas Stolcke
Department of Electrical Engineering, Stanford University, Stanford, CA, USA
Luciana Ferrer
International Computer Science Institute, Berkeley, CA, USA
Andreas Stolcke & Elizabeth Shriberg

Editor information

Editors and Affiliations

IBM Thomas J. Watson Research Center, Hawthorne, NY, USA
Nalini K. Ratha BTech, MTech, PhD
Department of Computer Science and Engineering, University of Buffalo, NY, USA
Venu Govindaraju BTech, MS, PhD

About this chapter

Cite this chapter

Kajarekar, S.S., Ferrer, L., Stolcke, A., Shriberg, E. (2008). Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features. In: Ratha, N.K., Govindaraju, V. (eds) Advances in Biometrics. Springer, London. https://doi.org/10.1007/978-1-84628-921-7_10

DOI: https://doi.org/10.1007/978-1-84628-921-7_10
Publisher Name: Springer, London
Print ISBN: 978-1-84628-920-0
Online ISBN: 978-1-84628-921-7
eBook Packages: Computer Science, Computer Science (R0)