Automatic Speech Processing by Inference in Generative Models

Roweis, Sam T.

doi:10.1007/0-387-22794-6_8



Summary

In this chapter, we have explored the use of inference in probabilistic generative models as a powerful signal processing tool for speech and audio. The basic paradigm was to design a simple model of the observed data in which the quantities we ultimately want to compute appear as hidden (latent) variables. By performing probabilistic inference in such a model, we automatically estimate these hidden quantities and thereby carry out the desired computation. In a sense, the rules of probability derive for us the optimal signal processing algorithm for producing the desired outputs from the given inputs, under the model assumptions. Crucially, even though the generative model may be quite simple and may not capture all of the variability present in the data, the results of inference can still be extremely informative.
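The idea that probability "derives" the signal processing algorithm can be seen in a toy denoising case. The following is a minimal sketch under stated assumptions, not the chapter's model: each clean sample x (the hidden variable) is assumed to be drawn from a scalar mixture of Gaussians whose weights, means and variances we choose, and we observe y = x + n with Gaussian noise of known variance. Computing the posterior mean E[x | y] then yields a nonlinear, Wiener-like shrinkage estimator that was never designed by hand. The function name `mog_posterior_mean` and its parameters are hypothetical.

```python
import numpy as np

def mog_posterior_mean(y, weights, means, variances, noise_var):
    """E[x | y] for a scalar mixture-of-Gaussians prior on the clean signal x
    and the observation model y = x + n, n ~ N(0, noise_var).
    Illustrative toy model (an assumption), not the chapter's exact model."""
    y = np.asarray(y, dtype=float)[..., None]  # add a component axis for broadcasting
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    s2 = np.asarray(variances, dtype=float)
    # Marginal likelihood of y under component k: N(y; mu_k, s2_k + noise_var)
    tot = s2 + noise_var
    log_lik = -0.5 * (np.log(2 * np.pi * tot) + (y - mu) ** 2 / tot)
    # Component responsibilities p(k | y), computed stably in the log domain
    log_post = np.log(w) + log_lik
    log_post -= log_post.max(axis=-1, keepdims=True)
    resp = np.exp(log_post)
    resp /= resp.sum(axis=-1, keepdims=True)
    # Per-component posterior mean of x: Wiener-style shrinkage of y toward mu_k
    gain = s2 / tot
    comp_mean = mu + gain * (y - mu)
    # Average the per-component estimates, weighted by p(k | y)
    return (resp * comp_mean).sum(axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed toy prior: mostly quiet samples, occasionally loud ones
    w, mu, var, noise_var = [0.7, 0.3], [0.0, 0.0], [0.01, 4.0], 0.5
    k = rng.choice(2, size=20000, p=w)
    x = rng.normal(np.take(mu, k), np.sqrt(np.take(var, k)))
    y = x + rng.normal(0.0, np.sqrt(noise_var), size=x.shape)
    x_hat = mog_posterior_mean(y, w, mu, var, noise_var)
    print("noisy MSE:", np.mean((y - x) ** 2))
    print("denoised MSE:", np.mean((x_hat - x) ** 2))
```

With a single zero-mean component this reduces to the familiar Wiener gain s2 / (s2 + noise_var); the mixture simply lets the estimator shrink quiet samples hard and loud samples lightly.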

We gave several examples showing how inference in very simple generative models can be used to perform surprisingly complex speech processing tasks, including denoising, source separation, pitch tracking, timescale modification and estimation of articulatory movements from audio.
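Pitch tracking fits the same pattern if pitch is treated as a hidden discrete state. As a simplified illustration (an assumption, not the chapter's model), suppose some front end supplies per-frame log-likelihoods over K candidate pitch bins, and suppose the prior favours small pitch changes between frames. MAP inference in the resulting hidden Markov model, here via the Viterbi algorithm, then returns a smooth pitch contour even when individual frames are ambiguous. The function `viterbi_track`, its `log_obs` input and the linear `jump_penalty` prior are hypothetical.

```python
import numpy as np

def viterbi_track(log_obs, jump_penalty):
    """MAP hidden-state sequence for an HMM whose state is a discrete pitch bin.

    log_obs: (T, K) array, log p(frame_t | pitch bin k) (assumed given).
    jump_penalty: log-domain cost per bin of pitch change between frames.
    """
    log_obs = np.asarray(log_obs, dtype=float)
    T, K = log_obs.shape
    bins = np.arange(K)
    # Transition prior: larger pitch jumps are exponentially less likely
    log_trans = -float(jump_penalty) * np.abs(bins[:, None] - bins[None, :])
    # Normalize each row so it is a proper distribution over the next bin
    log_trans -= np.log(np.exp(log_trans).sum(axis=1, keepdims=True))
    score = log_obs[0].copy()          # uniform initial-state prior
    back = np.zeros((T, K), dtype=int)  # best predecessor for each (t, bin)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    # Trace back the highest-scoring path
    path = np.empty(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

With `jump_penalty=0` the prior is uniform and the result is just the frame-wise argmax; a positive penalty lets the model override an isolated frame whose evidence points to an implausible jump.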


Author information

Authors and Affiliations

Department of Computer Science, University of Toronto, Canada
Sam T. Roweis

Editor information

Editors and Affiliations

East Bay Institute for Research and Education, USA
Pierre Divenyi

About this chapter

Cite this chapter

Roweis, S.T. (2005). Automatic Speech Processing by Inference in Generative Models. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_8

DOI: https://doi.org/10.1007/0-387-22794-6_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)