Automatic Speech Processing by Inference in Generative Models

  • Chapter
Speech Separation by Humans and Machines

Summary

In this chapter, we have explored the use of inference in probabilistic generative models as a powerful signal processing tool for speech and audio. The basic paradigm was to design a simple model for the observed data in which the key quantities that we would eventually like to compute appear as hidden (latent) variables. By executing probabilistic inference in such models, we automatically estimate the hidden quantities and thus perform our desired computation. In a sense, the rules of probability derive for us, automatically, the optimal signal processing algorithm for our desired outputs given our inputs under the model assumptions. Crucially, even though the generative model may be quite simple and may not capture all of the variability present in the data, the results of inference can still be extremely informative.

We gave several examples showing how inference in very simple generative models can be used to perform surprisingly complex speech processing tasks including denoising, source separation, pitch tracking, timescale modification and estimation of articulatory movements from audio.
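
To make the paradigm concrete, the following is a minimal sketch of denoising by inference in a toy generative model. It is not one of the chapter's models: the two-class Gaussian mixture, the additive Gaussian noise, and all names and parameter values (`posterior_denoise`, `priors`, `signal_vars`, `noise_var`) are assumptions chosen for illustration. The clean sample and its class are hidden variables, and applying Bayes' rule to the noisy observation yields the estimator directly, without designing a filter by hand.

```python
import math
import random


def gaussian_pdf(y, variance):
    """Density of a zero-mean Gaussian with the given variance, evaluated at y."""
    return math.exp(-0.5 * y * y / variance) / math.sqrt(2.0 * math.pi * variance)


def posterior_denoise(y, priors, signal_vars, noise_var):
    """Infer the hidden clean sample x from one noisy observation y = x + n.

    Hypothetical toy model (an assumption, not the chapter's model):
      z ~ Categorical(priors)          hidden class, e.g. quiet vs. loud
      x | z ~ N(0, signal_vars[z])     hidden clean sample
      y | x ~ N(x, noise_var)          observed noisy sample
    Returns (posterior over z, posterior mean of x).
    """
    # Marginalising x gives y | z ~ N(0, signal_vars[z] + noise_var).
    joint = [p * gaussian_pdf(y, v + noise_var) for p, v in zip(priors, signal_vars)]
    evidence = sum(joint)
    post_z = [j / evidence for j in joint]
    # Given z, E[x | y, z] = gain_z * y with gain_z = v_z / (v_z + noise_var).
    x_hat = sum(pz * (v / (v + noise_var)) * y for pz, v in zip(post_z, signal_vars))
    return post_z, x_hat


def demo(num_samples=5000, seed=0):
    """Return (MSE of raw observations, MSE of posterior means) on sampled data."""
    rng = random.Random(seed)
    priors, signal_vars, noise_var = [0.5, 0.5], [0.01, 4.0], 1.0
    err_raw = 0.0
    err_inferred = 0.0
    for _ in range(num_samples):
        # Sample from the generative model: class, clean sample, noisy sample.
        z = 0 if rng.random() < priors[0] else 1
        x = rng.gauss(0.0, math.sqrt(signal_vars[z]))
        y = x + rng.gauss(0.0, math.sqrt(noise_var))
        _, x_hat = posterior_denoise(y, priors, signal_vars, noise_var)
        err_raw += (y - x) ** 2
        err_inferred += (x_hat - x) ** 2
    return err_raw / num_samples, err_inferred / num_samples


if __name__ == "__main__":
    mse_raw, mse_inferred = demo()
    print("MSE of noisy observation:", mse_raw)
    print("MSE of posterior mean:   ", mse_inferred)
```

Under these assumptions the posterior mean acts as a class-dependent gain: observations that are most plausibly from the low-variance class are shrunk almost to zero, while those from the high-variance class are attenuated only slightly. The model is far simpler than real speech, yet the inferred quantity is already a useful denoised estimate.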




Copyright information

© 2005 Springer Science + Business Media, Inc.

About this chapter

Cite this chapter

Roweis, S.T. (2005). Automatic Speech Processing by Inference in Generative Models. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_8

  • DOI: https://doi.org/10.1007/0-387-22794-6_8

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-8001-2

  • Online ISBN: 978-0-387-22794-8

  • eBook Packages: Engineering, Engineering (R0)
