Summary
In this chapter, we have explored inference in probabilistic generative models as a powerful signal processing tool for speech and audio. The basic paradigm was to design a simple model for the observed data in which the key quantities we would eventually like to compute appear as hidden (latent) variables. By executing probabilistic inference in such models, we automatically estimate the hidden quantities and thus perform our desired computation. In a sense, the rules of probability derive for us, automatically, the optimal signal processing algorithm for our desired outputs given our inputs under the model assumptions. Crucially, even though the generative model may be quite simple and may not capture all of the variability present in the data, the results of inference can still be extremely informative.
We gave several examples showing how inference in very simple generative models can be used to perform surprisingly complex speech processing tasks including denoising, source separation, pitch tracking, timescale modification and estimation of articulatory movements from audio.
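As a toy illustration of this paradigm (not one of the chapter's models), consider a hypothetical scalar model in which a hidden clean sample x is drawn from a zero-mean Gaussian and the observation is y = x + n with independent Gaussian noise n. The names `posterior_mean` and `simulate`, and the variances used, are assumptions chosen for the sketch. Applying Bayes' rule to this model yields the posterior mean E[x | y] = y * signal_var / (signal_var + noise_var), which is the familiar Wiener gain: the denoising rule falls out of inference rather than being designed by hand.

```python
import random


def posterior_mean(y, signal_var, noise_var):
    """E[x | y] for the assumed model x ~ N(0, signal_var), y = x + n,
    with n ~ N(0, noise_var) independent of x."""
    gain = signal_var / (signal_var + noise_var)  # Wiener gain
    return gain * y


def simulate(n_samples=20000, signal_var=4.0, noise_var=1.0, seed=0):
    """Sample from the toy model and compare the mean squared error of the
    raw observation with that of the inferred hidden value."""
    rng = random.Random(seed)
    err_raw = 0.0
    err_post = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, signal_var ** 0.5)       # hidden clean sample
        y = x + rng.gauss(0.0, noise_var ** 0.5)    # noisy observation
        x_hat = posterior_mean(y, signal_var, noise_var)
        err_raw += (y - x) ** 2
        err_post += (x_hat - x) ** 2
    return err_raw / n_samples, err_post / n_samples


mse_raw, mse_post = simulate()
print("MSE using y directly:      ", round(mse_raw, 3))
print("MSE using posterior mean:  ", round(mse_post, 3))
```

Under these assumptions the posterior-mean estimate should show a lower error than the raw observation, even though the model is far simpler than real speech. The same pattern, with richer latent structure, underlies the tasks listed above.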
Copyright information
© 2005 Springer Science + Business Media, Inc.
About this chapter
Cite this chapter
Roweis, S.T. (2005). Automatic Speech Processing by Inference in Generative Models. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_8
DOI: https://doi.org/10.1007/0-387-22794-6_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)