Modelling Graph-Based Observation Spaces for Segment-Based Speech Recognition

  • Conference paper
Mathematical Foundations of Speech and Language Processing

Part of the book series: The IMA Volumes in Mathematics and its Applications (IMA, volume 138)

Abstract

Most speech recognizers use an observation space based on a temporal sequence of spectral “frames.” Another class of recognizer further processes these frames to produce a segment-based network and represents each segment by a fixed-dimensional “feature.” In such feature-based recognizers the observation space takes the form of a temporal graph of feature vectors, so any single segmentation of an utterance uses only a subset of all possible feature vectors. In this work we describe a maximum a posteriori decoding strategy for feature-based recognizers and derive two normalization criteria useful for a segment-based Viterbi or A* search. We show how a segment-based recognizer is able to obtain good results on the tasks of phonetic and word recognition.
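To make the graph-based observation space concrete, the following is a minimal sketch of a Viterbi-style search over a segment graph. It is an illustration under stated assumptions, not the chapter's method: the function name `best_path`, the tuple layout of a segment, and the example scores are all hypothetical, and each per-segment score is assumed to be already normalized so that paths using different numbers of segments are comparable (the two normalization criteria derived in the chapter are not reproduced here).

```python
from math import inf

def best_path(num_boundaries, segments):
    """Viterbi-style search over a segment graph (illustrative sketch).

    num_boundaries: number of candidate boundaries, indexed 0..num_boundaries-1.
    segments: list of (start, end, scores) with start < end, where `scores`
        maps a label to an ASSUMED already-normalized log score.
    Returns (total_score, [(start, end, label), ...]) for the best path
    from boundary 0 to the last boundary.
    """
    # best[b] holds (score of best partial path ending at b, backpointer).
    best = [(-inf, None)] * num_boundaries
    best[0] = (0.0, None)

    # Visit segments by end boundary; since start < end, every path into
    # `start` has been finalized before a segment leaving it is extended.
    for start, end, scores in sorted(segments, key=lambda s: s[1]):
        prev_score = best[start][0]
        if prev_score == -inf:
            continue  # `start` is not reachable from boundary 0
        label, seg_score = max(scores.items(), key=lambda kv: kv[1])
        candidate = prev_score + seg_score
        if candidate > best[end][0]:
            best[end] = (candidate, (start, label))

    # Trace backpointers from the final boundary to recover the segmentation.
    total = best[-1][0]
    if total == -inf:
        return total, []
    path, b = [], num_boundaries - 1
    while best[b][1] is not None:
        start, label = best[b][1]
        path.append((start, b, label))
        b = start
    path.reverse()
    return total, path

# Hypothetical graph: 4 boundaries, 5 overlapping candidate segments.
example_segments = [
    (0, 1, {"a": -1.0, "b": -2.0}),
    (1, 3, {"b": -0.5}),
    (0, 2, {"a": -0.25}),
    (2, 3, {"c": -0.5}),
    (1, 2, {"c": -3.0}),
]
example_result = best_path(4, example_segments)
```

In this toy graph the winning path uses only two of the five candidate segments, which is the situation the abstract describes: any single segmentation accounts for a subset of the feature vectors, so raw scores of competing paths are not directly comparable without some normalization.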

© 2004 Springer Science+Business Media New York

About this paper

Cite this paper

Glass, J.R. (2004). Modelling Graph-Based Observation Spaces for Segment-Based Speech Recognition. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_8

  • DOI: https://doi.org/10.1007/978-1-4419-9017-4_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4612-6484-2

  • Online ISBN: 978-1-4419-9017-4

  • eBook Packages: Springer Book Archive
