Modelling Graph-Based Observation Spaces for Segment-Based Speech Recognition

Glass, James R.

doi:10.1007/978-1-4419-9017-4_8

Part of the book series: The IMA Volumes in Mathematics and its Applications (IMA, volume 138)

Abstract

Most speech recognizers use an observation space which is based on a temporal sequence of spectral "frames." There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by a fixed-dimensional "feature." In such feature-based recognizers the observation space takes the form of a temporal graph of feature vectors, so that any single segmentation of an utterance will use only a subset of all possible feature vectors. In this work we describe a maximum a posteriori decoding strategy for feature-based recognizers and derive two normalization criteria that are useful for a segment-based Viterbi or A* search. We show how a segment-based recognizer is able to obtain good results on the tasks of phonetic and word recognition.
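Because each segmentation accounts for a different subset of the feature vectors in the graph, raw likelihoods of competing paths are not directly comparable. The sketch below illustrates one way a normalized segment-based Viterbi search can work. It assumes an "anti-phone" style normalization: every feature is scored, and features not on the chosen path are assigned to a single hypothetical "not-a-segment" model. Since the product of that model's likelihoods over all features is identical for every path, each on-path segment then contributes log p(x|a) - log p(x|anti). This is an illustrative assumption and is not claimed to be either of the chapter's two criteria. The function names, the one-dimensional Gaussian models, and the absence of a language model are all simplifications introduced for this example.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical segment representation: (start_boundary, end_boundary, feature).
# Boundaries are integers numbered in time order, and start < end for every segment.
# A real recognizer would use fixed-dimensional feature vectors, not a single float.
Segment = Tuple[int, int, float]


def gauss_logpdf(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian (a stand-in for a real acoustic model)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_segment_graph(
    segments: List[Segment],
    num_boundaries: int,
    models: Dict[str, Tuple[float, float]],
    anti_model: Tuple[float, float],
) -> Tuple[float, List[Tuple[int, int, str]]]:
    """Best path from boundary 0 to the last boundary through the segment graph.

    Each segment is scored with the ASSUMED anti-phone ratio
    log p(x | label) - log p(x | anti), so paths with different numbers of
    segments (and hence different feature subsets) are compared fairly.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_boundaries
    back: List[Optional[Tuple[int, str]]] = [None] * num_boundaries
    best[0] = 0.0

    # Index segments by their end boundary.
    by_end: Dict[int, List[Segment]] = {}
    for seg in segments:
        by_end.setdefault(seg[1], []).append(seg)

    # Boundaries are visited in time order, so best[start] is final before use.
    for b in range(1, num_boundaries):
        for start, end, x in by_end.get(b, []):
            if best[start] == neg_inf:
                continue
            anti_ll = gauss_logpdf(x, *anti_model)
            # Pick the best label for this segment (no language model here).
            label, ll = max(
                ((lab, gauss_logpdf(x, m, v)) for lab, (m, v) in models.items()),
                key=lambda t: t[1],
            )
            score = best[start] + (ll - anti_ll)
            if score > best[b]:
                best[b] = score
                back[b] = (start, label)

    final = num_boundaries - 1
    if best[final] == neg_inf:
        raise ValueError("no complete path through the segment graph")

    # Trace back the winning segmentation and labels.
    path: List[Tuple[int, int, str]] = []
    b = final
    while b != 0:
        start, label = back[b]
        path.append((start, b, label))
        b = start
    path.reverse()
    return best[final], path


if __name__ == "__main__":
    # Toy graph: two short segments competing with one long segment.
    segs = [(0, 1, 0.0), (1, 2, 5.0), (0, 2, 2.5)]
    models = {"a": (0.0, 1.0), "b": (5.0, 1.0)}
    score, path = viterbi_segment_graph(segs, 3, models, anti_model=(2.5, 10.0))
    print(path)  # path: [(0, 1, 'a'), (1, 2, 'b')]
```

In this toy graph the long segment (0, 2) receives a negative normalized score because its feature fits the broad anti model better than either class model, so the two-segment path wins. Replacing the Viterbi loop with a priority-queue search over the same normalized scores would give an A*-style variant.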

Author information

Authors and Affiliations

MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA, 02139, USA
James R. Glass

Editor information

Editors and Affiliations

Dept. of Cognitive and Linguistic Studies, Brown University, Providence, RI, 02912, USA
Mark Johnson
Dept. of ECE and Dept. of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
Sanjeev P. Khudanpur
Dept. of Electrical Engineering, University of Washington, Seattle, WA, 98195, USA
Mari Ostendorf
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Roni Rosenfeld

About this paper

Cite this paper

Glass, J.R. (2004). Modelling Graph-Based Observation Spaces for Segment-Based Speech Recognition. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_8

Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-6484-2
Online ISBN: 978-1-4419-9017-4
eBook Packages: Springer Book Archive