Abstract
Most speech recognizers use an observation space based on a temporal sequence of spectral “frames.” Another class of recognizer further processes these frames to produce a segment-based network and represents each segment by a fixed-dimensional “feature.” In such feature-based recognizers the observation space takes the form of a temporal graph of feature vectors, so that any single segmentation of an utterance uses only a subset of all possible feature vectors. In this work we describe a maximum a posteriori decoding strategy for feature-based recognizers and derive two normalization criteria useful for a segment-based Viterbi or A* search. We show how a segment-based recognizer is able to obtain good results on the tasks of phonetic and word recognition.
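To make the idea of searching a temporal graph of segments concrete, the following is a minimal sketch of a Viterbi-style best-path search over a small segment graph. It is an illustration under stated assumptions, not the chapter's decoder: the function name `best_segmentation`, the tuple layout `(t_start, t_end, label, log_score)`, and the toy scores are all hypothetical. In particular, the per-segment log scores are assumed to be already normalized so that paths with different numbers of segments can be compared; the sketch does not implement the normalization criteria mentioned in the abstract.

```python
def best_segmentation(segments, start, end):
    """Find the highest-scoring path through a segment graph.

    segments: iterable of (t_start, t_end, label, log_score) tuples
              with t_start < t_end (hypothetical layout).
    Scores are ASSUMED to be normalized so that they can be summed
    and compared across paths of different lengths.
    Returns (total_score, [segments on the best path]) or None.
    """
    # best[t] = (best score of any path from `start` to time t, last segment)
    best = {start: (0.0, None)}

    # Because every segment satisfies t_start < t_end, sorting by start
    # time guarantees best[t_start] is final before it is extended.
    for seg in sorted(segments, key=lambda s: s[0]):
        t_start, t_end, _label, log_score = seg
        if t_start not in best:
            continue  # this boundary is not reachable from `start`
        candidate = best[t_start][0] + log_score
        if t_end not in best or candidate > best[t_end][0]:
            best[t_end] = (candidate, seg)

    if end not in best:
        return None

    # Follow back-pointers from the end boundary to recover the path.
    path = []
    t = end
    while t != start:
        seg = best[t][1]
        path.append(seg)
        t = seg[0]
    path.reverse()
    return best[end][0], path


# Toy graph over boundaries 0..3; each path uses a different subset of segments.
toy_segments = [
    (0, 1, "a", -1.0),
    (1, 2, "b", -1.0),
    (2, 3, "c", -1.0),
    (0, 2, "d", -1.5),
    (1, 3, "e", -2.5),
    (0, 3, "f", -4.0),
]
score, path = best_segmentation(toy_segments, start=0, end=3)
labels = [seg[2] for seg in path]
print(score, labels)
```

In this toy graph the candidate paths are a-b-c, d-c, a-e, and f, and each one touches a different subset of the six segments, which mirrors the observation in the abstract that any single segmentation uses only part of the full observation space.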
Copyright information
© 2004 Springer Science+Business Media New York
About this paper
Cite this paper
Glass, J.R. (2004). Modelling Graph-Based Observation Spaces for Segment-Based Speech Recognition. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_8
DOI: https://doi.org/10.1007/978-1-4419-9017-4_8
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-6484-2
Online ISBN: 978-1-4419-9017-4
eBook Packages: Springer Book Archive