Can Phonetic Knowledge be Used to Improve the Performance of Speech Recognisers and Synthesisers?

Ainsworth, William A.

doi:10.1007/1-4020-2637-4_2

Part of the book series: Text, Speech and Language Technology (TLTB, volume 25)

Abstract

A chronological survey of the development of machine recognition of speech is contrasted with the beginnings of speech synthesis. The advantages and disadvantages of the different systems and approaches are sketched, along with their changing degrees of dependence on phonetic knowledge. The unsolved fundamental problem of concatenation quality in present-day synthesis is discussed, and a knowledge-based solution is mooted that can be projected onto recognition: a mathematical model of the relationship between temporally overlapping underlying articulatory gestures and the resulting surface acoustic signal.

Editor information

Editors and Affiliations

Universität des Saarlandes, Saarbrücken, Germany
William J. Barry
Norwegian University of Science and Technology, Trondheim, Norway
Wim A. van Dommelen

About this chapter

Cite this chapter

Ainsworth, W.A. (2005). Can Phonetic Knowledge be Used to Improve the Performance of Speech Recognisers and Synthesisers? In: Barry, W.J., van Dommelen, W.A. (eds) The Integration of Phonetic Knowledge in Speech Technology. Text, Speech and Language Technology, vol 25. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2637-4_2

DOI: https://doi.org/10.1007/1-4020-2637-4_2
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-2635-5
Online ISBN: 978-1-4020-2637-9
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
