Abstract
A chronological survey of the development of machine recognition of speech is contrasted with the beginnings of speech synthesis, and the advantages and disadvantages of the different systems and approaches, as well as their changing degrees of dependency on phonetic knowledge, are sketched. The unsolved fundamental problem of concatenation quality in present-day synthesis is discussed, and a knowledge-based solution is mooted that can be projected onto recognition: a mathematical model of the relationship between temporally overlapping underlying articulatory gestures and the resulting surface acoustic signal.
Copyright information
© 2005 Springer
About this chapter
Cite this chapter
Ainsworth, W.A. (2005). Can Phonetic Knowledge be Used to Improve the Performance of Speech Recognisers and Synthesisers? In: Barry, W.J., van Dommelen, W.A. (eds) The Integration of Phonetic Knowledge in Speech Technology. Text, Speech and Language Technology, vol 25. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2637-4_2
DOI: https://doi.org/10.1007/1-4020-2637-4_2
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-2635-5
Online ISBN: 978-1-4020-2637-9
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)