Summary
After presenting the basic principles of speech analysis, we focus on the mathematical techniques that underlie most of the methods currently used in speech processing, such as Fourier transforms and linear prediction analysis. We then review the parameter sets typically proposed to encode the speech signal prior to recognition. While these methods give a reasonable representation of speech spectra, they do not localize a signal’s spectral components very accurately in time. Two classes of techniques with the potential to address this problem, time-frequency analyses and wavelets, are presented. Finally, we address the problem of robust speech analysis and give a brief overview of higher-order spectral analysis and auditory modeling, illustrating our presentation with recent applications of these techniques to speech processing. We conclude the chapter by mentioning the limits of standard analysis methods in the presence of noise.
Copyright information
© 1996 Kluwer Academic Publishers
About this chapter
Cite this chapter
Junqua, JC., Haton, JP. (1996). Background on Speech Analysis. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_2
DOI: https://doi.org/10.1007/978-1-4613-1297-0_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8555-7
Online ISBN: 978-1-4613-1297-0
eBook Packages: Springer Book Archive