Background on Speech Analysis

Junqua, Jean-Claude; Haton, Jean-Paul

doi:10.1007/978-1-4613-1297-0_2

Part of the book series: The Kluwer International Series in Engineering and Computer Science (SECS, volume 341)

Summary

After presenting the basic principles of speech analysis, we focus on the mathematical techniques that underlie most of the methods currently used in speech processing, such as the Fourier transform and linear prediction analysis. We then review the parameter sets typically proposed to encode the speech signal prior to recognition. While these methods give a reasonable representation of speech spectra, they do not localize a signal’s spectral components very accurately in time. Two classes of techniques with the potential to address this problem, time-frequency analyses and wavelets, are presented. Finally, we address the problem of robust speech analysis and give a brief overview of the fields of higher-order spectral analysis and auditory modeling, illustrating our presentation with recent applications of these techniques to speech processing. We conclude the chapter by describing the limits of standard analysis methods in the presence of noise.
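
The remark about temporal localization can be illustrated with a small sketch. It is not taken from the chapter: the signal, the 8 kHz sample rate, the burst positions, and the helper names `tone_burst`, `magnitude_spectrum`, and `frame_energies` are all assumptions chosen for illustration. Two signals contain the same 1 kHz burst at different times. Their whole-signal magnitude spectra agree, since a time shift changes only the phase, while a simple frame-by-frame energy measure does separate them.

```python
import cmath
import math


def tone_burst(n_samples, start, length, freq_hz, sample_rate):
    """Signal that is zero except for a sine burst beginning at index `start`."""
    signal = [0.0] * n_samples
    for i in range(length):
        signal[start + i] = math.sin(2 * math.pi * freq_hz * i / sample_rate)
    return signal


def magnitude_spectrum(signal):
    """Magnitude of the DFT for bins 0..N/2, computed by direct summation."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def frame_energies(signal, frame_len):
    """Energy of consecutive non-overlapping frames of `frame_len` samples."""
    return [
        sum(x * x for x in signal[i:i + frame_len])
        for i in range(0, len(signal), frame_len)
    ]


SAMPLE_RATE = 8000  # Hz; an assumed value, not from the chapter

# Same burst, placed early in one signal and late in the other.
early = tone_burst(256, start=16, length=64, freq_hz=1000.0, sample_rate=SAMPLE_RATE)
late = tone_burst(256, start=176, length=64, freq_hz=1000.0, sample_rate=SAMPLE_RATE)

# Whole-signal magnitude spectra: the largest difference is at rounding level.
mag_early = magnitude_spectrum(early)
mag_late = magnitude_spectrum(late)
max_diff = max(abs(a - b) for a, b in zip(mag_early, mag_late))

# Short-time view: the frame holding the most energy differs between the two.
energies_early = frame_energies(early, 32)
energies_late = frame_energies(late, 32)
peak_early = energies_early.index(max(energies_early))
peak_late = energies_late.index(max(energies_late))

print("largest magnitude difference:", max_diff)
print("peak-energy frame (early, late):", peak_early, peak_late)
```

The direct-summation DFT is used only to keep the sketch dependency-free; an FFT routine would give the same magnitudes far more efficiently. Shortening the frames sharpens the time estimate but coarsens the frequency resolution available within each frame, which is the trade-off that time-frequency and wavelet analyses aim to manage.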

Author information

Authors and Affiliations

Speech Technology Laboratory, USA
Jean-Claude Junqua
CRIN - INRIA, France
Jean-Paul Haton

About this chapter

Cite this chapter

Junqua, JC., Haton, JP. (1996). Background on Speech Analysis. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_2

DOI: https://doi.org/10.1007/978-1-4613-1297-0_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8555-7
Online ISBN: 978-1-4613-1297-0
eBook Packages: Springer Book Archive
