A Variable-Scale Piecewise Stationary Spectral Analysis Technique Applied to ASR

  • Vivek Tyagi
  • Christian Wellekens
  • Hervé Bourlard
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3869)


Speech signals are widely acknowledged to contain short-term and long-term temporal properties [15] that are difficult to capture and model with the usual fixed-scale (typically 20 ms) short-time spectral analysis used in hidden Markov models (HMMs), which rests on piecewise stationarity and state-conditional independence assumptions about the acoustic vectors. For example, vowels are typically quasi-stationary over 40-80 ms segments, while plosives typically require analysis windows shorter than 20 ms. A fixed-scale analysis therefore cannot give the best time-frequency resolution for every phone in the speech signal. In this paper, we investigate the potential advantages of variable-size analysis windows for improving state-of-the-art speech recognition systems. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, using the likelihood that a segment was generated by a single AR process. This likelihood is estimated from the linear prediction (LP) residual error. Each quasi-stationary segment is then used as an analysis window from which spectral features are extracted. The result is a variable-scale spectral analysis that adaptively estimates the largest analysis window over which the signal remains quasi-stationary, and hence the best tradeoff between temporal and frequency resolution. Speech recognition experiments on the OGI Numbers95 database [19] show that features based on the proposed variable-scale piecewise stationary spectral analysis yield improved recognition accuracy in clean conditions, compared with features based on the minimum cross-entropy spectrum [1] as well as those based on fixed-scale spectral analysis.
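The segmentation idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact decision rule, so everything specific here is an assumption: the score is a generalized likelihood ratio that compares one AR model for a window against two AR models split before the newest block, the LP fit is a plain least-squares fit, and the names `lp_residual_variance`, `glr_score`, `quasi_stationary_segments`, and `synth_ar2`, as well as the window sizes, model order, and threshold, are hypothetical choices made for illustration.

```python
import numpy as np


def lp_residual_variance(x, order):
    """Mean squared residual of a least-squares LP fit of the given order."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Column k holds x[t-k] for t = order .. n-1, so each row is the
    # prediction context of one target sample.
    past = np.column_stack([x[order - k:n - k] for k in range(1, order + 1)])
    target = x[order:]
    coef, *_ = np.linalg.lstsq(past, target, rcond=None)
    resid = target - past @ coef
    return max(float(np.mean(resid ** 2)), 1e-12)  # floor avoids log(0)


def glr_score(x, split, order):
    """Log-likelihood gain of two AR models (split at `split`) over one.

    Assumed form: Gaussian AR log-likelihoods evaluated at the fitted
    residual variances, dropping terms that do not depend on them.
    """
    x = np.asarray(x, dtype=float)

    def log_lik(seg):
        m = len(seg) - order  # number of predicted samples
        return -0.5 * m * np.log(lp_residual_variance(seg, order))

    return log_lik(x[:split]) + log_lik(x[split:]) - log_lik(x)


def quasi_stationary_segments(x, min_len=80, max_len=640,
                              threshold=0.15, order=8):
    """Grow each window block by block until the newest block no longer fits.

    Returns (start, end) sample index pairs that tile the signal.
    min_len should be comfortably larger than `order`.
    """
    x = np.asarray(x, dtype=float)
    bounds, start, end = [0], 0, 2 * min_len
    while end <= len(x):
        split = end - min_len          # candidate boundary before newest block
        n = end - start
        per_sample = glr_score(x[start:end], split - start, order) / n
        if per_sample > threshold or n >= max_len:
            bounds.append(split)       # close the segment before the new block
            start, end = split, split + 2 * min_len
        else:
            end += min_len             # still one AR process: keep growing
    bounds.append(len(x))
    return list(zip(bounds[:-1], bounds[1:]))


def synth_ar2(phi1, phi2, n, rng):
    """Hypothetical test helper: n samples of a stable AR(2) process."""
    burn = 200                         # discard the start-up transient
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(2, n + burn):
        x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + e[t]
    return x[burn:]
```

Calling `quasi_stationary_segments` on a waveform returns index pairs that could each serve as one analysis window for spectral feature extraction. Under these assumptions, long stationary stretches produce long windows and abrupt spectral changes produce short ones; the threshold trades missed boundaries against false alarms and would need tuning on real speech.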


Keywords: Speech Signal; Minimum Mean Square Error; Automatic Speech Recognition; Analysis Window; Speech Recognition System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. Loughlin, P., Pitton, J., Hannaford, B.: Approximating Time-Frequency Density Functions via Optimal Combinations of Spectrograms. IEEE Signal Processing Letters 1(12) (December 1994)
  2. Itakura, F.: Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Trans. on ASSP 23(1) (February 1975)
  3. Atal, B.S.: Efficient coding of LPC parameters by temporal decomposition. In: Proc. of IEEE ICASSP, Boston, USA (1983)
  4. Svendsen, T., Paliwal, K.K., Harborg, E., Husoy, P.O.: An improved sub-word based speech recognizer. In: Proc. of IEEE ICASSP (1989)
  5. Makhoul, J.: Linear Prediction: A Tutorial Review. Proc. of IEEE 63(4) (April 1975)
  6. Coifman, R.R., Wickerhauser, M.V.: Entropy based algorithms for best basis selection. IEEE Trans. on Information Theory 38(2) (March 1992)
  7. Tyagi, V., McCowan, I., Bourlard, H., Misra, H.: Mel-Cepstrum Modulation Spectrum (MCMS) features for Robust ASR. In: Proc. of IEEE ASRU 2003, St. Thomas, Virgin Islands, USA (2003)
  8. Srinivasan, S., Kleijn, W.B.: Speech Enhancement Using Adaptive Time-domain Segmentation. In: Proc. of ICSLP 2004, Jeju, S. Korea (2004)
  9. Haykin, S.: Adaptive Filter Theory. Prentice-Hall Publishers, NJ (1993)
  10. Brandt, A.V.: Detecting and estimating the parameters jumps using ladder algorithms and likelihood ratio test. In: Proc. of ICASSP, Boston, MA, pp. 1017–1020 (1983)
  11. Obrecht, R.A.: A new Statistical Approach for the Automatic Segmentation of Continuous Speech Signals. IEEE Trans. on ASSP 36(1) (January 1988)
  12. Ajmera, J., McCowan, I., Bourlard, H.: Robust Speaker Change Detection. IEEE Signal Processing Letters 11(8) (August 2004)
  13. Achan, K., Roweis, S., Hertzmann, A., Frey, B.: A Segmental HMM for Speech Waveforms. UTML Technical Report 2004-001, Dept. of Computer Science, Univ. of Toronto (May 2004)
  14. Kay, S.M.: Fundamentals of Statistical Signal Processing: Detection Theory. Prentice-Hall Publishers, NJ (1998)
  15. Rabiner, L., Juang, B.H.: Fundamentals of Speech Recognition. Prentice-Hall, NJ (1993)
  16. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Trans. on ASSP 28(4) (August 1980)
  17. Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4) (April 1990)
  18. Young, S., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book. Cambridge University, Cambridge (1995)
  19. Cole, R.A., Fanty, M., Lander, T.: Telephone speech corpus at CSLU. In: Proc. of ICSLP, Yokohama, Japan (1994)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Vivek Tyagi (1, 3)
  • Christian Wellekens (1, 3)
  • Hervé Bourlard (2, 3)

  1. Institute Eurecom, Sophia-Antipolis, France
  2. IDIAP Research Institute, Martigny, Switzerland
  3. Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
