An Adaptive Method for Robust Detection of Vowels in Noisy Environment

  • Avinash Kumar
  • Gayadhar Pradhan
  • S. Shahnawazuddin


Automatic detection of vowels plays a significant role in the analysis and synthesis of speech signals. Detecting vowels within a speech utterance in noisy environments and varied contexts is a very challenging task. In this work, a robust technique based on non-local means (NLM) estimation is proposed for detecting vowels in noisy speech signals. In the NLM algorithm, the signal value at each sample point is estimated as the weighted sum of the signal values at other sample points within a search neighborhood. Each weight is computed from the squared difference between the signal values belonging to two segments: one segment is kept fixed, while the other is slid over the search neighborhood. For any particular sample point, the sum of these weight values is significantly smaller when the segments under consideration are higher in magnitude. In a given speech signal, the vowels are regions of high energy, and this remains true even under noisy conditions. In this work, the sum of weight values (SWV) computed at each time instant is therefore used as a discriminating feature for detecting the vowels in a given speech signal. In the proposed approach, the regions where the SWV exhibits significant transitions and attains lower values for a considerable duration, compared to the preceding and succeeding regions, are hypothesized to be vowels. This hypothesis is statistically validated under clean as well as noisy test conditions. For a proper comparison, a three-class statistical classifier (vowel, non-vowel and silence) is also developed for detecting the vowels in a given speech signal. For this classifier, mel-frequency cepstral coefficients are used as the acoustic feature vectors, while a deep neural network (DNN)-hidden Markov model (HMM) is employed for acoustic modeling. The proposed vowel detection method is observed to outperform the DNN-HMM-based statistical classifier as well as existing signal processing approaches under both clean and noisy test conditions.
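As a rough illustration of how the SWV feature described above behaves, the sketch below computes NLM weights with a Gaussian kernel and sums them at each sample. The patch length, search width, and bandwidth `lam` are illustrative choices (not the paper's settings), and the `swv` function name is hypothetical; the point is only that squared patch differences grow with signal magnitude, so the weight sum dips in high-energy (vowel-like) regions.

```python
import numpy as np

def swv(signal, patch_half=5, search_half=50, lam=1.0):
    """Sum of NLM weight values (SWV) at each sample.

    For every sample t, a patch centred at t is compared against patches
    centred at the other samples of a search neighborhood; each comparison
    yields a Gaussian-kernel weight exp(-d2 / (2*lam^2)), where d2 is the
    squared difference between the two patches.  The returned value at t
    is the sum of those weights, which tends to be small in high-energy
    regions of the signal.
    """
    n = len(signal)
    x = np.pad(signal, patch_half, mode="edge")   # pad so edge patches exist
    out = np.zeros(n)
    for t in range(n):
        ref = x[t : t + 2 * patch_half + 1]       # fixed patch centred at t
        lo = max(0, t - search_half)
        hi = min(n, t + search_half + 1)
        total = 0.0
        for j in range(lo, hi):
            if j == t:                            # skip the trivial self-match
                continue
            cand = x[j : j + 2 * patch_half + 1]  # sliding patch centred at j
            d2 = np.sum((ref - cand) ** 2)        # squared patch difference
            total += np.exp(-d2 / (2 * lam ** 2))
        out[t] = total
    return out
```

On a toy signal with a quiet segment followed by a high-amplitude segment, the SWV computed this way is markedly lower inside the loud segment, mirroring the behavior the method exploits for vowel regions.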


Keywords: Vowel detection · Non-local means estimation · Sum of weight values · Vowel onset point · Vowel end point



We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions that have greatly improved the quality of the manuscript.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Avinash Kumar (1)
  • Gayadhar Pradhan (1) (corresponding author)
  • S. Shahnawazuddin (1)
  1. Department of Electronics and Communication Engineering, National Institute of Technology Patna, Patna, India
