Abstract
This paper proposes a novel technique for detecting vowel regions in continuous speech using the envelope of the derivative of the speech signal, a non-negative, frequency-weighted energy operator. The method is implemented as a two-stage algorithm. The first stage analyzes the speech signal to detect vowel onset points (VOPs) and vowel end-points (VEPs) from an instantaneous energy contour obtained from the envelope of the derivative of the speech signal; the VOPs and VEPs are located with a peak-finding algorithm based on the first-order Gaussian differentiator. The second stage removes spurious vowel regions and corrects the hypothesized VOP and VEP locations using combined cues from the uniformity of epoch intervals and the strength of excitation of the speech signal. The performance of the proposed method is evaluated on the TIMIT acoustic-phonetic speech corpus. The proposed approach achieves a significantly higher detection rate and a lower false alarm rate than state-of-the-art methods in both clean and noisy environments.
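The first stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the energy operator takes the form of the squared envelope of the signal derivative, ẋ² + H[ẋ]², where H is the Hilbert transform, and it replaces the peak-finding algorithm with a single argmax/argmin, which is only adequate for a signal containing one vowel-like region. The function names, the 100 ms window, and the choice of sigma are all hypothetical.

```python
import numpy as np

def envelope_derivative_energy(x):
    """Squared envelope of the derivative, dx^2 + H[dx]^2 (assumed form)."""
    dx = np.gradient(x)  # central-difference estimate of the derivative
    n = len(dx)
    # Build the analytic signal with the FFT: keep DC (and Nyquist),
    # double the positive frequencies, zero the negative ones.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(np.fft.fft(dx) * gain)
    return np.abs(analytic) ** 2  # non-negative by construction

def gaussian_differentiator(length, sigma):
    """First-order derivative of a Gaussian, sampled as an FIR kernel."""
    t = np.arange(length) - (length - 1) / 2.0
    return -t / sigma**2 * np.exp(-0.5 * (t / sigma) ** 2)

def hypothesize_vop_vep(x, fs, win_ms=100.0):
    """Return (VOP, VEP) in seconds for a signal with ONE vowel-like region.

    win_ms and sigma = length / 6 are illustrative assumptions.
    """
    energy = envelope_derivative_energy(x)
    length = int(fs * win_ms / 1000) | 1  # force an odd kernel length
    kernel = gaussian_differentiator(length, sigma=length / 6.0)
    # Convolving with the Gaussian derivative yields the slope of the
    # smoothed energy contour: positive at onsets, negative at end-points.
    evidence = np.convolve(energy, kernel, mode="same")
    return np.argmax(evidence) / fs, np.argmin(evidence) / fs

# Usage on a synthetic 200 Hz burst between 0.3 s and 0.6 s
fs = 8000
t = np.arange(fs) / fs
burst = np.where((t >= 0.3) & (t < 0.6), np.sin(2 * np.pi * 200 * t), 0.0)
vop, vep = hypothesize_vop_vep(burst, fs)
```

For continuous speech with several vowels, the argmax/argmin step would have to give way to thresholded peak picking on the evidence contour, and the hypothesized locations would still need the second-stage correction based on epoch intervals and excitation strength, which this sketch does not attempt.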
Cite this article
Thirumuru, R., Vuppala, A.K. Application of non-negative frequency-weighted energy operator for vowel region detection. Int J Speech Technol 21, 279–291 (2018). https://doi.org/10.1007/s10772-018-9505-x