Robust analysis for improvement of vowel onset point detection under noisy conditions
A vowel onset point (VOP) is the instant at which the vowel region starts in a speech signal. VOPs are used as anchor points in the design of various speech-based systems. Several algorithms exist in the literature for identifying the occurrences of vowels in continuous spoken utterances. The algorithm based on combined evidence derived from source excitation, spectral peaks, and the modulation spectrum is used as the baseline system for the present study. The baseline system provides a satisfactory level of performance on clean data. On noisy data, however, its performance can be improved by additional pre-processing of the raw speech data and post-processing of the detected VOPs. In this paper, we propose using speech enhancement techniques as a pre-processing module to remove noise from the speech data under different noisy conditions. The pre-processed speech data is then passed through the baseline system to detect the VOPs. We observe that several spurious VOPs remain at the output of the baseline system. We therefore propose a post-processing module, based on the average signal-to-noise ratio and information derived from the glottal closure instant, to remove the spurious VOPs. Experiments were carried out on clean data, data with artificially injected noise, and data collected in practical noisy environments. The results suggest that the proposed system, using the pre-processing and post-processing modules, is robust and shows an improvement of 28–35 % over the existing baseline system by removing spurious VOPs under different noisy conditions.
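The post-processing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filter_vops_by_snr`, the window length `win`, the threshold `min_snr_db`, and the definition of average SNR as window power over a separately estimated noise power are all assumptions made for illustration. The glottal-closure-instant evidence mentioned in the abstract is omitted here.

```python
import numpy as np


def filter_vops_by_snr(signal, vops, noise_power, win=400, min_snr_db=3.0):
    """Keep candidate VOPs whose following window has enough average SNR.

    Hypothetical helper: `signal` is a 1-D float array, `vops` is a list of
    candidate sample indices, and `noise_power` is an externally estimated
    noise power (for example, from a leading non-speech stretch).
    """
    eps = 1e-12  # guards against log10(0) and division by zero
    kept = []
    for v in vops:
        segment = signal[v:v + win]
        if segment.size == 0:
            # Candidate lies beyond the end of the signal; discard it.
            continue
        segment_power = float(np.mean(segment ** 2))
        snr_db = 10.0 * np.log10(max(segment_power, eps) / max(noise_power, eps))
        if snr_db >= min_snr_db:
            kept.append(v)
    return kept
```

Under these assumptions, a candidate that falls in a noise-only stretch has a window power close to the noise power (about 0 dB) and is discarded, while a candidate at a true vowel onset is followed by a high-energy window and is kept.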
Keywords: Vowel onset point (VOP); Excitation source; Spectral peak; Modulation spectrum; Glottal closure instant (GCI); Minimum mean square error (MMSE)
This work is supported by the project titled “Development of Speech based Multi-Level Person Authentication System”, funded by the Department of Information Technology (DIT), New Delhi, India.