Prosodic word boundary detection from Bengali continuous speech

  • Tanmay Bhowmik
  • Shyamal Kumar Das Mandal
Original Paper


Detecting word boundaries in continuous speech is difficult because there is no definite pause or silence at word boundary positions, which makes continuous speech recognition a challenging task. Prosodic word boundaries, however, unlike written word boundaries, can be predicted from the prosodic parameters of continuous speech. This paper proposes a method for detecting such prosodic word boundaries in Bengali continuous speech. Bengali is a bound-stress language, in which stress falls on the first syllable of a prosodic word. Empirical Mode Decomposition is applied to the logarithm of the fundamental frequency (F0) contour of continuous speech to detect prosodic word boundaries. For the present work, 200 Bengali readout sentences, read by ten speakers, are analyzed. An overall prosodic boundary detection accuracy of 88.05% is achieved, with precision and recall of 90.73% and 88.31%, respectively, and an F-score of 89.5. A prosodic word dictionary comprising 5031 prosodic words has been developed by analyzing 1526 Bengali sentences with the proposed prosodic word boundary detection method.
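The precision, recall, and F-score figures reported above can be cross-checked with a short calculation. The sketch below assumes the reported F-score is the balanced F1 measure, that is, the harmonic mean of precision and recall; the abstract does not state the formula, so this is an assumption. The function name `f_score` is illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's F-score is F1; the source does not say so.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract, in percent.
reported_precision = 90.73
reported_recall = 88.31

f1 = f_score(reported_precision, reported_recall)
print(round(f1, 1))  # prints 89.5
```

Under that assumption the computed value agrees with the reported F-score of 89.5 to one decimal place.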


Prosodic word boundaries, Fundamental frequency, F0 contour, Accent command, Onset, Offset




Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Bennett University, Greater Noida, India
  2. CET, Indian Institute of Technology Kharagpur, Kharagpur, India
