
Modeling Supra-Segmental Features of Syllables Using Neural Networks

  • Chapter

Part of the book series: Studies in Computational Intelligence (SCI, volume 83)

In this chapter we discuss the modeling of supra-segmental features (intonation and duration) of syllables and suggest some applications of these models. These supra-segmental features are also termed prosodic features, and the corresponding models are therefore known as prosody models. Neural networks are used to capture the implicit duration and intonation knowledge in the sequence of syllables of an utterance. A four-layer feedforward neural network trained with the backpropagation algorithm is used to model the duration and intonation knowledge of syllables separately. Labeled broadcast news data in Hindi, Telugu and Tamil are used to develop neural network models that predict the duration and F0 of syllables in these languages. The input to the neural network is a feature vector representing the positional, contextual and phonological constraints. To improve prediction accuracy, the predicted values are processed further, and a two-stage duration model is also proposed. The performance of the prosody models is evaluated using objective measures such as average prediction error, standard deviation and correlation coefficient. The prosody models are examined for applications such as speaker recognition, language identification and text-to-speech synthesis.
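The three objective measures named in the abstract can be illustrated with a short sketch. The abstract does not give the exact formulas, so the definitions below are assumptions: the average prediction error is taken as the mean absolute deviation between actual and predicted values, the standard deviation is computed over those absolute deviations, and the correlation coefficient is the Pearson correlation between actual and predicted values. The function name `evaluate_predictions` and the sample durations are hypothetical.

```python
import math


def evaluate_predictions(actual, predicted):
    """Return (mean_error, std_error, correlation) for two equal-length sequences.

    Assumed definitions (not taken from the chapter):
      mean_error  - mean of |actual - predicted|
      std_error   - population standard deviation of |actual - predicted|
      correlation - Pearson correlation between actual and predicted
    """
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")

    n = len(actual)

    # Absolute deviation for each syllable.
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mean_error = sum(errors) / n
    std_error = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n)

    # Pearson correlation between the actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    correlation = cov / math.sqrt(var_a * var_p)

    return mean_error, std_error, correlation


# Hypothetical syllable durations in milliseconds (illustrative values only).
actual_ms = [100.0, 200.0, 300.0]
predicted_ms = [110.0, 190.0, 310.0]
mean_error, std_error, correlation = evaluate_predictions(actual_ms, predicted_ms)
```

The same function would apply unchanged to F0 values in Hz, since it only compares two numeric sequences.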



Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Rao, K.S. (2008). Modeling Supra-Segmental Features of Syllables Using Neural Networks. In: Prasad, B., Prasanna, S.R.M. (eds) Speech, Audio, Image and Biomedical Signal Processing using Neural Networks. Studies in Computational Intelligence, vol 83. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75398-8_4


  • DOI: https://doi.org/10.1007/978-3-540-75398-8_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75397-1

  • Online ISBN: 978-3-540-75398-8

  • eBook Packages: Engineering (R0)
