Abstract
The quality of text-to-speech (TTS) synthesis systems can be improved by controlling the intensities of speech segments in addition to their durations and intonation. This paper proposes linguistic and production constraints for modeling the intensity patterns of sequences of syllables. Linguistic constraints are represented by positional, contextual and phonological features, and production constraints are represented by articulatory features associated with syllables. A feedforward neural network (FFNN) is proposed to model the intensities of syllables. The proposed FFNN model is evaluated by means of objective measures: average prediction error (μ), standard deviation (σ), correlation coefficient (γ_{X,Y}) and the percentage of syllables predicted within different deviations. Its prediction performance is compared with that of other statistical models, namely Linear Regression (LR) and Classification and Regression Tree (CART) models. The models are also evaluated by means of subjective listening tests on speech synthesized by incorporating the predicted syllable intensities in a Bengali TTS system. The evaluation studies show that the FFNN model predicts syllable intensities more accurately than the LR and CART models.
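The objective measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give their formulas, so the definitions below are assumptions: μ is taken as the mean absolute error, σ as the standard deviation of the absolute errors, γ_{X,Y} as the Pearson correlation between actual and predicted intensities, and a syllable's deviation as its absolute error relative to the actual value, in percent. The function name `intensity_metrics` and the thresholds (5, 10, 25) are hypothetical and chosen only for illustration.

```python
import math


def intensity_metrics(actual, predicted, thresholds=(5, 10, 25)):
    """Compare predicted syllable intensities with actual ones.

    All definitions here are assumptions, not the paper's stated formulas.
    Actual intensities are assumed to be non-zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)

    # Absolute prediction error per syllable.
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]

    # mu: average prediction error; sigma: spread of the errors around mu.
    mu = sum(abs_err) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in abs_err) / n)

    # gamma: Pearson correlation between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    gamma = cov / (std_a * std_p) if std_a > 0 and std_p > 0 else float("nan")

    # Percentage of syllables whose relative deviation is within each threshold.
    pct_dev = [100.0 * e / abs(a) for e, a in zip(abs_err, actual)]
    within = {t: 100.0 * sum(d <= t for d in pct_dev) / n for t in thresholds}

    return {"mu": mu, "sigma": sigma, "gamma": gamma, "within": within}
```

Called with two equal-length lists of intensity values, the function returns a dictionary holding the four measures, so the same call can be repeated for the FFNN, LR and CART predictions and the results compared side by side.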
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ramu Reddy, V., Sreenivasa Rao, K. (2012). Intensity Modeling for Syllable Based Text-to-Speech Synthesis. In: Parashar, M., Kaushik, D., Rana, O.F., Samtaney, R., Yang, Y., Zomaya, A. (eds) Contemporary Computing. IC3 2012. Communications in Computer and Information Science, vol 306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32129-0_16
DOI: https://doi.org/10.1007/978-3-642-32129-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32128-3
Online ISBN: 978-3-642-32129-0
eBook Packages: Computer Science, Computer Science (R0)