Intensity Modeling for Syllable Based Text-to-Speech Synthesis

Ramu Reddy, V.; Sreenivasa Rao, K.

doi:10.1007/978-3-642-32129-0_16

Part of the book series: Communications in Computer and Information Science (CCIS, volume 306)

Included in the following conference series:

International Conference on Contemporary Computing

Abstract

The quality of text-to-speech (TTS) synthesis systems can be improved by controlling the intensities of speech segments in addition to their durations and intonation. This paper proposes linguistic and production constraints for modeling the intensity patterns of a sequence of syllables. Linguistic constraints are represented by positional, contextual and phonological features, and production constraints are represented by articulatory features associated with the syllables. In this work, a feedforward neural network (FFNN) is proposed to model the intensities of syllables. The proposed FFNN model is evaluated by means of objective measures: average prediction error (μ), standard deviation (σ), correlation coefficient (γ_X,Y) and the percentage of syllables predicted within different deviations. The prediction performance of the proposed model is compared with that of other statistical models, namely Linear Regression (LR) and Classification and Regression Tree (CART) models. The models are also evaluated by means of subjective listening tests on synthesized speech generated by incorporating the predicted syllable intensities in a Bengali TTS system. The evaluation studies show that prediction accuracy is better for the FFNN model than for the other models.
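The objective measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions below are assumptions: μ is taken as the mean absolute error, σ as the standard deviation of the absolute errors, γ_X,Y as the Pearson correlation between actual and predicted values, and the deviation of a syllable as its absolute error expressed as a percentage of the actual value. The function name, the threshold values and the sample intensities are hypothetical and are not taken from the paper.

```python
import math


def objective_measures(actual, predicted, thresholds=(2, 5, 10)):
    """Return (mu, sigma, gamma, within) for paired intensity values.

    Assumed definitions (not stated in the abstract):
      mu     : mean absolute prediction error
      sigma  : population standard deviation of the absolute errors
      gamma  : Pearson correlation between actual and predicted values
      within : {threshold: % of syllables whose percentage deviation
                is at most that threshold}
    Actual values must be non-zero and neither sequence may be constant.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(actual)

    # Absolute prediction error per syllable.
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(abs_err) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in abs_err) / n)

    # Pearson correlation computed from population moments.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p)
              for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    gamma = cov / (std_a * std_p)

    # Percentage deviation of each prediction from the actual value.
    pct_dev = [100 * e / abs(a) for e, a in zip(abs_err, actual)]
    within = {t: 100 * sum(d <= t for d in pct_dev) / n for t in thresholds}
    return mu, sigma, gamma, within


# Hypothetical syllable intensities (arbitrary units), for illustration only.
actual = [60, 70, 80, 50]
predicted = [63, 70, 76, 55]
mu, sigma, gamma, within = objective_measures(actual, predicted)
print(mu, sigma, gamma, within)
```

Under these assumed definitions, a lower μ and σ and a higher γ_X,Y indicate better prediction, and the `within` percentages show how many syllables fall inside each deviation band.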

Author information

Authors and Affiliations

School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, West Bengal, India
V. Ramu Reddy & K. Sreenivasa Rao

Editor information

Editors and Affiliations

TASSL, Dept. of Electrical & Computer Engineering, Rutgers the State University of New Jersey, Brett Road, 08854-8058, Piscataway, NJ, USA
Manish Parashar
Mathematics and Computer Science Division, Argonne National Laboratory, 60439, Argonne, IL, USA
Dinesh Kaushik
School of Computer Science and Welsh Science Center, Cardiff University, 5 The Parade, CF24 3AA, Cardiff, UK
Omer F. Rana
Division of Physical Sciences and Engineering, 4700 King Abdullah University of Science and Technology, Room 3221, Al Jazri Building, 23955-6900, Thuwal, Makkah, Saudi Arabia
Ravi Samtaney
Department of Electrical and Computer Engineering, Stony Brook University, 11794, Stony Brook, New York, USA
Yuanyuan Yang
Faculty of Engineering and Information Technologies, School of Information Technologies, University of Sydney, 2006, Sydney, NSW, Australia
Albert Zomaya

About this paper

Cite this paper

Ramu Reddy, V., Sreenivasa Rao, K. (2012). Intensity Modeling for Syllable Based Text-to-Speech Synthesis. In: Parashar, M., Kaushik, D., Rana, O.F., Samtaney, R., Yang, Y., Zomaya, A. (eds) Contemporary Computing. IC3 2012. Communications in Computer and Information Science, vol 306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32129-0_16

DOI: https://doi.org/10.1007/978-3-642-32129-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32128-3
Online ISBN: 978-3-642-32129-0
eBook Packages: Computer Science, Computer Science (R0)
