Learning Prosodic Patterns for Mandarin Speech Synthesis

Journal of Intelligent Information Systems

Abstract

Widespread use of text-to-speech (TTS) technology requires higher-quality synthesized speech. Prosody, which mainly describes the variation of pitch, is the key feature: poorly modeled prosodic patterns make synthetic speech sound unnatural and monotonous. The rules used in most Chinese TTS systems are constructed by experts, with weak quality control and low precision. In this paper, we propose a combination of clustering and machine learning techniques to extract prosodic patterns from large databases of actual Mandarin speech, in order to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by clustering analysis. Several machine learning techniques, including rough sets, artificial neural networks (ANN), and decision trees, are trained to model fundamental frequency and energy contours, which can be used directly in a pitch-synchronous-overlap-add-based (PSOLA-based) TTS system. The experimental results show that the synthesized prosodic features closely resemble their original counterparts for most syllables.
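The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm, contour representation, or distance measure the authors used, so everything here is an assumption: plain k-means with a farthest-first start, four-point fundamental frequency contours per syllable, squared Euclidean distance, and invented pitch values in Hz. The function names `sq_dist` and `cluster_contours` are hypothetical.

```python
# Illustrative sketch only: the algorithm, contour length, and data are
# assumptions, not the method reported in the paper.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length contours."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_contours(contours, k, iters=10):
    """Group pitch contours into k typical patterns with plain k-means."""
    # Farthest-first initialisation keeps the result deterministic:
    # start from the first contour, then repeatedly add the contour
    # that lies farthest from the centroids chosen so far.
    centroids = [list(contours[0])]
    while len(centroids) < k:
        far = max(contours, key=lambda c: min(sq_dist(c, m) for m in centroids))
        centroids.append(list(far))

    labels = [0] * len(contours)
    for _ in range(iters):
        # Assignment step: each contour joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(c, centroids[j]))
            for c in contours
        ]
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        for j in range(k):
            members = [c for c, lab in zip(contours, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical four-point F0 contours in Hz: three rising, three falling.
rising = [[100 + d, 120 + d, 140 + d, 160 + d] for d in (0, 5, -5)]
falling = [[160 + d, 140 + d, 120 + d, 100 + d] for d in (0, 5, -5)]
contours = rising + falling

centroids, labels = cluster_contours(contours, k=2)
for j, centroid in enumerate(centroids):
    print(f"pattern {j}: {centroid}")
```

Each centroid plays the role of one "typical prosody model". In a fuller pipeline, a classifier such as a decision tree could then be trained to predict the pattern label from linguistic context, which is one plausible reading of how the learning stage follows the clustering stage.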

Cite this article

Chen, Y., Gao, W., Zhu, T. et al. Learning Prosodic Patterns for Mandarin Speech Synthesis. Journal of Intelligent Information Systems 19, 95–109 (2002). https://doi.org/10.1023/A:1015568521453

  • DOI: https://doi.org/10.1023/A:1015568521453