Abstract
Widespread use of text-to-speech (TTS) technology requires higher-quality synthesized speech. Prosodic patterns, which mainly describe the variation of pitch, are the key factor: poorly modeled prosody is what makes synthetic speech sound unnatural and monotonous. The prosodic rules used in most Chinese TTS systems are hand-constructed by experts, with weak quality control and low precision. In this paper, we propose combining clustering and machine learning techniques to extract prosodic patterns from large databases of actual Mandarin speech, in order to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by cluster analysis. Several machine learning techniques, including rough sets, artificial neural networks (ANNs), and decision trees, are trained to model fundamental frequency and energy contours, which can be used directly in a pitch-synchronous overlap-add (PSOLA) based TTS system. Experimental results show that the synthesized prosodic features closely resemble their original counterparts for most syllables.
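The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm, contour representation, or distance measure the authors used, so everything below is an assumption: plain k-means, squared Euclidean distance, and toy four-point pitch contours in Hz. The function name `kmeans_contours` and the sample values are hypothetical.

```python
import random


def kmeans_contours(contours, k, iters=20, seed=0):
    """Plain k-means over fixed-length pitch contours (lists of floats).

    Assumption: this stands in for the unspecified clustering analysis;
    it is not the paper's algorithm.
    """
    rng = random.Random(seed)
    # Start from k contours picked at random from the data.
    centroids = [list(c) for c in rng.sample(contours, k)]
    labels = [0] * len(contours)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda j: sum((x - y) ** 2 for x, y in zip(c, centroids[j])),
            )
            for c in contours
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(contours, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy data: three rising and three falling F0 contours (Hz).
rising = [[100, 120, 140, 160], [105, 125, 150, 170], [98, 118, 135, 158]]
falling = [[160, 140, 120, 100], [170, 150, 125, 105], [158, 135, 118, 98]]

centroids, labels = kmeans_contours(rising + falling, k=2)
print(labels)     # rising contours share one label, falling the other
print(centroids)  # each centroid is one "typical" contour shape
```

Each resulting centroid plays the role of a typical prosody pattern. In a pipeline like the one the abstract describes, a classifier such as a decision tree would then be trained to predict the pattern for each syllable from its textual context.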
Cite this article
Chen, Y., Gao, W., Zhu, T. et al. Learning Prosodic Patterns for Mandarin Speech Synthesis. Journal of Intelligent Information Systems 19, 95–109 (2002). https://doi.org/10.1023/A:1015568521453
DOI: https://doi.org/10.1023/A:1015568521453