A New Korean Corpus-Based Text-to-Speech System

International Journal of Speech Technology

Abstract

This paper describes a new Korean text-to-speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech; their poor naturalness is caused by excessive prosodic modification of units drawn from a small speech database. To address this problem, we used a dynamic unit selection method that draws on a large speech database and applies no prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set that maximizes the phonetic and prosodic coverage of Korean triphones. All utterances were segmented automatically into phonemes using a speech recognizer. Given the segmented phonemes, we assigned a cost of zero to any two synthesis units that were consecutive in an utterance. This reduces the number of concatenation points at which mismatches may occur. We also present data on the realization of major prosodic variations in terms of prosodic phrase break strength. Phrase breaks were divided into four strength levels according to pause length, and triphones were further classified by phrase break strength to reflect major prosodic variations. To predict phrase break strength from text, we adopted an HMM-like part-of-speech (POS) sequence model, which achieved 73.5% accuracy on 4-level break strength prediction. For unit selection, a Viterbi beam search finds the most appropriate triphone sequence, namely the one with the minimum continuation cost of prosody and spectrum at concatenation boundaries. In an informal listening test, the proposed Korean corpus-based TTS system showed better naturalness than the conventional demisyllable-based one.
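The unit selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` fields, the scalar `f0` and `spec` features, the weighting inside `join_cost`, and the beam width are all assumptions chosen for illustration. The only properties taken from the abstract are that consecutive units from the same utterance cost zero to join and that a Viterbi beam search picks the sequence with the lowest total cost at the concatenation boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One candidate triphone instance in the speech database (hypothetical layout)."""
    triphone: str  # triphone label, e.g. "k-a+n"
    utt: int       # id of the recorded utterance the unit came from
    pos: int       # index of the unit inside that utterance
    f0: float      # pitch near the unit boundary in Hz (simplified to one value)
    spec: float    # stand-in for a spectral feature vector (one scalar here)


def join_cost(a: Unit, b: Unit) -> float:
    """Continuation cost of placing unit b directly after unit a."""
    # Units that were adjacent in the original recording join for free.
    if a.utt == b.utt and b.pos == a.pos + 1:
        return 0.0
    # Otherwise penalise prosodic and spectral mismatch (weights are assumed).
    return abs(a.f0 - b.f0) / 100.0 + abs(a.spec - b.spec)


def select_units(targets, inventory, beam=5):
    """Viterbi beam search over candidate units.

    targets:   list of triphone labels to synthesise, in order
    inventory: dict mapping a triphone label to its candidate Unit list
    Returns (total_cost, list_of_units) for the cheapest path found.
    """
    hypotheses = [(0.0, [])]  # each hypothesis is (accumulated cost, path)
    for label in targets:
        candidates = inventory.get(label, [])
        if not candidates:
            raise KeyError(f"no candidate unit for triphone {label!r}")
        best = {}  # cheapest hypothesis ending in each candidate unit
        for cost, path in hypotheses:
            for unit in candidates:
                step = join_cost(path[-1], unit) if path else 0.0
                total = cost + step
                if unit not in best or total < best[unit][0]:
                    best[unit] = (total, path + [unit])
        # Beam pruning: keep only the `beam` cheapest hypotheses.
        hypotheses = sorted(best.values(), key=lambda h: h[0])[:beam]
    return hypotheses[0]
```

Because contiguous units join at zero cost, the search tends to reuse long stretches of a single recorded utterance whenever the inventory allows it, which is how the number of actual concatenation points is kept small.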


Cite this article

Kim, S., Lee, Y. & Hirose, K. A New Korean Corpus-Based Text-to-Speech System. International Journal of Speech Technology 5, 105–116 (2002). https://doi.org/10.1023/A:1015454829127
