A New Korean Corpus-Based Text-to-Speech System

Kim, Sanghun; Lee, Youngjik; Hirose, Keikichi

doi:10.1023/A:1015454829127

Published: May 2002

Volume 5, pages 105–116, (2002)

International Journal of Speech Technology

Sanghun Kim¹,
Youngjik Lee¹ &
Keikichi Hirose²

Abstract

This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech. The poor naturalness is caused by excessive prosodic modification of units drawn from a small speech database. To cope with this problem, we use a dynamic unit selection method that draws on a large speech database and applies no prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set that maximizes the phonetic or prosodic coverage of Korean triphones. All the utterances were segmented automatically into phonemes using a speech recognizer. With the segmented phonemes, the synthesis unit cost is zero when two synthesis units are consecutive in an utterance. This reduces the number of concatenation points at which mismatches may occur. In this paper, we present data on the realization of major prosodic variations, taking prosodic phrase break strength into account. Phrase breaks were divided into four levels of strength based on pause length. Using phrase break strength, triphones were further classified to reflect major prosodic variations. To predict phrase break strength from text, we adopted an HMM-like Part-of-Speech (POS) sequence model. The model achieved 73.5% accuracy for 4-level break strength prediction. For unit selection, a Viterbi beam search was performed to find the most appropriate triphone sequence, the one with the minimum continuation cost of prosody and spectrum at concatenating boundaries. In an informal listening test, the proposed Korean corpus-based TTS system showed better naturalness than the conventional demisyllable-based one.
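The unit selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` fields, the `join_cost` formula, and the `beam_width` value are assumptions made for illustration, and the spectrum is reduced to a single number. The sketch shows only the two ideas the abstract states, namely that units adjacent in a recorded utterance join at zero cost and that a Viterbi beam search picks the sequence with the lowest accumulated cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One triphone instance in the corpus (hypothetical fields)."""
    triphone: str    # triphone label
    utt_id: int      # utterance the unit was cut from
    index: int       # position of the unit inside that utterance
    f0: float        # pitch near the boundary in Hz, a stand-in for prosody
    spectrum: float  # single number standing in for a spectral vector


def join_cost(left: Unit, right: Unit) -> float:
    """Cost of concatenating two units (assumed formula, not from the paper)."""
    # Units that were consecutive in the same recorded utterance join for free.
    if left.utt_id == right.utt_id and right.index == left.index + 1:
        return 0.0
    # Otherwise penalize prosodic and spectral mismatch at the boundary.
    return abs(left.f0 - right.f0) / 100.0 + abs(left.spectrum - right.spectrum)


def select_units(targets, inventory, beam_width=3):
    """Viterbi beam search over candidate units for each target triphone.

    Returns (total_cost, path) for the cheapest sequence that survived pruning.
    """
    beam = [(0.0, [])]
    for triphone in targets:
        candidates = inventory.get(triphone, [])
        if not candidates:
            raise KeyError(f"no unit in the inventory for {triphone!r}")
        best = {}
        for cost, path in beam:
            for cand in candidates:
                step = join_cost(path[-1], cand) if path else 0.0
                total = cost + step
                # Viterbi step: keep only the cheapest path ending in cand.
                if cand not in best or total < best[cand][0]:
                    best[cand] = (total, path + [cand])
        # Beam pruning: keep the beam_width cheapest partial paths.
        beam = sorted(best.values(), key=lambda entry: entry[0])[:beam_width]
    return beam[0]
```

Because the zero-cost rule rewards runs of units from the same utterance, the search tends to reuse long stretches of recorded speech where the corpus allows it, which matches the abstract's aim of reducing the number of concatenation points.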

Author information

Authors and Affiliations

Spoken Language Processing Team, Electronics and Telecommunications Research Institute, Korea
Sanghun Kim & Youngjik Lee
Department of Frontier Informatics, School of Frontier Sciences, University of Tokyo, Japan
Keikichi Hirose

About this article

Cite this article

Kim, S., Lee, Y. & Hirose, K. A New Korean Corpus-Based Text-to-Speech System. International Journal of Speech Technology 5, 105–116 (2002). https://doi.org/10.1023/A:1015454829127

Issue Date: May 2002
DOI: https://doi.org/10.1023/A:1015454829127
