Abstract
This paper presents a method for hidden Markov model (HMM)-based Mandarin-Tibetan cross-lingual statistical speech synthesis using speaker adaptive training (SAT). A Speech Assessment Methods Phonetic Alphabet (SAMPA) symbol set is designed to label the pronunciation of the initials and finals of Mandarin and Tibetan syllables, based on the similarities in pronunciation between the two languages. A grapheme-to-phoneme conversion method is implemented to convert Chinese or Tibetan sentences to SAMPA-based Pinyin sequences. A Mandarin statistical speech synthesis framework is employed to realize Mandarin-Tibetan cross-lingual speech synthesis. A context-dependent label format is designed to label the context information of Mandarin and Tibetan sentences, and a question set is built for context-dependent decision tree clustering. Initials and finals are used as the synthesis units. A set of average mixed-lingual models is trained with SAT from a large multi-speaker Mandarin corpus and a small single-speaker Tibetan corpus. The speaker adaptation transformation is then applied to the speaker-dependent (SD) training data to obtain a set of SD Mandarin or Tibetan models from the average mixed-lingual models. Mandarin or Tibetan speech is then synthesized from the corresponding SD models. Tests show that this method outperforms the method using only Tibetan SD models when only a small number of Tibetan training utterances are available. As the number of Tibetan training utterances increases, the performance of the two methods converges. Mixing Tibetan training sentences into the training data has only a small effect on the quality of synthesized Mandarin speech.
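The idea of using initials and finals as synthesis units can be illustrated with a minimal sketch. It assumes standard Hanyu Pinyin with tone digits rather than the paper's SAMPA-based symbol set, which is not given in the abstract. The names `INITIALS`, `split_syllable`, and `to_units` are hypothetical, and treating `y` and `w` as initials is a simplification.

```python
# Illustrative only: standard Hanyu Pinyin initials, NOT the paper's SAMPA set.
# Two-letter initials come first so "zh", "ch", "sh" match before "z", "c", "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a tone-numbered Pinyin syllable into (initial, final, tone)."""
    syllable = syllable.strip().lower()
    tone = None
    # A trailing digit is taken as the tone number, e.g. "zhong1" -> tone 1.
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for initial in INITIALS:
        # Require a non-empty final after the initial.
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):], tone
    # Zero-initial syllable such as "an": the whole syllable is the final.
    return "", syllable, tone


def to_units(pinyin_sentence):
    """Convert a space-separated Pinyin sentence to a flat unit sequence."""
    units = []
    for syl in pinyin_sentence.split():
        initial, final, tone = split_syllable(syl)
        if initial:
            units.append(initial)
        # Attach the tone to the final, a common (assumed) labeling choice.
        units.append(final if tone is None else f"{final}{tone}")
    return units


print(to_units("ni3 hao3"))
```

In a full system, each unit would additionally carry context information (neighboring units, tone, position in the syllable and sentence) in the context-dependent label format the abstract mentions; that format is not specified here, so it is omitted from the sketch.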
Additional information
The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant Nos. 61263036 and 61262055), the Gansu Science Fund for Distinguished Young Scholars (Grant No. 1210RJDA007), and the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Agency (JST).
Cite this article
Yang, H., Oura, K., Wang, H. et al. Using speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis. Multimed Tools Appl 74, 9927–9942 (2015). https://doi.org/10.1007/s11042-014-2117-9
DOI: https://doi.org/10.1007/s11042-014-2117-9