
Using speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis

Multimedia Tools and Applications

Abstract

This paper presents a method for hidden Markov model (HMM)-based Mandarin-Tibetan cross-lingual statistical speech synthesis using speaker adaptive training. A Speech Assessment Methods Phonetic Alphabet (SAMPA) symbol set is designed to label the initials and finals of Mandarin and Tibetan syllables, based on the similarities in pronunciation between the two languages. A grapheme-to-phoneme conversion method converts Chinese or Tibetan sentences into SAMPA-based Pinyin sequences. An existing Mandarin statistical speech synthesis framework is then extended to cover both languages. A context-dependent label format is designed to describe the context information of Mandarin and Tibetan sentences, and a question set is built for context-dependent decision tree clustering. Initials and finals serve as the synthesis units. A set of average mixed-lingual models is trained with speaker adaptive training (SAT) on a large multi-speaker Mandarin corpus and a small single-speaker Tibetan corpus. A speaker adaptation transformation is then applied, using the speaker dependent (SD) training data, to derive a set of speaker dependent Mandarin or Tibetan models from the average mixed-lingual models. Mandarin or Tibetan speech is synthesized from these speaker dependent models. Tests show that this method outperforms a method that uses only Tibetan SD models when only a small number of Tibetan training utterances are available. As the number of Tibetan training utterances increases, the performance of the two methods converges. Mixing Tibetan training sentences into the corpus has only a small effect on the quality of the synthesized Mandarin speech.
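The unit preparation summarized in the abstract (syllables split into initials and finals, then labelled with their context) can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the initial inventory is plain Pinyin rather than the paper's SAMPA-based symbol set, the function names are hypothetical, and the `prev-cur+next` label is a simplified stand-in for the paper's context-dependent label format, which is not given in the abstract.

```python
# Hypothetical illustration only: standard Pinyin initials, not the paper's
# SAMPA-based inventory. "y" and "w" are treated as initials for simplicity.
# Two-letter initials come first so that "zh" is matched before "z".
PINYIN_INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
]


def split_syllable(syllable):
    """Split a toneless Pinyin syllable into (initial, final)."""
    for initial in PINYIN_INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    # Zero-initial syllable such as "an" or "er".
    return "", syllable


def to_units(syllables):
    """Flatten syllables into a sequence of initial/final synthesis units."""
    units = []
    for syllable in syllables:
        initial, final = split_syllable(syllable)
        if initial:
            units.append(initial)
        units.append(final)
    return units


def context_labels(units, pad="sil"):
    """Attach left and right neighbours to each unit (assumed label format)."""
    padded = [pad] + units + [pad]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    for label in context_labels(to_units(["ni", "hao"])):
        print(label)
```

A real label format for HMM-based synthesis would typically also carry tone, position, and phrase-level context; this sketch keeps only the immediate neighbours to show the idea.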



Author information

Corresponding author

Correspondence to Hongwu Yang.

Additional information

The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant Nos. 61263036 and 61262055), the Gansu Science Fund for Distinguished Young Scholars (Grant No. 1210RJDA007), and the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Agency (JST).


Cite this article

Yang, H., Oura, K., Wang, H. et al. Using speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis. Multimed Tools Appl 74, 9927–9942 (2015). https://doi.org/10.1007/s11042-014-2117-9


  • DOI: https://doi.org/10.1007/s11042-014-2117-9
