A Study on Tailor-Made Speech Synthesis Based on Deep Neural Networks

Yamada, Shuhei; Nose, Takashi; Ito, Akinori

doi:10.1007/978-3-319-50209-0_20

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 63)

Abstract

We propose “tailor-made speech synthesis,” a speech synthesis technique that enables users to control synthetic speech naturally and intuitively. As a first step toward realizing it, we introduce F0 context into speaker model training for speech synthesis based on deep neural networks (DNNs). The F0 context represents the relative log F0 of the training data at the mora or accent-phrase level. Unlike the conventional F0 context used in HMM-based techniques, it allows users to control the F0 of synthetic speech steplessly. Experiments showed that the F0 context was effective for controlling F0: the F0 of the synthetic voice followed the value of the F0 context.
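
To make the idea of an F0 context concrete, the following is a minimal sketch of how a relative log F0 value per mora might be computed from training data. The abstract does not define "relative," so this sketch assumes it means the mean log F0 of a segment minus the mean log F0 of the whole utterance, both taken over voiced frames. The function name `relative_log_f0`, the frame-index segment format, and the handling of unvoiced segments are all hypothetical and are not taken from the paper.

```python
import math


def relative_log_f0(f0_frames, segments):
    """Return one relative log-F0 value per segment (e.g., per mora).

    f0_frames: per-frame F0 in Hz, with 0.0 marking unvoiced frames.
    segments:  list of (start, end) frame indices, end exclusive.
    Assumption (not from the paper): "relative" means segment mean minus
    utterance mean, both computed over voiced frames in the log domain.
    """
    voiced = [math.log(f) for f in f0_frames if f > 0.0]
    if not voiced:
        raise ValueError("utterance has no voiced frames")
    utt_mean = sum(voiced) / len(voiced)

    contexts = []
    for start, end in segments:
        seg = [math.log(f) for f in f0_frames[start:end] if f > 0.0]
        # A fully unvoiced segment gets 0.0 (no offset) in this sketch.
        contexts.append(sum(seg) / len(seg) - utt_mean if seg else 0.0)
    return contexts


# Toy utterance: two moras, the second one an octave above the first.
f0 = [0.0, 100.0, 100.0, 200.0, 200.0, 0.0]
moras = [(0, 3), (3, 6)]
contexts = relative_log_f0(f0, moras)
print(contexts)
```

Under this assumption, the first mora receives a negative value and the second a positive value of equal magnitude (half of log 2). Because the value is a real number rather than a discrete class, a model trained with it as an input feature could in principle be driven with any intermediate value at synthesis time, which is one reading of the "stepless" control described above.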

Author information

Authors and Affiliations

Graduate School of Engineering, Tohoku University, Aramaki Aza Aoba 6–6–05, Aoba-ku, Sendai-shi, Miyagi, 980–8579, Japan
Shuhei Yamada, Takashi Nose & Akinori Ito

Corresponding author

Correspondence to Shuhei Yamada.

Editor information

Editors and Affiliations

University Town, National Kaohsiung University of Applied University Town, Fujian, China
Jeng-Shyang Pan
College of Information Science and Engin, Fujian University of Technology, Fujian, China
Pei-Wei Tsai
National University of Kaohsiung, Kaohsiung, Taiwan
Hsiang-Cheh Huang

About this paper

Cite this paper

Yamada, S., Nose, T., Ito, A. (2017). A Study on Tailor-Made Speech Synthesis Based on Deep Neural Networks. In: Pan, JS., Tsai, PW., Huang, HC. (eds) Advances in Intelligent Information Hiding and Multimedia Signal Processing. Smart Innovation, Systems and Technologies, vol 63. Springer, Cham. https://doi.org/10.1007/978-3-319-50209-0_20

DOI: https://doi.org/10.1007/978-3-319-50209-0_20
Published: 22 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50208-3
Online ISBN: 978-3-319-50209-0
eBook Packages: Engineering, Engineering (R0)
