Voice Conversion from Arbitrary Speakers Based on Deep Neural Networks with Adversarial Learning

Miyamoto, Sou; Nose, Takashi; Ito, Suzunosuke; Koike, Harunori; Chiba, Yuya; Ito, Akinori; Shinozaki, Takahiro

doi:10.1007/978-3-319-63859-1_13

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 82)

Included in the following conference series:

International Conference on Intelligent Information Hiding and Multimedia Signal Processing

Abstract

In this study, we propose a technique for voice conversion from arbitrary speakers based on deep neural networks, realized by introducing adversarial learning into conventional voice conversion. In addition to the generative model, adversarial learning uses a discriminative model that classifies input speech as either natural or converted, which is expected to yield more natural converted speech. Experiments showed that the proposed method was effective in enhancing the global variance (GV) of the mel-cepstrum, but the naturalness of the converted speech was slightly lower than that of speech converted using the conventional variance compensation technique.
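The two quantities the abstract refers to, the GV of a mel-cepstrum sequence and the generator/discriminator objectives of adversarial learning, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-squared-error term, the `adv_weight` parameter, and the binary cross-entropy form of the losses are all assumptions chosen for illustration, and the discriminator outputs are taken to be probabilities that a frame is natural speech.

```python
import numpy as np


def global_variance(mcep):
    """Per-dimension variance over time of a (frames, dims) mel-cepstrum array."""
    return np.asarray(mcep, dtype=float).var(axis=0)


def discriminator_loss(d_natural, d_converted, eps=1e-12):
    """Binary cross-entropy with natural speech labelled 1 and converted speech 0.

    Both arguments are discriminator outputs in [0, 1] (assumed form).
    """
    d_natural = np.asarray(d_natural, dtype=float)
    d_converted = np.asarray(d_converted, dtype=float)
    loss = -(np.mean(np.log(d_natural + eps))
             + np.mean(np.log(1.0 - d_converted + eps)))
    return float(loss)


def generator_loss(converted, target, d_converted, adv_weight=1.0, eps=1e-12):
    """Conversion error plus a weighted adversarial term (assumed form).

    The adversarial term is small when the discriminator judges the
    converted features to be natural.
    """
    converted = np.asarray(converted, dtype=float)
    target = np.asarray(target, dtype=float)
    d_converted = np.asarray(d_converted, dtype=float)
    mse = np.mean((converted - target) ** 2)
    adv = -np.mean(np.log(d_converted + eps))
    return float(mse + adv_weight * adv)


# Toy illustration with random data standing in for mel-cepstra.
rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 4))   # hypothetical natural features
oversmoothed = 0.5 * natural          # shrunken trajectories, as in over-smoothing
print("GV natural:     ", global_variance(natural))
print("GV over-smoothed:", global_variance(oversmoothed))
```

In this toy case, halving the feature trajectories reduces the GV to a quarter in every dimension, which is the kind of variance loss that both adversarial learning and variance compensation aim to counteract.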

Acknowledgment

Part of this work was supported by JSPS KAKENHI Grant Numbers JP26280055 and JP15H02720.

Author information

Authors and Affiliations

Graduate School of Engineering, Tohoku University, Aramaki Aza Aoba 6–6–05, Aoba-ku, Sendai-shi, Miyagi, 980–8579, Japan
Sou Miyamoto, Takashi Nose, Suzunosuke Ito, Harunori Koike, Yuya Chiba & Akinori Ito
Department of Information and Communication Engineering, School of Engineering, Tokyo Institute of Technology, Nagatsuta-cho 4259, Midori-ku, Yokohama-shi, Kanagawa, 226-8502, Japan
Suzunosuke Ito, Harunori Koike & Takahiro Shinozaki

Corresponding author

Correspondence to Sou Miyamoto.

Editor information

Editors and Affiliations

Fujian Provincial Key Lab of Big Data Mining and Applications, Fujian University of Technology, Fuzhou, Fujian, China
Jeng-Shyang Pan
Swinburne University of Technology, Hawthorn, Victoria, Australia
Pei-Wei Tsai
Universiti Teknologi Petronas, Teronoh, Malaysia
Junzo Watada
University of Canberra, Bruce, Australian Capital Territory, Australia
Lakhmi C. Jain

About this paper

Cite this paper

Miyamoto, S. et al. (2018). Voice Conversion from Arbitrary Speakers Based on Deep Neural Networks with Adversarial Learning. In: Pan, JS., Tsai, PW., Watada, J., Jain, L. (eds) Advances in Intelligent Information Hiding and Multimedia Signal Processing. IIH-MSP 2017. Smart Innovation, Systems and Technologies, vol 82. Springer, Cham. https://doi.org/10.1007/978-3-319-63859-1_13

DOI: https://doi.org/10.1007/978-3-319-63859-1_13
Published: 18 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63858-4
Online ISBN: 978-3-319-63859-1
eBook Packages: Engineering, Engineering (R0)
