Abstract
This paper proposes parallel dictionary learning for multimodal voice conversion (VC). Owing to the noise robustness of visual features, multimodal features have attracted attention in the field of speech processing, and we have previously proposed multimodal VC using non-negative matrix factorization (NMF). Experimental results showed that our conventional multimodal VC converts effectively in a noisy environment; in a clean environment, however, the difference in conversion quality between audio-input VC and multimodal VC is small. We assume this is because our exemplar dictionary is over-complete. Moreover, because of the non-negativity constraint on visual features, our conventional multimodal NMF-based VC cannot factorize visual features effectively. To enhance the conversion quality of our NMF-based multimodal VC, we propose parallel dictionary learning, in which the non-negativity constraint on visual features is removed so that visual features containing negative values can be handled. Experimental results showed that the proposed method effectively converts multimodal features in a clean environment.
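The core of exemplar-based NMF voice conversion, as described in the abstract, is to estimate non-negative activations of a source exemplar dictionary and then apply those same activations to a parallel target dictionary. The following is a minimal sketch of that idea, assuming Euclidean-cost multiplicative updates (Lee and Seung) and randomly generated toy spectra; all variable names and sizes are illustrative, not taken from the paper.

```python
import numpy as np

def nmf_activations(X, W, n_iter=200, eps=1e-9):
    """Estimate non-negative activations H such that X ~ W @ H,
    using multiplicative updates for the Euclidean cost."""
    H = np.random.default_rng(0).random((W.shape[1], X.shape[1]))
    for _ in range(n_iter):
        # Multiplicative update keeps H non-negative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H

def convert(X_src, W_src, W_tgt):
    """Exemplar-based conversion: activations estimated against the
    source dictionary are reapplied to the parallel target dictionary."""
    H = nmf_activations(X_src, W_src)
    return W_tgt @ H

# Toy example with random non-negative spectra.
rng = np.random.default_rng(1)
W_src = rng.random((20, 50))   # source exemplar dictionary (bins x exemplars)
W_tgt = rng.random((20, 50))   # parallel target exemplar dictionary
X_src = rng.random((20, 10))   # source spectrogram frames (bins x frames)
X_conv = convert(X_src, W_src, W_tgt)
```

Because source and target dictionaries are built frame-by-frame from parallel data, a sparse activation of source exemplars maps directly onto the corresponding target exemplars. The proposed method departs from this plain NMF setup by learning the dictionaries and by lifting the non-negativity constraint on the visual part of the feature vector.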
© 2016 Springer International Publishing Switzerland
Cite this chapter
Aihara, R., Masaka, K., Takiguchi, T., Ariki, Y. (2016). Parallel Dictionary Learning for Multimodal Voice Conversion Using Matrix Factorization. In: Lee, R. (eds) Computer and Information Science. Studies in Computational Intelligence, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-319-40171-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40170-6
Online ISBN: 978-3-319-40171-3
eBook Packages: Engineering (R0)