Parallel Dictionary Learning for Multimodal Voice Conversion Using Matrix Factorization

Chapter in: Computer and Information Science

Part of the book series: Studies in Computational Intelligence (SCI, volume 656)

Abstract

This paper proposes parallel dictionary learning for multimodal voice conversion (VC). Because visual features are robust to acoustic noise, multimodal features have attracted attention in the field of speech processing, and we have previously proposed multimodal VC based on Non-negative Matrix Factorization (NMF). Experimental results showed that our conventional multimodal VC converts effectively in a noisy environment; in a clean environment, however, the difference in conversion quality between audio-only VC and multimodal VC is small. We assume this is because our exemplar dictionary is over-complete. Moreover, owing to the non-negativity constraint, our conventional NMF-based multimodal VC cannot factorize visual features effectively. To enhance the conversion quality of our NMF-based multimodal VC, we propose parallel dictionary learning. The non-negativity constraint on visual features is removed so that visual features containing negative values can be handled. Experimental results showed that our proposed method effectively converts multimodal features in a clean environment.
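The exemplar-based NMF conversion step that the abstract builds on can be sketched as follows: frame-wise activations are estimated against a source-speaker dictionary, then applied to a parallel target-speaker dictionary. This is a minimal illustration with standard multiplicative updates; the dictionary matrices here are random placeholders, not the learned parallel dictionaries of the paper, and all variable names are hypothetical.

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-12):
    """Estimate non-negative activations H such that V ~ W @ H,
    using the standard multiplicative update rule (Lee & Seung)."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update keeps H non-negative by construction.
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    return H

# Toy parallel dictionaries: in the paper these hold time-aligned
# source/target exemplars; here they are random for illustration only.
d, k, t = 20, 8, 30                  # feature dim, exemplars, frames
rng = np.random.default_rng(1)
W_src = rng.random((d, k))           # source-speaker dictionary (placeholder)
W_tgt = rng.random((d, k))           # target-speaker dictionary (placeholder)

V_src = W_src @ rng.random((k, t))   # synthetic source-speaker features
H = nmf_activations(V_src, W_src)    # frame-wise exemplar activations
V_converted = W_tgt @ H              # same activations, target dictionary
```

The key assumption, as in exemplar-based VC generally, is that activations estimated on the source dictionary carry over to the parallel target dictionary because the exemplars are time-aligned.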



Author information

Correspondence to Ryo Aihara.


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Aihara, R., Masaka, K., Takiguchi, T., Ariki, Y. (2016). Parallel Dictionary Learning for Multimodal Voice Conversion Using Matrix Factorization. In: Lee, R. (eds) Computer and Information Science. Studies in Computational Intelligence, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-319-40171-3_3

  • DOI: https://doi.org/10.1007/978-3-319-40171-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40170-6

  • Online ISBN: 978-3-319-40171-3

  • eBook Packages: Engineering (R0)
