Abstract
This paper proposes parallel dictionary learning for multimodal voice conversion (VC). Owing to the noise robustness of visual features, multimodal features have attracted attention in the field of speech processing, and we have previously proposed multimodal VC using non-negative matrix factorization (NMF). Experimental results showed that our conventional multimodal VC converts effectively in a noisy environment; in a clean environment, however, the difference in conversion quality between audio-input VC and multimodal VC is small. We assume this is because our exemplar dictionary is over-complete. Moreover, because of the non-negativity constraint on visual features, our conventional multimodal NMF-based VC cannot factorize visual features effectively. To enhance the conversion quality of our NMF-based multimodal VC, we propose parallel dictionary learning, in which the non-negativity constraint on visual features is removed so that visual features containing negative values can be handled. Experimental results showed that the proposed method effectively converts multimodal features in a clean environment.
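The core of exemplar-based NMF voice conversion, as described in the abstract, is to estimate non-negative activations of a source exemplar dictionary and then apply those same activations to a parallel target dictionary. The following is a minimal sketch of that idea, assuming Euclidean-cost multiplicative updates (Lee and Seung) and randomly generated toy spectra; all variable names and sizes are illustrative, not taken from the paper.

```python
import numpy as np

def nmf_activations(X, W, n_iter=200, eps=1e-9):
    """Estimate non-negative activations H such that X ~ W @ H,
    using multiplicative updates for the Euclidean cost."""
    H = np.random.default_rng(0).random((W.shape[1], X.shape[1]))
    for _ in range(n_iter):
        # Multiplicative update keeps H non-negative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H

def convert(X_src, W_src, W_tgt):
    """Exemplar-based conversion: activations estimated against the
    source dictionary are reapplied to the parallel target dictionary."""
    H = nmf_activations(X_src, W_src)
    return W_tgt @ H

# Toy example with random non-negative spectra.
rng = np.random.default_rng(1)
W_src = rng.random((20, 50))   # source exemplar dictionary (bins x exemplars)
W_tgt = rng.random((20, 50))   # parallel target exemplar dictionary
X_src = rng.random((20, 10))   # source spectrogram frames (bins x frames)
X_conv = convert(X_src, W_src, W_tgt)
```

Because source and target dictionaries are built frame-by-frame from parallel data, a sparse activation of source exemplars maps directly onto the corresponding target exemplars. The proposed method departs from this plain NMF setup by learning the dictionaries and by lifting the non-negativity constraint on the visual part of the feature vector.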
© 2016 Springer International Publishing Switzerland
Cite this chapter
Aihara, R., Masaka, K., Takiguchi, T., Ariki, Y. (2016). Parallel Dictionary Learning for Multimodal Voice Conversion Using Matrix Factorization. In: Lee, R. (eds) Computer and Information Science. Studies in Computational Intelligence, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-319-40171-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40170-6
Online ISBN: 978-3-319-40171-3
eBook Packages: Engineering (R0)