Exemplar-based voice conversion using joint nonnegative matrix factorization

Wu, Zhizheng; Chng, Eng Siong; Li, Haizhou

doi:10.1007/s11042-014-2180-2

Exemplar-based voice conversion using joint nonnegative matrix factorization

Published: 06 August 2014

Volume 74, pages 9943–9958, (2015)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Zhizheng Wu^1,2,
Eng Siong Chng³ &
Haizhou Li⁴

375 Accesses
22 Citations
Explore all metrics

Abstract

Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-target exemplars, and assumes they can share the same activation matrix. At runtime, a source spectrogram is factorized as a product of the source dictionary and the common activation matrix, which is applied to the target dictionary to generate the target spectrogram. In practice, either low-resolution mel-scale filter bank energies or high-resolution spectra are adopted in the source dictionary. Low-resolution features are flexible in capturing the temporal information without increasing the computational cost and the memory occupation significantly, while high-resolution spectra contain significant spectral details. In this paper, we propose a joint nonnegative matrix factorization technique to find the common activation matrix using low- and high-resolution features at the same time. In this way, the common activation matrix is able to benefit from low- and high-resolution features directly. We conducted experiments on the VOICES database to evaluate the performance of the proposed method. Both objective and subjective evaluations confirmed the effectiveness of the proposed methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

https://www.mturk.com/mturk/welcome

References

Afify M, Cui X, Gao Y (2007) Stereo-based stochastic mapping for robust speech recognition. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Buchholz S, Latorre J (2011) Crowdsourcing preference tests, and how to detect cheating. In: Proceedings Interspeech
Cichocki A, Zdunek R, Amari S (2006) Csiszars divergences for non-negative matrix factorization: family of new algorithms. In: Independent component analysis and blind signal separation. Springer, pp 32–39
Desai S, Black A, Yegnanarayana B, Prahallad K (2010) Spectral mapping using artificial neural networks for voice conversion. IEEE Trans Audio Speech Lang Process 18(5):954–964
Article Google Scholar
Gemmeke J, Virtanen T, Hurmalainen A (2011) Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans Audio Speech Lang Process 19(7):2067–2080
Article Google Scholar
Helander E, Silén H, Virtanen T, Gabbouj M (2012) Voice conversion using dynamic kernel partial least squares regression. IEEE Trans Audio Speech Lang Process 20(3):806–817
Article Google Scholar
Helander E, Virtanen T, Nurminen J, Gabbouj M (2010) Voice conversion using partial least squares regression. IEEE Trans Audio Speech Lang Process 18(5):912–921
Article Google Scholar
Kain A, Macon MW (1998) Spectral voice conversion for text-to-speech synthesis. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Kain AB (2001) High resolution voice transformation. Ph.D. thesis, OGI School of Science & Engineering at Oregon Health & Science University
Kawahara H, Masuda-Katsuse I, de Cheveigné A (1999) Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech Commun 27(3):187–207
Article Google Scholar
Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52(1):12–40
Article Google Scholar
Kinnunen T, Wu ZZ, Lee KA, Sedlak F, Chng ES, Li H (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Mathias SR, von Kriegstein K (2014) How do we recognise who is speaking? Front Biosci (Scholar edition) 6:92
Article Google Scholar
Nakamura K, Toda T, Saruwatari H, Shikano K (2012) Speaking-aid systems using gmm-based voice conversion for electrolaryngeal speech. Speech Commun 54(1):134–146
Article Google Scholar
Ribeiro F, Florêncio D, Zhang C, Seltzer M (2011) Crowdmos: an approach for crowdsourcing mean opinion score studies. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Seung D, Lee L (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst 13:556–562
Google Scholar
Stylianou Y, Cappé O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 6(2):131–142
Article Google Scholar
Takashima R, Takiguchi T, Ariki Y (2012) Exemplar-based voice conversion in noisy environment. In: Proceedings IEEE spoken language technology workshop (SLT)
Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235
Article Google Scholar
Toda T, Nakamura K, Sekimoto H, Shikano K (2009) Voice conversion for various types of body transmitted speech. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Tokuda K, Kobayashi T, Masuko T, Imai S (1994) Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In: Proceedings international conference on spoken language processing (ICSLP)
Wolters MK, Isaac KB, Renals S Evaluating speech synthesis intelligibility using amazon mechanical turk. In: 7th ISCA speech synthesis workshop (SSW7)
Wu Z, Kinnunen T, Chng ES, Li H, Ambikairajah E (2012) A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In: Proceedings Asia-Pacific signal information processing association annual summit and conference (APSIPA ASC)
Wu Z, Virtanen T, Chng ES, Li H (2014) Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Trans Audio Speech Lang Process
Wu Z, Virtanen T, Kinnunen T, Chng ES, Li H (2013) Exemplar-based voice conversion using non-negative spectrogram deconvolution. In: the 8th ISCA speech synthesis workshop (SSW8)
Zen H, Nankaku Y, Tokuda K (2011) Continuous stochastic feature mapping based on trajectory HMMs. IEEE Trans Audio Speech Lang Process 19(2):417–430
Article Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Zhizheng Wu
Centre for Speech Technology Research, University of Edinburgh, Edinburgh, United Kingdom
Zhizheng Wu
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Eng Siong Chng
Human Language Technology Department, Institute for Infocomm Research, School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Haizhou Li

Authors

Zhizheng Wu
View author publications
You can also search for this author in PubMed Google Scholar
Eng Siong Chng
View author publications
You can also search for this author in PubMed Google Scholar
Haizhou Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zhizheng Wu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wu, Z., Chng, E.S. & Li, H. Exemplar-based voice conversion using joint nonnegative matrix factorization. Multimed Tools Appl 74, 9943–9958 (2015). https://doi.org/10.1007/s11042-014-2180-2

Download citation

Received: 15 April 2014
Revised: 04 July 2014
Accepted: 06 July 2014
Published: 06 August 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s11042-014-2180-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Exemplar-based voice conversion using joint nonnegative matrix factorization

Abstract

Access this article

Similar content being viewed by others

Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization

Multimodal voice conversion based on non-negative matrix factorization

The Voice Conversion Method Based on Sparse Convolutive Non-negative Matrix Factorization

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Exemplar-based voice conversion using joint nonnegative matrix factorization

Abstract

Access this article

Similar content being viewed by others

Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization

Multimodal voice conversion based on non-negative matrix factorization

The Voice Conversion Method Based on Sparse Convolutive Non-negative Matrix Factorization

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation