Skip to main content
Log in

Exemplar-based voice conversion using joint nonnegative matrix factorization

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-target exemplars, and assumes they can share the same activation matrix. At runtime, a source spectrogram is factorized as a product of the source dictionary and the common activation matrix, which is applied to the target dictionary to generate the target spectrogram. In practice, either low-resolution mel-scale filter bank energies or high-resolution spectra are adopted in the source dictionary. Low-resolution features are flexible in capturing the temporal information without increasing the computational cost and the memory occupation significantly, while high-resolution spectra contain significant spectral details. In this paper, we propose a joint nonnegative matrix factorization technique to find the common activation matrix using low- and high-resolution features at the same time. In this way, the common activation matrix is able to benefit from low- and high-resolution features directly. We conducted experiments on the VOICES database to evaluate the performance of the proposed method. Both objective and subjective evaluations confirmed the effectiveness of the proposed methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. https://www.mturk.com/mturk/welcome

References

  1. Afify M, Cui X, Gao Y (2007) Stereo-based stochastic mapping for robust speech recognition. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)

  2. Buchholz S, Latorre J (2011) Crowdsourcing preference tests, and how to detect cheating. In: Proceedings Interspeech

  3. Cichocki A, Zdunek R, Amari S (2006) Csiszars divergences for non-negative matrix factorization: family of new algorithms. In: Independent component analysis and blind signal separation. Springer, pp 32–39

  4. Desai S, Black A, Yegnanarayana B, Prahallad K (2010) Spectral mapping using artificial neural networks for voice conversion. IEEE Trans Audio Speech Lang Process 18(5):954–964

    Article  Google Scholar 

  5. Gemmeke J, Virtanen T, Hurmalainen A (2011) Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans Audio Speech Lang Process 19(7):2067–2080

    Article  Google Scholar 

  6. Helander E, Silén H, Virtanen T, Gabbouj M (2012) Voice conversion using dynamic kernel partial least squares regression. IEEE Trans Audio Speech Lang Process 20(3):806–817

    Article  Google Scholar 

  7. Helander E, Virtanen T, Nurminen J, Gabbouj M (2010) Voice conversion using partial least squares regression. IEEE Trans Audio Speech Lang Process 18(5):912–921

    Article  Google Scholar 

  8. Kain A, Macon MW (1998) Spectral voice conversion for text-to-speech synthesis. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)

  9. Kain AB (2001) High resolution voice transformation. Ph.D. thesis, OGI School of Science & Engineering at Oregon Health & Science University

  10. Kawahara H, Masuda-Katsuse I, de Cheveigné A (1999) Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech Commun 27(3):187–207

    Article  Google Scholar 

  11. Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52(1):12–40

    Article  Google Scholar 

  12. Kinnunen T, Wu ZZ, Lee KA, Sedlak F, Chng ES, Li H (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)

  13. Mathias SR, von Kriegstein K (2014) How do we recognise who is speaking? Front Biosci (Scholar edition) 6:92

    Article  Google Scholar 

  14. Nakamura K, Toda T, Saruwatari H, Shikano K (2012) Speaking-aid systems using gmm-based voice conversion for electrolaryngeal speech. Speech Commun 54(1):134–146

    Article  Google Scholar 

  15. Ribeiro F, Florêncio D, Zhang C, Seltzer M (2011) Crowdmos: an approach for crowdsourcing mean opinion score studies. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)

  16. Seung D, Lee L (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst 13:556–562

    Google Scholar 

  17. Stylianou Y, Cappé O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 6(2):131–142

    Article  Google Scholar 

  18. Takashima R, Takiguchi T, Ariki Y (2012) Exemplar-based voice conversion in noisy environment. In: Proceedings IEEE spoken language technology workshop (SLT)

  19. Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235

    Article  Google Scholar 

  20. Toda T, Nakamura K, Sekimoto H, Shikano K (2009) Voice conversion for various types of body transmitted speech. In: Proceedings IEEE international conference on acoustics, speech, and signal processing (ICASSP)

  21. Tokuda K, Kobayashi T, Masuko T, Imai S (1994) Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In: Proceedings international conference on spoken language processing (ICSLP)

  22. Wolters MK, Isaac KB, Renals S Evaluating speech synthesis intelligibility using amazon mechanical turk. In: 7th ISCA speech synthesis workshop (SSW7)

  23. Wu Z, Kinnunen T, Chng ES, Li H, Ambikairajah E (2012) A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In: Proceedings Asia-Pacific signal information processing association annual summit and conference (APSIPA ASC)

  24. Wu Z, Virtanen T, Chng ES, Li H (2014) Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Trans Audio Speech Lang Process

  25. Wu Z, Virtanen T, Kinnunen T, Chng ES, Li H (2013) Exemplar-based voice conversion using non-negative spectrogram deconvolution. In: the 8th ISCA speech synthesis workshop (SSW8)

  26. Zen H, Nankaku Y, Tokuda K (2011) Continuous stochastic feature mapping based on trajectory HMMs. IEEE Trans Audio Speech Lang Process 19(2):417–430

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zhizheng Wu.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wu, Z., Chng, E.S. & Li, H. Exemplar-based voice conversion using joint nonnegative matrix factorization. Multimed Tools Appl 74, 9943–9958 (2015). https://doi.org/10.1007/s11042-014-2180-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-014-2180-2

Keywords

Navigation