Multimedia Tools and Applications, Volume 75, Issue 9, pp 5311–5327

On the study of replay and voice conversion attacks to text-dependent speaker verification

  • Zhizheng Wu
  • Haizhou Li

Abstract

Automatic speaker verification (ASV) is the task of automatically accepting or rejecting a claimed identity based on a speech sample. Recently, individual studies have confirmed the vulnerability of state-of-the-art text-independent ASV systems to replay, speech synthesis and voice conversion attacks on various databases. However, the behaviour of text-dependent ASV systems in the face of various spoofing attacks has not been systematically assessed. In this work, we first conduct a systematic analysis of the vulnerability of text-dependent ASV systems to replay and voice conversion attacks using the same protocol and database, namely the RSR2015 database, which represents mobile device quality speech. We then analyse the interplay of voice conversion and speaker verification by linking voice conversion objective evaluation measures with speaker verification error rates, examining the vulnerabilities from the perspective of voice conversion.

Keywords

Speaker verification, Spoofing attack, Replay, Voice conversion, Security

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. The Centre for Speech Technology Research (CSTR), University of Edinburgh, Edinburgh, UK
  2. Human Language Technology Department, Institute for Infocomm Research (I2R), Singapore, Singapore
