Advertisement

Introduction to Voice Presentation Attack Detection and Recent Advances

  • Md SahidullahEmail author
  • Héctor Delgado
  • Massimiliano Todisco
  • Tomi Kinnunen
  • Nicholas Evans
  • Junichi Yamagishi
  • Kong-Aik Lee
Chapter
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

Over the past few years, significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker recognition (ASV). This includes the development of new speech corpora, standard evaluation protocols and advancements in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has enabled for the first time the meaningful benchmarking of different PAD solutions. This chapter summarises the progress, with a focus on studies completed in the last 3 years. The article presents a summary of findings and lessons learned from two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is required to develop generalised PAD solutions which have potential to detect diverse and previously unseen spoofing attacks.

References

  1. 1.
    Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: From features to supervectors. Speech Commun 52(1):12–40.  https://doi.org/10.1016/j.specom.2009.08.009. http://www.sciencedirect.com/science/article/pii/S0167639309001289CrossRefGoogle Scholar
  2. 2.
    Hansen J, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag 32(6):74–99CrossRefGoogle Scholar
  3. 3.
    ISO/IEC 30107: Information technology—biometric presentation attack detection. International Organization for Standardization (2016)Google Scholar
  4. 4.
    Kinnunen T, Sahidullah M, Kukanov I, Delgado H, Todisco M, Sarkar A, Thomsen N, Hautamäki V, Evans N, Tan ZH (2016) Utterance verification for text-dependent speaker recognition: a comparative assessment using the reddots corpus. In: Proceedings of Interspeech, pp 430–434Google Scholar
  5. 5.
    Shang, W, Stevenson, M. (2010). Score normalization in playback attack detection. In: Proceedings of ICASSP. IEEE, pp 1678–1681Google Scholar
  6. 6.
    Wu Z, Evans N, Kinnunen T, Yamagishi J, Alegre F, Li H (2015) Spoofing and countermeasures for speaker verification: a survey. Speech Commun 66:130–153CrossRefGoogle Scholar
  7. 7.
    Korshunov P, Marcel S, Muckenhirn H, Gonçalves A, Mello A, Violato R, Simoes F, Neto M, de Angeloni AM, Stuchi J, Dinkel H, Chen N, Qian Y, Paul D, Saha G, Sahidullah M. (2016). Overview of BTAS 2016 speaker anti-spoofing competition. In: 2016 IEEE 8th international conference on biometrics theory, applications and systems (BTAS), pp 1–6 (2016)Google Scholar
  8. 8.
    Evans N, Kinnunen T, Yamagishi J, Wu Z, Alegre F, DeLeon P (2014) Speaker recognition anti-spoofing. In: Marcel S, Li, SZ, Nixon M (eds) Handbook of biometric anti-spoofing. SpringerGoogle Scholar
  9. 9.
    Marcel S, Li SZ, Nixon M (eds) Handbook of biometric anti-spoofing: trusted biometrics under spoofing attacks. Springer (2014)Google Scholar
  10. 10.
    Farrús Cabeceran M, Wagner M, Erro D, Pericás H (2010) Automatic speaker recognition as a measurement of voice imitation and conversion. The Int J Speech Lang Law 1(17):119–142Google Scholar
  11. 11.
    Perrot P, Aversano G, Chollet G (2007) Voice disguise and automatic detection: review and perspectives. Progress in nonlinear speech processing, pp. 101–117Google Scholar
  12. 12.
    Zetterholm E (2007) Detection of speaker characteristics using voice imitation. In: Speaker Classification II. Springer, pp 192–205Google Scholar
  13. 13.
    Lau Y, Wagner M, Tran D (2004) Vulnerability of speaker verification to voice mimicking. In: Proceedings of 2004 international symposium on intelligent multimedia, video and speech processing, 2004. IEEE, pp 145–148Google Scholar
  14. 14.
    Lau Y, Tran D, Wagner M (2005) Testing voice mimicry with the YOHO speaker verification corpus. In: International conference on knowledge-based and intelligent information and engineering systems. Springer, pp 15–21Google Scholar
  15. 15.
    Mariéthoz J, Bengio S (2005) Can a professional imitator fool a GMM-based speaker verification system? Technical report, Idiap Research InstituteGoogle Scholar
  16. 16.
    Panjwani S, Prakash A (2014) Crowdsourcing attacks on biometric systems. In: Symposium on usable privacy and security (SOUPS 2014), pp 257–269Google Scholar
  17. 17.
    Hautamäki R, Kinnunen T, Hautamäki V, Laukkanen AM (2015) Automatic versus human speaker verification: the case of voice mimicry. Speech Commun 72:13–31CrossRefGoogle Scholar
  18. 18.
    Ergunay S, Khoury E, Lazaridis A, Marcel S (2015) On the vulnerability of speaker verification to realistic voice spoofing. In: IEEE international conference on biometrics: theory, applications and systems, pp 1–8Google Scholar
  19. 19.
    Lindberg J, Blomberg M (1999) Vulnerability in speaker verification-a study of technical impostor techniques. Proceedings of the European conference on speech communication and technology 3:1211–1214Google Scholar
  20. 20.
    Villalba J, Lleida E (2010) Speaker verification performance degradation against spoofing and tampering attacks. In: FALA 10 workshop, pp 131–134Google Scholar
  21. 21.
    Wang ZF, Wei G, He QH (2011) Channel pattern noise based playback attack detection algorithm for speaker recognition. In: 2011 International conference on machine learning and cybernetics, vol 4, pp 1708–1713Google Scholar
  22. 22.
    Villalba J, Lleida E (2011) Preventing replay attacks on speaker verification systems. In: 2011 IEEE International Carnahan Conference on Security Technology (ICCST). IEEE, pp 1–8Google Scholar
  23. 23.
    Gałka J, Grzywacz M, Samborski R (2015) Playback attack detection for text-dependent speaker verification over telephone channels. Speech Commun 67:143–153CrossRefGoogle Scholar
  24. 24.
    Taylor P (2009) Text-to-speech synthesis. Cambridge University PressGoogle Scholar
  25. 25.
    Klatt DH (1980) Software for a cascade/parallel formant synthesizer. J Acoust Soc Am 67:971–995CrossRefGoogle Scholar
  26. 26.
    Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 9:453–467CrossRefGoogle Scholar
  27. 27.
    Hunt A, Black AW (1996) Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings ICASSP, pp 373–376Google Scholar
  28. 28.
    Breen A, Jackson P (1998) A phonologically motivated method of selecting nonuniform units. In: Proceedings of ICSLP, pp 2735–2738Google Scholar
  29. 29.
    Donovan RE, Eide EM (1998) The IBM trainable speech synthesis system. In: Proceedings of ICSLP, pp 1703–1706Google Scholar
  30. 30.
    Beutnagel B, Conkie A, Schroeter J, Stylianou Y, Syrdal A (1999) The AT&T Next-Gen TTS system. In: Proceedigns of joint ASA, EAA and DAEA meeting, pp 15–19CrossRefGoogle Scholar
  31. 31.
    Coorman G, Fackrell J, Rutten P, Coile B (2000) Segment selection in the L & H realspeak laboratory TTS system. In: Proceedings of ICSLP, pp 395–398Google Scholar
  32. 32.
    Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T (1999) Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In: Proceedings of Eurospeech, pp 2347–2350Google Scholar
  33. 33.
    Ling ZH, Wu YJ, Wang YP, Qin L, Wang RH (2006) USTC system for Blizzard Challenge 2006 an improved HMM-based speech synthesis method. In: Proceedings of the Blizzard challenge workshopGoogle Scholar
  34. 34.
    Black A (2006) CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling. In: Proceedings of Interspeech, pp 1762–1765Google Scholar
  35. 35.
    Zen H, Toda T, Nakamura M, Tokuda K (2007) Details of the Nitech HMM-based speech synthesis system for the Blizzard challenge 2005. IEICE Trans Inf Syst E90-D(1):325–333CrossRefGoogle Scholar
  36. 36.
    Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51(11):1039–1064CrossRefGoogle Scholar
  37. 37.
    Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J (2009) Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans Speech Audio Lang Process 17(1), 66–83 (2009)CrossRefGoogle Scholar
  38. 38.
    Leggetter CJ, Woodland PC (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9:171–185CrossRefGoogle Scholar
  39. 39.
    Woodland PC (2001) Speaker adaptation for continuous density HMMs: a review. In: Proceedings of ISCA workshop on adaptation methods for speech recognition, p 119Google Scholar
  40. 40.
    Ze H, Senior A, Schuster M (2013) Statistical parametric speech synthesis using deep neural networks. In: Proceedings of ICASSP, pp 7962–7966Google Scholar
  41. 41.
    Ling ZH, Deng L, Yu D (2013) Modeling spectral envelopes using restricted boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans Audio Speech Lang Process 21(10):2129–2139CrossRefGoogle Scholar
  42. 42.
    Fan Y, Qian Y, Xie FL, Soong F (2014) TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Proceedings of Interspeech, pp 1964–1968Google Scholar
  43. 43.
    Zen H, Sak H (2015) Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In: Proceedings of ICASSP, pp 4470–4474Google Scholar
  44. 44.
    Wu Z, King S (2016) Investigating gated recurrent networks for speech synthesis. In: Proceedings of ICASSP, pp 5140–5144 (2016)Google Scholar
  45. 45.
    Wang X, Takaki S, Yamagishi J (2016) Investigating very deep highway networks for parametric speech synthesis. In: 9th ISCA speech synthesis workshop, pp 166–171Google Scholar
  46. 46.
    Wang X, Takaki S, Yamagishi J (2018) Investigating very deep highway networks for parametric speech synthesis. Speech Commun 96:1–9CrossRefGoogle Scholar
  47. 47.
    Wang X, Takaki S, Yamagishi J (2017) An autoregressive recurrent mixture density network for parametric speech synthesis. In: Proceedings of ICASSP, pp 4895–4899Google Scholar
  48. 48.
    Wang X, Takaki S, Yamagishi J (2017) An RNN-based quantized F0 model with multi-tier feedback links for text-to-speech synthesis. In: Proceedings of Interspeech, pp 1059–1063 (2017)Google Scholar
  49. 49.
    Saito, Y., Takamichi, S., Saruwatari, H.: Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In: Proc. ICASSP, pp 4900–4904 (2017)Google Scholar
  50. 50.
    Saito Y, Takamichi S, Saruwatari H (2018) Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Trans Audio Speech Lang Process 26(1):84–96CrossRefGoogle Scholar
  51. 51.
    Kaneko T, Kameoka H, Hojo N, Ijima Y, Hiramatsu K, Kashino K (2017) Generative adversarial network-based postfilter for statistical parametric speech synthesis. In: Proceedings of ICASSP, pp 4910–4914Google Scholar
  52. 52.
    Van Oord D, Dieleman A, Zen S, Simonyan H, Vinyals K, Graves O, Kalchbrenner A, Senior N, Kavukcuoglu AK (2016) Wavenet: a generative model for raw audio. arXiv:1609.03499
  53. 53.
    Mehri S, Kumar K, Gulrajani I, Kumar R, Jain S, Sotelo J, Courville A, Bengio Y (2016) Samplernn: an unconditional end-to-end neural audio generation model. arXiv:1612.07837
  54. 54.
    Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss R, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, Le Q, Agiomyrgiannakis Y, Clark R, Saurous R (2017) Tacotron: towards end-to-end speech synthesis. In: Proceedings of Interspeech, pp 4006–4010Google Scholar
  55. 55.
    Gibiansky A, Arik S, Diamos G, Miller J, Peng K, Ping W, Raiman J, Zhou Y (2017) Deep voice 2: multi-speaker neural text-to-speech. In: Advances in neural information processing systems, pp 2966–2974Google Scholar
  56. 56.
    Shen J, Schuster M, Jaitly N, Skerry-Ryan R, Saurous R, Weiss R, Pang R, Agiomyrgiannakis Y, Wu Y, Zhang Y, Wang Y, Chen Z, Yang Z (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In: Proceedigns of ICASSPGoogle Scholar
  57. 57.
    King S (2014) Measuring a decade of progress in text-to-speech. Loquens 1(1):006CrossRefGoogle Scholar
  58. 58.
    King S, Wihlborg L, Guo W (2017) The blizzard challenge 2017. In: Proceedings of Blizzard Challenge Workshop, Stockholm, SwedenGoogle Scholar
  59. 59.
    Foomany F, Hirschfield A, Ingleby M (2009) Toward a dynamic framework for security evaluation of voice verification systems. In: 2009 IEEE toronto international conference science and technology for humanity (TIC-STH), pp 22–27Google Scholar
  60. 60.
    Masuko T, Hitotsumatsu T, Tokuda K, Kobayashi T (1999) On the security of HMM-based speaker verification systems against imposture using synthetic speech. In: Proceedings of EUROSPEECHGoogle Scholar
  61. 61.
    Matsui T, Furui S (1995) Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Commun 17(1–2):109–116CrossRefGoogle Scholar
  62. 62.
    Masuko T, Tokuda K, Kobayashi T, Imai S (1996) Speech synthesis using HMMs with dynamic features. In: Proceedings of ICASSPGoogle Scholar
  63. 63.
    Masuko T, Tokuda K, Kobayashi T, Imai S (1997) Voice characteristics conversion for HMM-based speech synthesis system. In: Proceedings of ICASSPGoogle Scholar
  64. 64.
    De Leon PL, Pucher M, Yamagishi J, Hernaez I, Saratxaga I (2012) Evaluation of speaker verification security and detection of HMM-based synthetic speech. IEEE Trans Audio Speech Lang Process 20(8):2280–2290CrossRefGoogle Scholar
  65. 65.
    Galou G (2011) Synthetic voice forgery in the forensic context: a short tutorial. In: Forensic speech and audio analysis working group (ENFSI-FSAAWG), pp 1–3Google Scholar
  66. 66.
    Cai W, Doshi A, Valle R (2018) Attacking speaker recognition with deep generative models. arXiv:1801.02384
  67. 67.
    Satoh T, Masuko T, Kobayashi T, Tokuda K (2001) A robust speaker verification system against imposture using an HMM-based speech synthesis system. In: Proceedings of Eurospeech (2001)Google Scholar
  68. 68.
    Chen LW, Guo W, Dai LR (2010) Speaker verification against synthetic speech. In: 2010 7th International symposium on Chinese spoken language processing (ISCSLP), pp 309–312Google Scholar
  69. 69.
    Quatieri TF (2002) Discrete-time speech signal processing: principles and practice. Prentice-Hall, IncGoogle Scholar
  70. 70.
    Wu Z, Chng E, Li H (2012) Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. In: Proceedings of InterspeechGoogle Scholar
  71. 71.
    Ogihara A, Unno H, Shiozakai A (2005) Discrimination method of synthetic speech using pitch frequency against synthetic speech falsification. IEICE Trans Fund Electron Commun Comput Sci 88(1):280–286CrossRefGoogle Scholar
  72. 72.
    De Leon P, Stewart B, Yamagishi J (2012) Synthetic speech discrimination using pitch pattern statistics derived from image analysis. In: Proceedings of Interspeech 2012. Portland, Oregon, USAGoogle Scholar
  73. 73.
    Stylianou Y (2009) Voice transformation: a survey. In: Proceedings of ICASSP, pp 3585–3588Google Scholar
  74. 74.
    Pellom B, Hansen J (1999) An experimental study of speaker verification sensitivity to computer voice-altered imposters. In: Proceedings of ICASSP, vol 2, pp 837–840Google Scholar
  75. 75.
    Mohammadi S, Kain A (2017) An overview of voice conversion systems. Speech Commun 88:65–82CrossRefGoogle Scholar
  76. 76.
    Abe M, Nakamura S, Shikano K, Kuwabara H (1988) Voice conversion through vector quantization. In: Proceedigns of ICASSP, pp 655–658Google Scholar
  77. 77.
    Arslan L (1999) Speaker transformation algorithm using segmental codebooks (STASC). Speech Commun 28(3):211–226CrossRefGoogle Scholar
  78. 78.
    Kain A, Macon M (1998) Spectral voice conversion for text-to-speech synthesis. In: Proceedings of ICASSP vol 1, pp 285–288Google Scholar
  79. 79.
    Stylianou Y, Cappé O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 6(2):131–142CrossRefGoogle Scholar
  80. 80.
    Toda T, Black A, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235CrossRefGoogle Scholar
  81. 81.
    Kobayashi K, Toda T, Neubig G, Sakti S, Nakamura S (2014) Statistical singing voice conversion with direct waveform modification based on the spectrum differential. In: Proceedings of InterspeechGoogle Scholar
  82. 82.
    Popa V, Silen H, Nurminen J, Gabbouj M (2012) Local linear transformation for voice conversion. In: Proceedigns of ICASSP. IEEE, pp 4517–4520Google Scholar
  83. 83.
    Chen Y, Chu M, Chang E, Liu J, Liu R (2003) Voice conversion with smoothed GMM and MAP adaptation. In: Proceedings of EUROSPEECH, pp 2413–2416Google Scholar
  84. 84.
    Hwang HT, Tsao Y, Wang HM, Wang YR, Chen SH (2012) A study of mutual information for GMM-based spectral conversion. In: Proceedigns of InterspeechGoogle Scholar
  85. 85.
    Helander E, Virtanen T, Nurminen J, Gabbouj M (2010) Voice conversion using partial least squares regression. IEEE Trans Audio Speech Lang Process 18(5):912–921CrossRefGoogle Scholar
  86. 86.
    Pilkington N, Zen H, Gales M (2011) Gaussian process experts for voice conversion. In: Proceedings of InterspeechGoogle Scholar
  87. 87.
    Saito D, Yamamoto K, Minematsu N, Hirose K (2011) One-to-many voice conversion based on tensor representation of speaker space. In: Proceedings of Interspeech, pp 653–656Google Scholar
  88. 88.
    Zen H, Nankaku Y, Tokuda K (2011) Continuous stochastic feature mapping based on trajectory HMMs. IEEE Trans Audio Speech Lang Process 19(2):417–430CrossRefGoogle Scholar
  89. 89.
    Wu Z, Kinnunen T, Chng E, Li H (2012) Mixture of factor analyzers using priors from non-parallel speech for voice conversion. IEEE Signal Process Lett 19(12)CrossRefGoogle Scholar
  90. 90.
    Saito D, Watanabe S, Nakamura A, Minematsu N (2012) Statistical voice conversion based on noisy channel model. IEEE Trans Audio Speech Lang Process 20(6):1784–1794CrossRefGoogle Scholar
  91. 91.
    Song P, Bao Y, Zhao L, Zou C (2011) Voice conversion using support vector regression. Electron Lett 47(18):1045–1046CrossRefGoogle Scholar
  92. 92.
    Helander E, Silén H, Virtanen T, Gabbouj M (2012) Voice conversion using dynamic kernel partial least squares regression. IEEE Trans Audio Speech Lang Process 20(3):806–817CrossRefGoogle Scholar
  93. 93.
    Wu Z, Chng E, Li H (2013) Conditional restricted boltzmann machine for voice conversion. In: The first IEEE China summit and international conference on signal and information processing (ChinaSIP). IEEEGoogle Scholar
  94. 94.
    Narendranath M, Murthy H, Rajendran S, Yegnanarayana B (1995) Transformation of formants for voice conversion using artificial neural networks. Speech Commun 16(2):207–216CrossRefGoogle Scholar
  95. 95.
    Desai S, Raghavendra E, Yegnanarayana B, Black A, Prahallad K (2009) Voice conversion using artificial neural networks. In: Proceedings of ICASSP. IEEE, pp 3893–3896Google Scholar
  96. 96.
    Saito Y, Takamichi S, Saruwatari H (2017) Voice conversion using input-to-output highway networks. IEICE Transactions on Inf Syst E100.D(8):1925–1928CrossRefGoogle Scholar
  97. 97.
    Nakashika T, Takiguchi T, Ariki Y (2015) Voice conversion using RNN pre-trained by recurrent temporal restricted boltzmann machines. IEEE/ACM Trans Audio Speech Lang Process (TASLP) 23(3):580–587CrossRefGoogle Scholar
  98. 98.
    Sun L, Kang S, Li K, Meng H (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In: Proceedings of ICASSP, pp 4869–4873Google Scholar
  99. 99.
    Sundermann D, Ney H (2003) VTLN-based voice conversion. In: Proceedings of the 3rd IEEE international symposium on signal processing and information technology, 2003. ISSPIT 2003. IEEEGoogle Scholar
  100. 100.
    Erro D, Moreno A, Bonafonte A (2010) Voice conversion based on weighted frequency warping. IEEE Trans Audio Speech Lang Process 18(5):922–931CrossRefGoogle Scholar
  101. 101.
    Erro D, Navas E, Hernaez I (2013) Parametric voice conversion based on bilinear frequency warping plus amplitude scaling. IEEE Trans Audio Speech Lang Process 21(3):556–566CrossRefGoogle Scholar
  102. 102.
    Hsu CC, Hwang HT, Wu YC, Tsao Y, Wang HM (2017) Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. In: Proceedings of Interspeech, vol 2017, pp 3364–3368Google Scholar
  103. 103.
    Miyoshi H, Saito Y, Takamichi S, Saruwatari H (2017) Voice conversion using sequence-to-sequence learning of context posterior probabilities. Proceedings of Interspeech, vol 2017, pp 1268–1272Google Scholar
  104. 104.
    Fang F, Yamagishi J, Echizen I, Lorenzo-Trueba J (2018) High-quality nonparallel voice conversion based on cycle-consistent adversarial network. In: Proceedings of ICASSP 2018Google Scholar
  105. 105.
    Kobayashi K, Hayashi T, Tamamori A, Toda T (2017) Statistical voice conversion with wavenet-based waveform generation. In: Proceedings of Interspeech, pp 1138–1142Google Scholar
  106. 106.
    Gillet B, King S (2003) Transforming F0 contours. In: Proceedings of EUROSPEECH, pp 101–104 (2003)Google Scholar
  107. 107.
    Wu CH, Hsia CC, Liu TH, Wang JF (2006) Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis. IEEE Trans Audio Speech Lang Process 14(4):1109–1116CrossRefGoogle Scholar
  108. 108.
    Helander E, Nurminen J (2007) A novel method for prosody prediction in voice conversion. In: Proceedings of ICASSP, vol 4. IEEE, pp IV–509Google Scholar
  109. 109.
    Wu Z, Kinnunen T, Chng E, Li H (2010) Text-independent F0 transformation with non-parallel data for voice conversion. In: Proceedings of InterspeechGoogle Scholar
  110. 110.
    Lolive D, Barbot N, Boeffard O (2008) Pitch and duration transformation with non-parallel data. Speech Prosody 2008:111–114Google Scholar
  111. 111.
    Toda T, Chen LH, Saito D, Villavicencio F, Wester M, Wu Z, Yamagishi J (2016) The voice conversion challenge 2016. In: Proceedings of Interspeech, pp 1632–1636Google Scholar
  112. 112.
    Wester M, Wu Z, Yamagishi J (2016) Analysis of the voice conversion challenge 2016 evaluation results. In: Proceedings of Interspeech, pp 1637–1641Google Scholar
  113. 113.
    Perrot P, Aversano G, Blouet R, Charbit M, Chollet G (2005) Voice forgery using ALISP: indexation in a client memory. In: Proceedings of ICASSP, vol 1. IEEE, pp 17–20Google Scholar
  114. 114.
    Matrouf D, Bonastre JF, Fredouille C (2006) Effect of speech transformation on impostor acceptance. In: Proceedings of ICASSP, vol 1. IEEE, pp I–IGoogle Scholar
  115. 115.
    Kinnunen T, Wu Z, Lee K, Sedlak F, Chng E, Li H (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech. In: Proceedings of ICASSP. IEEE, pp 4401–4404Google Scholar
  116. 116.
    Sundermann D, Hoge H, Bonafonte A, Ney H, Black A, Narayanan S (2006) Text-independent voice conversion based on unit selection. In: Proceedings of ICASSP, vol 1, pp I–IGoogle Scholar
  117. 117.
    Wu Z, Larcher A, Lee K, Chng E, Kinnunen T, Li H (2013) Vulnerability evaluation of speaker verification under voice conversion spoofing: the effect of text constraints. In: Proceedings of Interspeech, Lyon, France (2013)Google Scholar
  118. 118.
    Alegre F, Vipperla R, Evans N, Fauve B (2012) On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals. In: 2012 EURASIP conference on european conference on signal processing (EUSIPCO)Google Scholar
  119. 119.
    De Leon PL, Hernaez I, Saratxaga I, Pucher M, Yamagishi J (2011) Detection of synthetic speech for the problem of imposture. In: Proceedings of ICASSP, Dallas, USA, pp 4844–4847Google Scholar
  120. 120.
    Wu Z, Kinnunen T, Chng E, Li H, Ambikairajah E (2012) A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In: Proceedings of Asia-Pacific signal information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1–5Google Scholar
  121. 121.
    Alegre F, Vipperla R, Evans,N (2012) Spoofing countermeasures for the protection of automatic speaker recognition systems against attacks with artificial signals. In: Proceedings of InterspeechGoogle Scholar
  122. 122.
    Alegre F, Amehraye A, Evans N (2013) Spoofing countermeasures to protect automatic speaker verification from voice conversion. In: Proceedings of ICASSPGoogle Scholar
  123. 123.
    Wu Z, Xiao X, Chng E, Li H (2013) Synthetic speech detection using temporal modulation feature. In: Proceedings of ICASSPGoogle Scholar
  124. 124.
    Alegre F, Vipperla R, Amehraye A, Evans N (2013) A new speaker verification spoofing countermeasure based on local binary patterns. In: Proceedings of Interspeech, Lyon, FranceGoogle Scholar
  125. 125.
    Wu Z, Kinnunen T, Evans N, Yamagishi J, Hanilçi C, Sahidullah M, Sizov A (2015) ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In: Proceedings of InterspeechGoogle Scholar
  126. 126.
    Kinnunen T, Sahidullah M, Delgado H, Todisco M, Evans N, Yamagishi J, Lee K (2017) The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection. In: INTERSPEECHGoogle Scholar
  127. 127.
    Wu Z, Khodabakhsh A, Demiroglu C, Yamagishi J, Saito D, Toda T, King S (2015) SAS: a speaker verification spoofing database containing diverse attacks. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)Google Scholar
  128. 128.
    Wu Z, Kinnunen T, Evans N, Yamagishi J (2014) ASVspoof 2015: automatic speaker verification spoofing and countermeasures challenge evaluation plan. http://www.spoofingchallenge.org/asvSpoof.pdf
  129. 129.
    Patel T, Patil H (2015) Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech. In: Proceedings of InterspeechGoogle Scholar
  130. 130.
    Novoselov S, Kozlov A, Lavrentyeva G, Simonchik K, Shchemelinin V (2016) STC anti-spoofing systems for the ASVspoof 2015 challenge. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 5475–5479Google Scholar
  131. 131.
    Chen N, Qian Y, Dinkel H, Chen B, Yu K (2015) Robust deep feature for spoofing detection-the SJTU system for ASVspoof 2015 challenge. In: Proceedings of InterspeechGoogle Scholar
  132. 132.
    Xiao X, Tian X, Du S, Xu H, Chng E, Li H (2015) Spoofing speech detection using high dimensional magnitude and phase features: the NTU approach for ASVspoof 2015 challenge. In: Proceedings of InterspeechGoogle Scholar
  133. 133.
    Alam M, Kenny P, Bhattacharya G, Stafylakis T (2015) Development of CRIM system for the automatic speaker verification spoofing and countermeasures challenge 2015. In: Proceedings of InterspeechGoogle Scholar
  134. 134.
    Wu Z, Yamagishi J, Kinnunen T, Hanilçi C, Sahidullah M, Sizov A, Evans N, Todisco M, Delgado H (2017) Asvspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE J Sel Top Signal Process 11(4):588–604CrossRefGoogle Scholar
  135. 135.
    Delgado H, Todisco M, Sahidullah M, Evans N, Kinnunen T, Lee K, Yamagishi J (2018) ASVspoof 2017 version 2.0: meta-data analysis and baseline enhancements. In: Proceedings of Odyssey 2018 the speaker and language recognition workshop, pp 296–303Google Scholar
  136. 136.
    Todisco M, Delgado H, Evans N (2016) A new feature for automatic speaker verification anti-spoofing: constant Q cepstral coefficients. In: Proceedings of Odyssey: the speaker and language recognition workshop, Bilbao, Spain, pp 283–290Google Scholar
  137. 137.
    Todisco M, Delgado H, Evans N (2017) Constant Q cepstral coefficients: a spoofing countermeasure for automatic speaker verification. Comput Speech Lang 45:516–535CrossRefGoogle Scholar
  138. 138.
    Lavrentyeva G, Novoselov S, Malykh E, Kozlov A, Kudashev O, Shchemelinin V (2017) Audio replay attack detection with deep learning frameworks. In: Proceedings of Interspeech, pp 82–86Google Scholar
  139. 139.
    Ji Z, Li Z, Li P, An M, Gao S, Wu D, Zhao F (2017) Ensemble learning for countermeasure of audio replay spoofing attack in ASVspoof2017. In: Proceedings of Interspeech, pp 87–91Google Scholar
  140. 140.
    Li L, Chen Y, Wang D, Zheng T (2017) A study on replay attack and anti-spoofing for automatic speaker verification. In: Proceedings of Interspeech, pp 92–96Google Scholar
  141. 141.
    Patil H, Kamble M, Patel T, Soni M (2017) Novel variable length teager energy separation based instantaneous frequency features for replay detection. In: Proceedings of Interspeech, pp 12–16Google Scholar
  142. 142.
    Chen Z, Xie Z, Zhang W, Xu X (2017) ResNet and model fusion for automatic spoofing detection. In: Proceedings of Interspeech, pp 102–106Google Scholar
  143. 143.
    Wu Z, Gao S, Cling E, Li H (2014) A study on replay attack and anti-spoofing for text-dependent speaker verification. In: Proceedings of Asia-Pacific signal information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1–5Google Scholar
  144. 144.
    Li Q (2009) An auditory-based transform for audio signal processing. In: 2009 IEEE workshop on applications of signal processing to audio and acoustics. IEEE, pp 181–184Google Scholar
  145. 145.
    Davis S, Mermelstein P (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 28(4):357–366CrossRefGoogle Scholar
  146. 146.
    Sahidullah M, Kinnunen T, Hanilçi C (2015) A comparison of features for synthetic speech detection. In: Proceedings of Interspeech. ISCA, pp 2087–2091Google Scholar
  147. 147.
    Brown J (1991) Calculation of a constant Q spectral transform. J Acoust Soc Am 89(1):425–434CrossRefGoogle Scholar
  148. 148.
    Alam M, Kenny P (2017) Spoofing detection employing infinite impulse response—constant Q transform-based feature representations. In: Proceedings of European signal processing conference (EUSIPCO)Google Scholar
  149. 149.
    Cancela P, Rocamora M, López E (2009) An efficient multi-resolution spectral transform for music analysis. In: Proceedings of international society for music information retrieval conference, pp 309–314Google Scholar
  150. 150.
    Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127MathSciNetzbMATHCrossRefGoogle Scholar
  151. 151.
    Goodfellow I, Bengio Y, Courville A, Bengio Y (2016) Deep learning. MIT Press, CambridgezbMATHGoogle Scholar
  152. 152.
    Tian Y, Cai M, He L, Liu J (2015) Investigation of bottleneck features and multilingual deep neural networks for speaker verification. In: Proceedings of Interspeech, pp 1151–1155Google Scholar
  153. 153.
    Richardson F, Reynolds D, Dehak N (2015) Deep neural network approaches to speaker and language recognition. IEEE Signal Process Lett 22(10):1671–1675CrossRefGoogle Scholar
  154. 154.
    Hinton G, Deng L, Yu D, Dahl GE, Mohamed RA, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN, Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97CrossRefGoogle Scholar
  155. 155.
    Alam M, Kenny P, Gupta V, Stafylakis T (2016) Spoofing detection on the ASVspoof2015 challenge corpus employing deep neural networks. In: Proceedings of Odyssey: the Speaker and Language Recognition Workshop, Bilbao, Spain, pp 270–276Google Scholar
  156. 156.
    Qian Y, Chen N, Yu K (2016) Deep features for automatic spoofing detection. Speech Commun 85:43–52CrossRefGoogle Scholar
  157. 157.
    Yu H, Tan ZH, Zhang Y, Ma Z, Guo J (2017) DNN filter bank cepstral coefficients for spoofing detection. IEEE Access 5:4779–4787CrossRefGoogle Scholar
  158. 158.
    Sriskandaraja K, Sethu V, Ambikairajah E, Li H (2017) Front-end for antispoofing countermeasures in speaker verification: scattering spectral decomposition. IEEE J Sel Top Signal Process 11(4):632–643.  https://doi.org/10.1109/JSTSP.2016.2647202CrossRefGoogle Scholar
  159. 159.
    Andén J, Mallat S (2014) Deep scattering spectrum. IEEE Trans Signal Process 62(16):4114–4128MathSciNetzbMATHCrossRefGoogle Scholar
  160. 160.
    Mallat S (2012) Group invariant scattering. Commun Pure Appl Math 65:1331–1398MathSciNetzbMATHCrossRefGoogle Scholar
  161. 161.
    Pal M, Paul D, Saha G (2018) Synthetic speech detection using fundamental frequency variation and spectral features. Comput Speech Lang 48:31–50CrossRefGoogle Scholar
  162. 162.
    Laskowski K, Heldner M, Edlund J (2008) The fundamental frequency variation spectrum. Proc FONETIK 2008:29–32Google Scholar
  163. 163.
    Saratxaga I, Sanchez J, Wu Z, Hernaez I, Navas E (2016) Synthetic speech detection using phase information. Speech Commun 81:30–41CrossRefGoogle Scholar
  164. 164.
    Wang L, Nakagawa S, Zhang Z, Yoshida Y, Kawakami Y (2017) Spoofing speech detection using modified relative phase information. IEEE J Sel Top Signal Process 11(4):660–670CrossRefGoogle Scholar
  165. 165.
    Chakroborty S, Saha G (2009) Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter. Int J Signal Process 5(1):11–19Google Scholar
  166. 166.
    Wu X, He R, Sun Z, Tan T (2018) A light CNN for deep face representation with noisy labels. IEEE Trans Inf Forensics Secur 13(11):2884–2896CrossRefGoogle Scholar
  167. 167.
    Goncalves AR, Violato RPV, Korshunov P, Marcel S, Simoes FO (2017) On the generalization of fused systems in voice presentation attack detection. In: 2017 International conference of the biometrics special interest group (BIOSIG), pp 1–5.  https://doi.org/10.23919/BIOSIG.2017.8053516
  168. 168.
    Paul D, Pal M, Saha G (2016) Novel speech features for improved detection of spoofing attacks. In: Proceedings of annual IEEE India conference (INDICON)Google Scholar
  169. 169.
    Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798CrossRefGoogle Scholar
  170. 170.
    Khoury E, Kinnunen T, Sizov A, Wu Z, Marcel S (2014) Introducing i-vectors for joint anti-spoofing and speaker verification. In: Proceedings of InterspeechGoogle Scholar
  171. 171.
    Sizov A, Khoury E, Kinnunen T, Wu Z, Marcel S (2015) Joint speaker verification and antispoofing in the i-vector space. IEEE Trans Inf Forensics Secur 10(4):821–832CrossRefGoogle Scholar
  172. 172.
    Hanilçi C (2018) Data selection for i-vector based automatic speaker verification anti-spoofing. Digit Signal Process 72:171–180CrossRefGoogle Scholar
  173. 173.
    Tian X, Wu Z, Xiao X, Chng E, Li H (2016) Spoofing detection from a feature representation perspective. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 2119–2123Google Scholar
  174. 174.
    Yu H, Tan ZH, Ma Z, Martin R, Guo J (2018) Spoofing detection in automatic speaker verification systems using dnn classifiers and dynamic acoustic features. IEEE Trans Neural Netw Learn Syst PP(99):1–12Google Scholar
  175. 175.
    Dinkel H, Chen N, Qian Y, Yu K (2017) End-to-end spoofing detection with raw waveform cldnns. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 4860–4864Google Scholar
  176. 176.
    Sainath T, Weiss R, Senior A, Wilson K, Vinyals O (2015) Learning the speech front-end with raw waveform CLDNNs. In: Proceedigns of InterspeechGoogle Scholar
  177. 177.
    Zhang C, Yu C, Hansen JHL (2017) An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J Sel Top Signal Process 11(4):684–694CrossRefGoogle Scholar
  178. 178.
    Muckenhirn H, Magimai-Doss M, Marcel S (2017) End-to-end convolutional neural network-based voice presentation attack detection. In: 2017 IEEE international joint conference on biometrics (IJCB), pp 335–341Google Scholar
  179. 179.
    Chen S, Ren K, Piao S, Wang C, Wang Q, Weng J, Su L, Mohaisen A (2017) You can hear but you cannot steal: Defending against voice impersonation attacks on smartphones. In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS). IEEE, pp 183–195Google Scholar
  180. 180.
    Shiota S, Villavicencio F, Yamagishi J, Ono N, Echizen I, Matsui T (2015) Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification. In: Proceedings of InterspeechGoogle Scholar
  181. 181.
    Shiota S, Villavicencio F, Yamagishi J, Ono N, Echizen I, Matsui T (2016) Voice liveness detection for speaker verification based on a tandem single/double-channel pop noise detector. In: ODYSSEYGoogle Scholar
  182. 182.
    Sahidullah M, Thomsen D, Hautamäki R, Kinnunen T, Tan ZH, Parts R, Pitkänen M (2018) Robust voice liveness detection and speaker verification using throat microphones. IEEE/ACM Trans Audio Speech Lang Process 26(1):44–56CrossRefGoogle Scholar
  183. 183.
    Elko G, Meyer J, Backer S, Peissig J (2007) Electronic pop protection for microphones. In: 2007 IEEE workshop on applications of signal processing to audio and acoustics. IEEE, pp 46–49Google Scholar
  184. 184.
    Zhang L, Tan S, Yang J, Chen Y (2016) Voicelive: a phoneme localization based liveness detection for voice authentication on smartphones. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. ACM, pp 1080–1091Google Scholar
  185. 185.
    Zhang L, Tan S, Yang J (2017) Hearing your voice is not enough: An articulatory gesture based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. ACM, pp 57–71Google Scholar
  186. 186.
    Hanilçi C, Kinnunen T, Sahidullah M, Sizov A (2016) Spoofing detection goes noisy: an analysis of synthetic speech detection in the presence of additive noise. Speech Commun 85:83–97CrossRefGoogle Scholar
  187. 187.
    Yu H, Sarkar A, Thomsen D, Tan ZH, Ma Z, Guo J (2016) Effect of multi-condition training and speech enhancement methods on spoofing detection. In: Proceedings of international workshop on sensing, processing and learning for intelligent machines (SPLINE)Google Scholar
  188. 188.
    Tian X, Wu Z, Xiao X, Chng E, Li H (2016) An investigation of spoofing speech detection under additive noise and reverberant conditions. In: Proceedings of Interspeech (2016)Google Scholar
  189. 189.
    Delgado H, Todisco M, Evans N, Sahidullah M, Liu W, Alegre F, Kinnunen T, Fauve B (2017) Impact of bandwidth and channel variation on presentation attack detection for speaker verification. In: 2017 International conference of the biometrics special interest group (BIOSIG), pp 1–6Google Scholar
  190. 190.
    Qian Y, Chen N, Dinkel H, Wu Z (2017) Deep feature engineering for noise robust spoofing detection. IEEE/ACM Trans Audio Speech Lang Process 25(10):1942–1955CrossRefGoogle Scholar
  191. 191.
    Korshunov P, Marcel S (2016) Cross-database evaluation of audio-based spoofing detection systems. In: Proceedings of InterspeechGoogle Scholar
  192. 192.
    Paul D, Sahidullah M, Saha G (2017) Generalization of spoofing countermeasures: a case study with ASVspoof 2015 and BTAS 2016 corpora. In: Proceedigns of IEEE international conference on acoustics, speech, and signal processing (ICASSP). IEEE pp 2047–2051Google Scholar
  193. 193.
    Lorenzo-Trueba J, Fang F, Wang X, Echizen I, Yamagishi J, Kinnunen T (2018) Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data. In: Proceedings of Odyssey: the speaker and language recognition workshopGoogle Scholar
  194. 194.
    Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680Google Scholar
  195. 195.
    Kreuk F, Adi Y, Cisse M, Keshet J (2018) Fooling end-to-end speaker verification by adversarial examples. arXiv:1801.03339
  196. 196.
    Sahidullah M, Delgado H, Todisco M, Yu H, Kinnunen T, Evans N, Tan ZH (2016) Integrated spoofing countermeasures and automatic speaker verification: an evaluation on ASVspoof 2015. In: Proceedings of InterspeechGoogle Scholar
  197. 197.
    Muckenhirn H, Korshunov P, Magimai-Doss M, Marcel S (2017) Long-term spectral statistics for voice presentation attack detection. IEEE/ACM Trans Audio Speech Lang Process 25(11):2098–2111CrossRefGoogle Scholar
  198. 198.
    Sarkar A, Sahidullah M, Tan ZH, Kinnunen T (2017) Improving speaker verification performance in presence of spoofing attacks using out-of-domain spoofed data. In: Proceedings of InterspeechGoogle Scholar
  199. 199.
    Kinnunen T, Lee K, Delgado H, Evans N, Todisco M, Sahidullah M, Yamagishi J, Reynolds D (2018) t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. In: Proceedings of Odyssey: the speaker and language recognition workshopGoogle Scholar
  200. 200.
    Todisco M, Delgado H, Lee K, Sahidullah M, Evans N, Kinnunen T, Yamagishi J (2018) Integrated presentation attack detection and automatic speaker verification: common features and Gaussian back-end fusion. In: Proceedings of InterspeechGoogle Scholar
  201. 201.
    Wu Z, De Leon P, Demiroglu C, Khodabakhsh A, King S, Ling ZH, Saito D, Stewart B, Toda T, Wester M, Yamagishi Y (2016) Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance. IEEE/ACM Trans Audio Speech Lang Process 24(4):768–783CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Md Sahidullah
    • 1
    Email author
  • Héctor Delgado
    • 2
  • Massimiliano Todisco
    • 2
  • Tomi Kinnunen
    • 1
  • Nicholas Evans
    • 2
  • Junichi Yamagishi
    • 3
    • 4
  • Kong-Aik Lee
    • 5
  1. 1.School of ComputingUniversity of Eastern FinlandKuopioFinland
  2. 2.Department of Digital SecurityEURECOMBiot Sophia AntipolisFrance
  3. 3.National Institute of InformaticsTokyoJapan
  4. 4.University of EdinburghEdinburghScotland
  5. 5.Data Science Research LaboratoriesNEC Corporation (Japan)TokyoJapan

Personalised recommendations