Voice Presentation Attack Detection Using Convolutional Neural Networks

  • Ivan Himawan
  • Srikanth Madikeri
  • Petr Motlicek
  • Milos Cernak
  • Sridha Sridharan
  • Clinton Fookes
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Current state-of-the-art automatic speaker verification (ASV) systems are prone to spoofing. Their security and reliability can be threatened by different types of spoofing attacks that use voice conversion, synthetic speech, or a recorded passphrase. It is therefore essential to develop countermeasures that can detect such spoofed speech. Inspired by the success of deep learning approaches in various classification tasks, this work presents an in-depth study of convolutional neural networks (CNNs) for spoofing detection in ASV systems. Specifically, we compare three CNN architectures for developing spoofing countermeasures: AlexNet, CNNs with max-feature-map activation, and an ensemble of standard CNNs. We also discuss their potential to avoid overfitting given the small amount of training data that is usually available for this task. We used popular deep learning toolkits for the system implementation and have released the implementation code of our methods publicly. We evaluate the proposed countermeasure systems for detecting replay attacks on the recently released ASVspoof 2017 spoofing corpus, and provide in-depth visual analyses of the CNNs to aid future research in this area.
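The abstract names max-feature-map (MFM) activation without describing it. As commonly defined, MFM splits the channels of a convolutional output into two halves and keeps the element-wise maximum, which halves the channel count. The sketch below illustrates that operation in plain Python; the function name `max_feature_map` and the nested-list layout (channels, rows, columns) are assumptions made for illustration and are not taken from the authors' released code.

```python
from typing import List

# One channel of a feature map: a list of rows, each a list of floats.
FeatureMap = List[List[float]]


def max_feature_map(channels: List[FeatureMap]) -> List[FeatureMap]:
    """Apply max-feature-map (MFM) activation.

    The channels are split into a first and second half, and each output
    channel is the element-wise maximum of the paired input channels.
    The output therefore has half as many channels as the input.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    first, second = channels[:half], channels[half:]
    return [
        [
            [max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(ch_a, ch_b)
        ]
        for ch_a, ch_b in zip(first, second)
    ]


if __name__ == "__main__":
    # Four 1x2 channels become two 1x2 channels.
    example = [[[1.0, -2.0]], [[0.5, 3.0]], [[0.0, 4.0]], [[2.0, -1.0]]]
    print(max_feature_map(example))
```

In a deep learning toolkit the same operation would normally be written as a tensor split along the channel axis followed by an element-wise maximum; the list-based version here only makes the pairing of channels explicit.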



Computational (and/or data visualization) resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia. This project was supported in part by an Australian Research Council Linkage grant LP 130100110.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ivan Himawan (1, corresponding author)
  • Srikanth Madikeri (2)
  • Petr Motlicek (2)
  • Milos Cernak (3)
  • Sridha Sridharan (1)
  • Clinton Fookes (1)

  1. Queensland University of Technology, Brisbane, Australia
  2. Idiap Research Institute, Martigny, Switzerland
  3. Logitech, Lausanne, Switzerland
