Abstract
Current state-of-the-art automatic speaker verification (ASV) systems are prone to spoofing. Their security and reliability can be threatened by different types of spoofing attacks that use voice conversion, synthetic speech, or a recorded passphrase. It is therefore essential to develop countermeasures that can detect such spoofed speech. Inspired by the success of deep learning approaches in various classification tasks, this work presents an in-depth study of convolutional neural networks (CNNs) for spoofing detection in ASV systems. Specifically, we compare three CNN architectures for developing spoofing countermeasures: AlexNet, CNNs with max-feature-map activation, and an ensemble of standard CNNs. We also discuss their potential to avoid overfitting given the small amount of training data usually available for this task. We used popular deep learning toolkits for the system implementation and have publicly released the implementation code of our methods. We evaluated the proposed countermeasure systems for detecting replay attacks on the recently released ASVspoof 2017 spoofing corpus, and we provide in-depth visual analyses of the CNNs to aid future research in this area.
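The max-feature-map (MFM) activation named in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which the channels of a feature tensor are split into two halves and the element-wise maximum of the halves is kept, so the channel count is halved. The function name `max_feature_map`, the NumPy implementation, and the (batch, channels, height, width) layout are assumptions made for illustration only and are not taken from the chapter's released code.

```python
import numpy as np


def max_feature_map(x: np.ndarray) -> np.ndarray:
    """Apply an MFM activation along the channel axis.

    x is assumed to have shape (batch, channels, height, width)
    with an even number of channels (illustrative layout).
    """
    if x.shape[1] % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    # Split the channels into two equally sized groups.
    first_half, second_half = np.split(x, 2, axis=1)
    # Keep the larger response at every position; output has channels / 2.
    return np.maximum(first_half, second_half)


# Toy input: 1 sample, 4 channels, 2x2 feature maps.
features = np.arange(16, dtype=np.float32).reshape(1, 4, 2, 2)
features[:, :2] *= -1.0  # make the first two channels non-positive
activated = max_feature_map(features)
```

In this toy input the last two channels dominate the first two at every position, so `activated` has two channels and equals `features[:, 2:]`. Unlike a ReLU, the operation selects between competing feature maps instead of thresholding, which also halves the number of channels passed to the next layer.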
Acknowledgements
Computational (and/or data visualization) resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia. This project was supported in part by an Australian Research Council Linkage grant LP 130100110.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Himawan, I., Madikeri, S., Motlicek, P., Cernak, M., Sridharan, S., Fookes, C. (2019). Voice Presentation Attack Detection Using Convolutional Neural Networks. In: Marcel, S., Nixon, M., Fierrez, J., Evans, N. (eds) Handbook of Biometric Anti-Spoofing. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-92627-8_17
DOI: https://doi.org/10.1007/978-3-319-92627-8_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92626-1
Online ISBN: 978-3-319-92627-8
eBook Packages: Computer Science, Computer Science (R0)