GRU-SVM Model for Synthetic Speech Detection

Conference paper
Digital Forensics and Watermarking (IWDW 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12022)


Abstract

Voice conversion and speech synthesis techniques pose a threat to current automatic speaker verification systems. To prevent such spoofing attacks, choosing an appropriate classifier that can learn relevant information from speech features is therefore an important issue. In this paper, a GRU-SVM model for synthetic speech detection is proposed. A Gated Recurrent Unit (GRU) neural network learns the features; the GRU overcomes the vanishing- and exploding-gradient problems that traditional Recurrent Neural Networks (RNNs) suffer from when learning temporal dependencies. A Support Vector Machine (SVM) performs regression ahead of the softmax layer used for classification, and this SVM regression yields excellent classification ability and gradient-descent behavior. Through extensive verification and analysis, we also identify the best-performing speech feature extraction method and use it to train the classifier. Experimental results show that the proposed GRU-SVM model attains higher prediction accuracy, achieving an average detection rate of 99.63% on our development database. In addition, the proposed method effectively improves the learning ability of the model.
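The abstract does not give implementation details, so the sketch below shows only one plausible way to realize a GRU-SVM classifier: a GRU encoder runs over per-frame speech features, and its final hidden state feeds a linear output trained with a squared hinge loss so that it behaves as a soft-margin SVM. The layer sizes, the 40-dimensional feature assumption, and the training setup are hypothetical, not the authors' configuration.

```python
# Minimal GRU-SVM sketch in PyTorch (hypothetical sizes; the paper's exact
# architecture, features, and hyperparameters are not given in the abstract).
import torch
import torch.nn as nn

class GRUSVM(nn.Module):
    def __init__(self, n_features=40, hidden_size=128):
        super().__init__()
        # GRU learns temporal dependencies over the speech feature frames.
        self.gru = nn.GRU(n_features, hidden_size, batch_first=True)
        # Linear output acts as the SVM decision function f(x) = w·x + b.
        self.svm = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, frames, n_features)
        _, h = self.gru(x)             # h: (1, batch, hidden_size)
        return self.svm(h.squeeze(0))  # one raw margin score per utterance

def svm_loss(scores, labels, model, c=1e-3):
    # Squared hinge loss plus an L2 penalty on the SVM weights gives a
    # soft-margin SVM objective; labels are +1 (genuine) or -1 (synthetic).
    hinge = torch.clamp(1 - labels * scores.squeeze(1), min=0).pow(2).mean()
    return hinge + c * model.svm.weight.pow(2).sum()

# Usage: a batch of 8 utterances, each 200 frames of 40-dim features.
model = GRUSVM()
x = torch.randn(8, 200, 40)
y = torch.tensor([1., -1., 1., 1., -1., -1., 1., -1.])
loss = svm_loss(model(x), y, model)
loss.backward()  # gradients flow through both the SVM layer and the GRU
```

Training on the hinge loss rather than cross-entropy is what makes the output layer act as an SVM; at test time the sign of the score separates genuine from synthetic speech.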



Acknowledgement

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61972269 and 61902263, by the Fundamental Research Funds for the Central Universities under Grant No. YJ201881, and by the Doctoral Innovation Fund Program of Southwest Jiaotong University under Grant No. DCX201824.

Author information


Correspondence to Hongxia Wang.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, T., Wang, H., Chen, Y., He, P. (2020). GRU-SVM Model for Synthetic Speech Detection. In: Wang, H., Zhao, X., Shi, Y., Kim, H., Piva, A. (eds) Digital Forensics and Watermarking. IWDW 2019. Lecture Notes in Computer Science, vol. 12022. Springer, Cham. https://doi.org/10.1007/978-3-030-43575-2_9


  • DOI: https://doi.org/10.1007/978-3-030-43575-2_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43574-5

  • Online ISBN: 978-3-030-43575-2

  • eBook Packages: Computer Science, Computer Science (R0)
