GRU-SVM Model for Synthetic Speech Detection

  • Ting Huang
  • Hongxia Wang
  • Yi Chen
  • Peisong He
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12022)


Voice conversion and speech synthesis techniques pose a threat to current automatic speaker verification systems. To counter such spoofing attacks, it is important to choose a classifier that can learn the relevant information from speech features. In this paper, a GRU-SVM model for synthetic speech detection is proposed. A Gate Recurrent Unit (GRU) neural network is used to learn the features. The GRU mitigates the vanishing and exploding gradient problems that traditional Recurrent Neural Networks (RNNs) encounter when learning temporal dependencies. A Support Vector Machine (SVM) performs regression before the softmax layer used for classification. With the SVM regression, the model performs well in terms of both classification ability and gradient descent behavior. Through extensive verification and analysis, we also identify the best-performing speech feature extraction method and use it to train the classifier. Experimental results show that the proposed GRU-SVM model achieves higher prediction accuracy on the data sets, with an average detection rate of 99.63% on our development database. In addition, the proposed method effectively improves the learning ability of the model.
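As a rough illustration of the pipeline described above, the following Python sketch runs a GRU over a sequence of feature frames, passes the final hidden state to a linear SVM-style layer scored with a squared hinge loss, and applies softmax to the resulting scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the layer sizes, the random weights, the 13-dimensional frames, the choice of a squared hinge loss, and all function names are hypothetical, and no training is performed.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Matrix-vector product on plain nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(W, U, b, x, h, act):
    # act(W x + U h + b), applied element-wise.
    return [act(a + c + d) for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

def gru_step(x, h, p):
    # One GRU step: update gate z, reset gate r, candidate state c.
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sigmoid)
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sigmoid)
    rh = [ri * hi for ri, hi in zip(r, h)]
    c = gate(p["Wc"], p["Uc"], p["bc"], x, rh, math.tanh)
    # Convention used here: h_t = (1 - z) * h_{t-1} + z * c.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def init_params(n_in, n_hid, rng):
    # Small random weights; sizes are assumptions for illustration only.
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for g in "zrc":
        p["W" + g] = mat(n_hid, n_in)
        p["U" + g] = mat(n_hid, n_hid)
        p["b" + g] = [0.0] * n_hid
    return p

def gru_encode(frames, p, n_hid):
    # Run the GRU over all frames and return the final hidden state.
    h = [0.0] * n_hid
    for x in frames:
        h = gru_step(x, h, p)
    return h

def svm_scores(h, W, b):
    # Linear SVM-style layer: one score per class.
    return [s + bi for s, bi in zip(matvec(W, h), b)]

def squared_hinge_loss(scores, label):
    # One-vs-rest targets: +1 for the true class, -1 for the others.
    return sum(max(0.0, 1.0 - (1.0 if k == label else -1.0) * s) ** 2
               for k, s in enumerate(scores))

def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]

rng = random.Random(0)
n_in, n_hid, n_cls = 13, 8, 2   # hypothetical: 13-dim frames, 2 classes
params = init_params(n_in, n_hid, rng)
W_svm = [[rng.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_cls)]
b_svm = [0.0] * n_cls

# Stand-in for real speech features: 20 random frames.
frames = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(20)]
h = gru_encode(frames, params, n_hid)
scores = svm_scores(h, W_svm, b_svm)
probs = softmax(scores)
loss = squared_hinge_loss(scores, label=1)
print(probs, loss)
```

In a trained model, the hinge loss would be minimized by gradient descent over the GRU and SVM weights; here the weights stay random, so the printed probabilities carry no meaning beyond showing the data flow.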


Keywords: Synthetic speech detection, Gate Recurrent Unit, Support Vector Machines



This work is supported by the National Natural Science Foundation of China (NSFC) under Grants 61972269 and 61902263, the Fundamental Research Funds for the Central Universities under Grant No. YJ201881, and the Doctoral Innovation Fund Program of Southwest Jiaotong University under Grant No. DCX201824.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, People’s Republic of China
  2. College of Cybersecurity, Sichuan University, Chengdu, People’s Republic of China
