Automatic no-reference speech quality assessment with convolutional neural networks

Abstract

This paper presents a convolutional neural network model for automatic speech quality assessment. It is a no-reference method that applies convolutional layers as feature extractors over a visual representation of the speech signal built from Mel-Frequency Cepstral Coefficients. Its performance is evaluated against the PESQ, ViSQOL and P.563 methodologies. The experiments were conducted on publicly available databases and on an additional database built to evaluate our model in the presence of background noise. The results are analyzed by means of correlation measures and statistical descriptions. From four experiments, we conclude that: (1) our model achieved high overall generalization, even when trained with a limited number of samples; (2) it characterized speech and background sound even on databases with complex degradations; and (3) the proposed model tends to assign high scores to clean speech and low scores to noise-only samples, as expected.
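The abstract describes a pipeline in which an MFCC-based representation of the speech signal is fed to convolutional layers. As a rough, numpy-only sketch of the MFCC extraction step (this is not the authors' implementation; the 8 kHz sample rate, 256-sample frames, 128-sample hop, 26 mel bands and 13 coefficients are all illustrative assumptions):

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_mels=26, n_ceps=13):
    """Compute a basic MFCC matrix (frames x coefficients) from a mono signal."""
    # Slice the signal into overlapping frames and apply a Hann window
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hanning(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    hz_pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # Log mel energies, then a DCT to decorrelate -> cepstral coefficients
    log_mel = np.log(power @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Example: one second of a 440 Hz tone sampled at 8 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
feats = mfcc(sig)
print(feats.shape)
```

The resulting frames-by-coefficients matrix can be treated as a single-channel image, which is the kind of visual representation a convolutional feature extractor operates on.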

References

1. Uddin Z, Nilsson EG (2020) Emotion recognition using speech and neural structured learning to facilitate edge intelligence. Eng Appl Artif Intell 94:2–11

2. Grozdic DT, Jovicic ST, Subotic M (2017) Whispered speech recognition using deep denoising autoencoder. Eng Appl Artif Intell 59:15–22

3. Orozco-Arroyave J et al (2018) NeuroSpeech: an open-source software for Parkinson's speech analysis. Digit Signal Process 77:207–221

4. Braga D, Madureira A, Coelho L, Ajith R (2019) Automatic detection of Parkinson's disease based on acoustic analysis of speech. Eng Appl Artif Intell 77:148–158

5. Furundzic D, Stankovic S, Jovicic S, Punisic S, Subotic M (2017) Distance based resampling of imbalanced classes: with an application example of speech quality assessment. Eng Appl Artif Intell 64:440–461

6. Almeida FL, Rosa RL, Rodriguez DZ (2018) Voice quality assessment in communication services using deep learning. International Symposium on Wireless Communication Systems, 1–6

7. Soni MH, Patil HA (2016) Novel deep autoencoder features for non-intrusive speech quality assessment. European Signal Processing Conference (EUSIPCO), 2315–2319

8. Affonso E, Rosa R, Rodriguez DZ (2017) Speech quality assessment over lossy transmission channels using deep belief networks. IEEE Signal Process Lett 25(1)

9. Fu SW, Tsao Y, Hwang HT, Wang HM (2018) Quality-Net: an end-to-end non-intrusive speech quality assessment model based on BLSTM. Interspeech, 1873–1877

10. Avila AR, Gamper H, Reddy C, Cutler R, Tashev I, Gehrke J (2019) Non-intrusive speech quality assessment using neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing, 631–635

11. Lo CC, Fu SW, Huang WC, Wang X, Yamagishi J, Tsao Y, Wang HM (2019) MOSNet: deep learning-based objective assessment for voice conversion. Interspeech

12. ITU-T (2001) Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Recommendation P.862

13. ITU-T (1998) Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. Recommendation P.861

14. ITU-T (2017) Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. Recommendation P.862.2

15. ITU-T (2011) Perceptual objective listening quality assessment. Recommendation P.863

16. ITU-T (2013) Perceptual objective listening quality prediction. Recommendation P.863

17. Toral-Cruz H, Argaez-Xool J, Estrada-Vargas L, Torres-Roman D (2011) An introduction to VoIP: end-to-end elements and QoS parameters. InTech

18. Hines A, Skoglund J, Kokaram A, Harte N (2015) ViSQOL: an objective speech quality model. EURASIP J Audio Speech Music Process 13:1–18

19. ITU-T (2004) Single-ended method for objective speech quality assessment in narrow-band telephony applications. Recommendation P.563

20. Kim DS (2005) ANIQUE: an auditory model for single-ended speech quality estimation. IEEE Trans Speech Audio Process 13:821–831

21. Abdel-Hamid O, Mohamed AR, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio Speech Lang Process 22:1533–1545

22. Park S, Lee J (2017) A fully convolutional neural network for speech enhancement. Interspeech, 1993–1997

23. Andersen A, Haan J, Tan ZH, Jensen J (2018) Non-intrusive speech intelligibility prediction using convolutional neural networks. IEEE/ACM Trans Audio Speech Lang Process 26(10):1925–1939

24. Voice Conversion Challenge [homepage on the Internet]. Available from: http://www.vc-challenge.org/

25. ITU-T, ITU-T coded-speech database. Recommendation P.Sup23

26. McLoughlin I (2009) Applied speech and audio processing with MATLAB examples. Cambridge University Press, Cambridge

27. Dubey RK, Kumar A (2013) Non-intrusive objective speech quality assessment using a combination of MFCC, PLP and LSF features. International Conference on Signal Processing and Communication, 297–302

28. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 448–456

29. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958

30. Harte N, Gillen E, Hines A (2015) TCD-VoIP: a research database of degraded speech for assessing quality in VoIP applications. International Workshop on Quality of Multimedia Experience

31. ITU-T, Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2. Recommendation P.862.3

32. Barras B. SoX: Sound eXchange [homepage on the Internet]. Available from: http://sox.sourceforge.net/

33. Upadhyay N, Karmakar A (2015) Speech enhancement using spectral subtraction-type algorithms: a comparison and simulation study. Procedia Comput Sci 54:574–584

34. Hirsch HG. FaNT - filtering and noise adding tool [homepage on the Internet]. Available from: https://github.com/i3thuan5/FaNT

35. Hu Y, Loizou P (2008) Evaluation of objective quality measures for speech enhancement. IEEE Trans Audio Speech Lang Process 16:229–238

36. ETSI, Speech and multimedia transmission quality (STQ); speech quality performance in the presence of background noise; part 3: background noise transmission - objective test methods. ETSI EG 202 396-3

37. Beerends J et al (2020) Subjective and objective assessment of full bandwidth speech quality. IEEE Trans Audio Speech Lang Process 28:440–449

Acknowledgements

The authors would like to thank the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research.

Author information

Corresponding author

Correspondence to Carlos A. B. Mello.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Albuquerque, R.Q., Mello, C.A.B. Automatic no-reference speech quality assessment with convolutional neural networks. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-05767-4

Keywords

  • Speech quality assessment
  • Convolutional neural networks
  • Mel-frequency cepstral coefficient