In this paper, we present a convolutional neural network model for automatic speech quality assessment. It is a no-reference methodology in which convolutional layers act as feature extractors over a visual representation of the speech signal built from its Mel-Frequency Cepstral Coefficients. We evaluate its performance by comparison with the PESQ, ViSQOL and P.563 methodologies. The experiments were conducted on publicly available databases and on an additional database that we built to evaluate our model in the presence of background noise. The results are analyzed by means of correlation measures and statistical descriptions. From four experiments, we conclude that: (1) our model achieved high overall generalization, even when trained with a limited number of samples; (2) it characterized speech and background sound even for databases containing complex degradations; and (3) it tends to assign high scores to clean speech and low scores to samples containing only noise, as expected.
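As an illustration of the correlation measures mentioned above, the following minimal sketch computes Pearson and Spearman correlations between predicted quality scores and subjective scores. It is not the authors' evaluation code: the function names (`pearson`, `ranks`, `spearman`) and the sample score lists are hypothetical, and the abstract does not state which correlation measures were actually used.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores, for illustration only (not data from the paper).
subjective_scores = [1.2, 2.5, 3.1, 4.0, 4.6]
predicted_scores = [1.0, 2.9, 2.8, 4.2, 4.5]

print("Pearson: ", round(pearson(subjective_scores, predicted_scores), 3))
print("Spearman:", round(spearman(subjective_scores, predicted_scores), 3))
```

Pearson correlation reflects how linearly the predictions track the subjective scores, while Spearman correlation reflects only whether the ranking of samples is preserved, so the two are often reported together.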
The authors thank NVIDIA Corporation for its support through the donation of the Titan XP GPU used in this research.
Cite this article
Albuquerque, R.Q., Mello, C.A.B. Automatic no-reference speech quality assessment with convolutional neural networks. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-05767-4
- Speech quality assessment
- Convolutional neural networks
- Mel-frequency cepstral coefficient