Intelligent Speech Features Mining for Robust Synthesis System Evaluation

  • Moses E. Ekpenyong
  • Udoinyang G. Inyang
  • Victor E. Ekong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10930)


Speech synthesis evaluation involves the analytical description of features sufficient to assess the performance of a speech synthesis system. Its primary focus is to determine how closely a synthetic voice resembles a natural (human) voice. Evaluation is usually driven by two methods, subjective and objective, which have become the standard for assessing voice quality but are challenged by high speech variability and by human discernment errors. Machine learning (ML) techniques have proven successful in determining and enhancing speech quality. Hence, this contribution uses both supervised and unsupervised ML tools to recognize and classify speech quality classes. Data were collected from a listening test (experiment), in which domain experts assessed speech quality for naturalness, intelligibility, and comprehensibility, as well as tone, vowel, and consonant correctness. During the pre-processing stage, Principal Component Analysis (PCA) identified four principal components (intelligibility, naturalness, comprehensibility, and tone), accounting for 76.79% of the variability in the dataset. An unsupervised visualization using a self-organizing map (SOM) then discovered five distinct target clusters with high densities of instances and showed modest correlation between significant input factors. Pattern recognition using a deep neural network (DNN) produced a confusion matrix with an overall accuracy of 93.1%, indicating an excellent classification system.
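The two summary figures quoted in the abstract (the share of variability retained by the leading principal components, and the overall accuracy read off a confusion matrix) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the eigenvalues, and the confusion-matrix counts below are hypothetical placeholders chosen for illustration, not data or results from the paper.

```python
from typing import List, Sequence


def components_for_variance(eigenvalues: Sequence[float], target: float) -> int:
    """Smallest number of leading components whose cumulative share of
    total variance reaches `target` (a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


def overall_accuracy(confusion: List[List[int]]) -> float:
    """Overall accuracy: correctly classified instances (the diagonal)
    divided by all instances in the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical eigenvalues for six rated features (placeholders only).
toy_eigenvalues = [2.0, 1.2, 0.9, 0.7, 0.7, 0.5]
n_components = components_for_variance(toy_eigenvalues, target=0.75)

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [
    [18, 1, 1],
    [2, 16, 2],
    [0, 1, 19],
]
accuracy = overall_accuracy(toy_confusion)

print(f"components kept: {n_components}, overall accuracy: {accuracy:.3f}")
```

In practice the eigenvalues would come from the covariance (or correlation) matrix of the expert ratings, and the confusion matrix from the trained classifier's predictions on held-out data.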


Deep neural network, Dimension reduction, Machine learning, Pattern recognition, Speech quality evaluation



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Moses E. Ekpenyong (1)
  • Udoinyang G. Inyang (1)
  • Victor E. Ekong (1)
  1. Department of Computer Science, University of Uyo, Uyo, Nigeria
