Speech Analytics Based on Machine Learning

  • Grazina Korvel
  • Adam Kurowski
  • Bozena Kostek
  • Andrzej Czyzewski
Part of the Intelligent Systems Reference Library book series (ISRL, volume 149)


This chapter discusses in detail how speech data are prepared for machine learning and shows examples of speech analytics methods applied to phonemes and allophones. It then presents an approach to automatic phoneme recognition that combines optimized parametrization with a machine learning classifier. Feature vectors are built from descriptors drawn from the music information retrieval (MIR) domain. Phoneme classification is then extended beyond the commonly used techniques towards Deep Neural Networks (DNNs): Convolutional Neural Networks (CNNs) are applied to audio data converted to the time-frequency domain (i.e. spectrograms) and exported as images, so that a two-dimensional representation of the speech feature space is employed. When preparing the phoneme dataset for CNNs, zero padding and interpolation techniques are used. The results show that CNNs coupled with the spectrogram representation improve classification accuracy for allophones of the phoneme /l/. In contrast, for vowel classification the approach based on pre-selected features and a conventional machine learning algorithm gives better results.
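The spectrogram preparation mentioned above can be illustrated with a minimal sketch, assuming NumPy is available. It is not the authors' implementation: the function names, the frame length of 256 samples, the hop of 128 samples, the `log1p` compression and the target width of 16 frames are all assumptions chosen for illustration. It shows two ways a phoneme segment of arbitrary duration might be brought to a common image width, presumably so that segments of different lengths can share one CNN input shape: padding the time axis with zeros, or resampling it by linear interpolation.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Log-compressed magnitude spectrogram, shape (frequency bins, frames).

    frame_len, hop and the log1p compression are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Pad very short segments so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return np.log1p(magnitude)  # non-negative, 0 corresponds to silence


def zero_pad(spec, n_frames):
    """Extend the time axis to n_frames with zeros (truncate if longer)."""
    if spec.shape[1] >= n_frames:
        return spec[:, :n_frames]
    return np.pad(spec, ((0, 0), (0, n_frames - spec.shape[1])))


def interpolate(spec, n_frames):
    """Resample the time axis to n_frames by linear interpolation."""
    old = np.linspace(0.0, 1.0, spec.shape[1])
    new = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(new, old, row) for row in spec])


# Hypothetical 50 ms segment at 16 kHz standing in for a short phoneme.
rate = 16000
n_samples = 800
t = np.arange(n_samples) / rate
segment = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(segment)
padded = zero_pad(spec, 16)
resized = interpolate(spec, 16)
print(spec.shape, padded.shape, resized.shape)
```

Zero padding keeps the original time scale and marks the unused part of the image as silence, whereas interpolation stretches the segment to fill the whole image and therefore discards duration information. Which of the two suits a given phoneme class better is an empirical question.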



This research was partially sponsored by the Polish National Science Centre, Dec. No. 2015/17/B/ST6/01874. This work has also been partially supported by the Statutory Funds of the Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • Grazina Korvel (1, 2)
  • Adam Kurowski (1, 3)
  • Bozena Kostek (4)
  • Andrzej Czyzewski (1)

  1. Faculty of Electronics, Telecommunications and Informatics, Multimedia Systems Department, Gdańsk University of Technology, Gdańsk, Poland
  2. Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
  3. Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
  4. Faculty of Electronics, Telecommunications and Informatics, Audio Acoustics Laboratory, Gdańsk University of Technology, Gdańsk, Poland
