Speech Analytics Based on Machine Learning

Chapter in: Machine Learning Paradigms

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 149)

Abstract

In this chapter, the process of speech data preparation for machine learning is discussed in detail. Examples of speech analytics methods applied to phonemes and allophones are shown. Further, an approach to automatic phoneme recognition is discussed, involving optimized parametrization and a machine learning classifier. Feature vectors are built on the basis of descriptors coming from the music information retrieval (MIR) domain. Then, phoneme classification is extended beyond the typically used techniques towards Deep Neural Networks (DNNs). This is done by combining Convolutional Neural Networks (CNNs) with audio data converted to the time-frequency domain (i.e. spectrograms) and then exported as images. In this way, a two-dimensional representation of the speech feature space is employed. When preparing the phoneme dataset for the CNNs, zero-padding and interpolation techniques are used. The obtained results show an improvement in classification accuracy for allophones of the phoneme /l/ when CNNs coupled with the spectrogram representation are employed. Conversely, in the case of vowel classification, the results are better for the approach based on pre-selected features and a conventional machine learning algorithm.
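The two dataset-preparation steps named above — zero-padding variable-length phoneme signals to a common duration, and interpolating the resulting spectrograms to a fixed two-dimensional size so they can serve as image-like CNN inputs — can be sketched as follows. This is a minimal illustration, not the chapter's actual pipeline; the sampling rate (16 kHz), window parameters, and 64×64 output size are assumed values chosen for the example.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import zoom

def pad_to_length(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad (or truncate) a 1-D phoneme recording to a fixed length,
    so every excerpt covers the same number of samples."""
    if len(signal) >= target_len:
        return signal[:target_len]
    padded = np.zeros(target_len, dtype=signal.dtype)
    padded[:len(signal)] = signal
    return padded

def to_fixed_spectrogram(signal: np.ndarray, fs: int = 16000,
                         out_shape: tuple = (64, 64)) -> np.ndarray:
    """Convert a phoneme signal to a log-magnitude spectrogram and
    interpolate it to a fixed 2-D size, yielding an image-like array
    suitable as CNN input."""
    _, _, Sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
    log_S = np.log(Sxx + 1e-10)            # offset avoids log(0)
    factors = (out_shape[0] / log_S.shape[0],
               out_shape[1] / log_S.shape[1])
    return zoom(log_S, factors, order=1)   # order=1: bilinear interpolation

# Example: a short synthetic "phoneme" (440 Hz tone, 62.5 ms at 16 kHz)
tone = np.sin(2 * np.pi * 440 * np.arange(1000) / 16000).astype(np.float32)
padded = pad_to_length(tone, 4000)         # zero-pad to 250 ms
image = to_fixed_spectrogram(padded)       # fixed-size (64, 64) input
```

Interpolation (rather than padding the spectrogram itself) keeps the full time-frequency content of each phoneme while still producing equally sized inputs, at the cost of a phoneme-dependent time scaling.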



Acknowledgements

Research partially sponsored by the Polish National Science Centre, Dec. No. 2015/17/B/ST6/01874. This work has also been partially supported by Statutory Funds of Electronics, Telecommunications and Informatics Faculty, Gdansk University of Technology.

Corresponding author

Correspondence to Bozena Kostek.

Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Korvel, G., Kurowski, A., Kostek, B., Czyzewski, A. (2019). Speech Analytics Based on Machine Learning. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds) Machine Learning Paradigms. Intelligent Systems Reference Library, vol. 149. Springer, Cham. https://doi.org/10.1007/978-3-319-94030-4_6
