Speech Analytics Based on Machine Learning

Chapter in: Machine Learning Paradigms

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 149)

Abstract

In this chapter, the process of speech data preparation for machine learning is discussed in detail. Examples of speech analytics methods applied to phonemes and allophones are shown. Further, an approach to automatic phoneme recognition is discussed, involving optimized parametrization and a machine learning classifier. Feature vectors are built on the basis of descriptors coming from the music information retrieval (MIR) domain. Then, phoneme classification is extended beyond the typically used techniques towards Deep Neural Networks (DNNs). This is done by combining Convolutional Neural Networks (CNNs) with audio data converted to the time-frequency domain (i.e. spectrograms) and then exported as images. In this way, a two-dimensional representation of the speech feature space is employed. When preparing the phoneme dataset for the CNNs, zero-padding and interpolation techniques are used. The obtained results show an improvement in classification accuracy for allophones of the phoneme /l/ when CNNs coupled with the spectrogram representation are employed. Conversely, in the case of vowel classification, the results are better for the approach based on pre-selected features and a conventional machine learning algorithm.
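The two dataset-preparation steps named above — zero-padding variable-length phoneme signals to a common duration, and interpolating the resulting spectrograms to a fixed two-dimensional size so they can serve as image-like CNN inputs — can be sketched as follows. This is a minimal illustration, not the chapter's actual pipeline; the sampling rate (16 kHz), window parameters, and 64×64 output size are assumed values chosen for the example.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import zoom

def pad_to_length(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad (or truncate) a 1-D phoneme recording to a fixed length,
    so every excerpt covers the same number of samples."""
    if len(signal) >= target_len:
        return signal[:target_len]
    padded = np.zeros(target_len, dtype=signal.dtype)
    padded[:len(signal)] = signal
    return padded

def to_fixed_spectrogram(signal: np.ndarray, fs: int = 16000,
                         out_shape: tuple = (64, 64)) -> np.ndarray:
    """Convert a phoneme signal to a log-magnitude spectrogram and
    interpolate it to a fixed 2-D size, yielding an image-like array
    suitable as CNN input."""
    _, _, Sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
    log_S = np.log(Sxx + 1e-10)            # offset avoids log(0)
    factors = (out_shape[0] / log_S.shape[0],
               out_shape[1] / log_S.shape[1])
    return zoom(log_S, factors, order=1)   # order=1: bilinear interpolation

# Example: a short synthetic "phoneme" (440 Hz tone, 62.5 ms at 16 kHz)
tone = np.sin(2 * np.pi * 440 * np.arange(1000) / 16000).astype(np.float32)
padded = pad_to_length(tone, 4000)         # zero-pad to 250 ms
image = to_fixed_spectrogram(padded)       # fixed-size (64, 64) input
```

Interpolation (rather than padding the spectrogram itself) keeps the full time-frequency content of each phoneme while still producing equally sized inputs, at the cost of a phoneme-dependent time scaling.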



Acknowledgements

Research partially sponsored by the Polish National Science Centre, Dec. No. 2015/17/B/ST6/01874. This work has also been partially supported by Statutory Funds of Electronics, Telecommunications and Informatics Faculty, Gdansk University of Technology.

Corresponding author

Correspondence to Bozena Kostek.

Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Korvel, G., Kurowski, A., Kostek, B., Czyzewski, A. (2019). Speech Analytics Based on Machine Learning. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds) Machine Learning Paradigms. Intelligent Systems Reference Library, vol. 149. Springer, Cham. https://doi.org/10.1007/978-3-319-94030-4_6
