
Recognition of Spoken Languages from Acoustic Speech Signals Using Fourier Parameters

  • N. S. Sai Srinivas
  • N. Sugan
  • Niladri Kar
  • L. S. Kumar
  • Malaya Kumar Nath
  • Aniruddha Kanhe
Article

Abstract

Spoken language identification (LID), also called spoken language recognition (LR), is the task of recognizing the language spoken in a speech utterance. In this paper, a new Fourier parameter (FP) model is proposed for speaker-independent spoken language recognition. The performance of the proposed FP features is analyzed and compared with that of the legacy mel-frequency cepstral coefficient (MFCC) features. Two multilingual databases, namely the Indian Institute of Technology Kharagpur Multilingual Indian Language Speech Corpus (IITKGP-MLILSC) and the Oriental Language Recognition Speech Corpus (AP18-OLR), are used to extract FP and MFCC features. Spoken LID/LR models are developed from the extracted FP and MFCC features using three classifiers: support vector machines, feed-forward artificial neural networks, and deep neural networks. Experimental results show that the proposed FP features can effectively recognize different languages from speech signals and that they yield significantly better recognition performance than MFCC features. The recognition performance is further enhanced when MFCC and FP features are combined.
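The pipeline described above (frame-level Fourier-based features per utterance, fed to a supervised classifier) can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the feature extractor here simply averages low-order FFT magnitude bins per frame, the toy "languages" are synthetic tones, and all function names and parameter values are assumptions for demonstration.

```python
# Minimal sketch of an utterance-level Fourier-feature + SVM pipeline.
# Illustrative only: real FP extraction and the databases used in the
# paper are far richer than this toy setup.
import numpy as np
from sklearn.svm import SVC

def fourier_features(signal, frame_len=256, hop=128, n_coeffs=20):
    """Average the magnitudes of the first n_coeffs FFT bins over all
    Hann-windowed frames, giving one fixed-length vector per utterance."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    mags = [np.abs(np.fft.rfft(f * np.hanning(frame_len)))[:n_coeffs]
            for f in frames]
    return np.mean(mags, axis=0)

# Toy data: two "languages" simulated as noisy tones in different bands.
rng = np.random.default_rng(0)
def utterance(freq, sr=8000, dur=1.0):
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * freq * t) + 0.05 * rng.standard_normal(t.size)

X = np.array([fourier_features(utterance(f))
              for f in [200, 210, 220, 400, 410, 420]])
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
pred = clf.predict([fourier_features(utterance(205)),
                    fourier_features(utterance(415))])
print(pred)  # → [0 1]
```

Swapping `fourier_features` for an MFCC extractor (or concatenating both feature sets, as the paper's combined MFCC+FP experiments do) leaves the rest of the pipeline unchanged.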

Keywords

AP18-OLR database · AP16-OL7 database · AP17-OL3 database · Artificial neural networks (ANN) · Deep neural networks (DNN) · Fourier parameters (FP) · IITKGP-MLILSC database · Indian languages · Language identification (LID) · Language recognition (LR) · Long short-term memory networks (LSTM) · Mel-frequency cepstral coefficients (MFCC) · Oriental languages · Recurrent neural networks (RNN) · ReliefF feature selection · Speech signal processing · Supervised learning and classification · Support vector machines (SVM)

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments on earlier drafts of this manuscript; their suggestions helped improve the quality of the final version. The authors express their appreciation to Prof. Dr. K. Sreenivasa Rao and his research team for sharing the IITKGP-MLILSC database, and to Dr. Zhiyuan Tang for sharing the AP18-OLR (AP16-OL7 and AP17-OL3) multilingual database during the course of this research. The authors also thank MathWorks®, Inc., for providing the MATLAB® tool and NCH®, Inc., for providing the WavePad® Sound Editor tool. Any correspondence should be addressed to N. S. Sai Srinivas.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electronics and Communication Engineering, National Institute of Technology Puducherry, Karaikal, India
