Audio Classification

  • Soumya Sen
  • Anjan Dutta
  • Nilanjan Dey
Part of the SpringerBriefs in Applied Sciences and Technology book series (BRIEFSAPPLSCIENCES)


Classification falls under supervised learning. Supervised learning is the process of learning from a given training dataset in which both the input data and the corresponding output labels are provided. Decision rules are derived by observing the training dataset and are then used to determine the category or class of future inputs. Classification is thus the process of assigning an individual item to one of a number of existing categories or classes based on the characteristics or features of the input data.
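The idea can be sketched with a minimal example. The snippet below is purely illustrative, assuming hypothetical two-dimensional feature vectors (e.g., two audio features per clip) and using a simple 1-nearest-neighbour decision rule as one possible classifier:

```python
# Supervised classification sketch: a 1-nearest-neighbour classifier
# "learns" directly from a labelled training set and assigns a new item
# to the class of its closest training example.
# (Hypothetical 2-D feature vectors and labels, for illustration only.)
import math

training_data = [
    ((1.0, 1.2), "speech"),
    ((0.9, 1.0), "speech"),
    ((3.1, 2.8), "music"),
    ((3.0, 3.2), "music"),
]

def classify(x):
    """Return the class of the training example nearest to x (Euclidean)."""
    nearest = min(training_data, key=lambda item: math.dist(x, item[0]))
    return nearest[1]

print(classify((1.1, 1.1)))  # near the "speech" examples -> "speech"
print(classify((2.9, 3.0)))  # near the "music" examples  -> "music"
```

Here the "decision rule" is implicit in the stored training examples and the distance measure; more sophisticated classifiers (Naïve Bayes, decision trees, SVMs, neural networks) instead fit an explicit model to the training data.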



Copyright information

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Soumya Sen (1)
  • Anjan Dutta (2)
  • Nilanjan Dey (3)
  1. A.K. Choudhury School of Information Technology, University of Calcutta, Kolkata, India
  2. Department of Information Technology, Techno India College of Technology, Kolkata, India
  3. Department of Information Technology, Techno India College of Technology, Kolkata, India
