An RNN-Based Speech-Music Discrimination Used for Hybrid Audio Coder

Yang, Wanzhao; Tu, Weiping; Zheng, Jiaxi; Zhang, Xiong; Yang, Yuhong; Song, Yucheng

doi:10.1007/978-3-319-73603-7_7

Conference paper
First Online: 13 January 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10704)

Abstract

Hybrid coders select a coding mode for each type of audio data and thereby achieve high coding efficiency in a universal scheme at low bitrates. Selection accuracy is critical to maintaining the quality of the decoded audio signal, and the computational complexity of the hybrid coder also matters because many applications run on mobile devices. In this paper, a low-complexity coding mode selection method based on Recurrent Neural Networks (RNN) is investigated to improve the coding performance of the state-of-the-art hybrid coder AMR-WB+. A constraint is imposed on the outputs of the RNN by a sigmoid function to improve the SNR of the decoded audio signal. The experimental results show that the proposed method achieves nearly the same decoded audio quality as the closed-loop method, with complexity comparable to the open-loop method in the AMR-WB+ coder, and that it outperforms some recent classification methods for this standard.
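The general idea of a recurrent classifier whose output is squashed by a sigmoid can be illustrated with a minimal sketch. The code below is not the authors' network: the layer sizes, the random untrained weights, the 0.5 threshold, the "speech"/"music" labels, and the function names `rnn_mode_probabilities` and `select_modes` are all assumptions made for illustration. It only shows how a single-layer recurrent network can carry context across frames and emit one bounded score per frame that a mode selector could threshold.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rnn_mode_probabilities(frames, w_in, w_rec, w_out, b_out=0.0):
    """Run a single-layer Elman-style RNN over per-frame feature vectors.

    Returns one sigmoid-bounded score per frame. Reading scores near 1 as
    "speech-like" and near 0 as "music-like" is an assumption of this sketch.
    """
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size  # hidden state carried from frame to frame
    probs = []
    for x in frames:
        new_h = []
        for j in range(hidden_size):
            a = sum(w_in[j][i] * x[i] for i in range(len(x)))
            a += sum(w_rec[j][k] * h[k] for k in range(hidden_size))
            new_h.append(math.tanh(a))
        h = new_h
        logit = sum(w_out[j] * h[j] for j in range(hidden_size)) + b_out
        probs.append(sigmoid(logit))  # output constrained to (0, 1)
    return probs


def select_modes(probs, threshold=0.5):
    """Map each per-frame score to a hypothetical coding-mode label."""
    return ["speech" if p >= threshold else "music" for p in probs]


# Demo with random, untrained weights and random stand-in features.
rng = random.Random(0)
n_features, hidden_size, n_frames = 4, 3, 5
w_in = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(hidden_size)]
w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
w_out = [rng.uniform(-1, 1) for _ in range(hidden_size)]
frames = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_frames)]

probs = rnn_mode_probabilities(frames, w_in, w_rec, w_out)
modes = select_modes(probs)
print(list(zip([round(p, 3) for p in probs], modes)))
```

Because the weights are random, the printed labels carry no meaning. In a trained system the features would come from the coder's analysis stage and the weights from supervised training on labelled audio.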

W. Tu: This work is supported by the National Natural Science Foundation of China (No. 61671335).

Author information

Authors and Affiliations

School of Computer Science, National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan, China
Wanzhao Yang, Weiping Tu, Jiaxi Zheng, Xiong Zhang, Yuhong Yang & Yucheng Song
Research Institute of Wuhan University in Shenzhen, Shenzhen, China
Weiping Tu & Yuhong Yang

Corresponding author

Correspondence to Weiping Tu.

Editor information

Editors and Affiliations

Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria
Klaus Schoeffmann
Chulalongkorn University, Bangkok, Thailand
Thanarat H. Chalidabhongse
City University of Hong Kong, Hong Kong, China
Chong Wah Ngo
Chulalongkorn University, Bangkok, Thailand
Supavadee Aramvith
Dublin City University, Dublin, Ireland
Noel E. O’Connor
Gwangju Institute of Science and Technology, Gwangju, Korea (Republic of)
Yo-Sung Ho
Tampere University of Technology, Tampere, Finland
Moncef Gabbouj
Rutgers University, Piscataway, New Jersey, USA
Ahmed Elgammal

About this paper

Cite this paper

Yang, W., Tu, W., Zheng, J., Zhang, X., Yang, Y., Song, Y. (2018). An RNN-Based Speech-Music Discrimination Used for Hybrid Audio Coder. In: Schoeffmann, K., et al. MultiMedia Modeling. MMM 2018. Lecture Notes in Computer Science, vol 10704. Springer, Cham. https://doi.org/10.1007/978-3-319-73603-7_7

DOI: https://doi.org/10.1007/978-3-319-73603-7_7
Published: 13 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73602-0
Online ISBN: 978-3-319-73603-7
eBook Packages: Computer Science, Computer Science (R0)
