Speech Coding, pp 185-203

Voice Activity Detection

  • Christian Uhle
  • Tom Bäckström
Chapter
Part of the Signals and Communication Technology book series (SCT)

Abstract

Voice Activity Detection (VAD) indicates whether an audio signal contains speech. Besides speech coding and transmission, many other applications in speech and audio processing benefit from this information, and their performance depends crucially on the accuracy and robustness of the applied VAD. Various approaches to speech detection have been developed, but in challenging scenarios, e.g. hands-free communication in noisy environments or dialog with background music, there is still room for improvement. In this chapter, we describe the problem and the environments of VAD, and discuss the detection procedure, example methods, and their evaluation. The more challenging application scenarios in particular illustrate how superior human hearing can be compared to implementations of audio signal processing.

Keywords

Speech Signal; Gaussian Mixture Model; Audio Signal; Automatic Speech Recognition; Voice Over Internet Protocol
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. International Audio Laboratories Erlangen (AudioLabs), Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
