Statistical Model-Based Voice Activity Detection Using Spatial Cues and Log Energy for Dual-Channel Noisy Speech Recognition

  • Conference paper
Communication and Networking (FGCN 2010)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 120)

Abstract

This paper proposes a voice activity detection (VAD) method for dual-channel noisy speech recognition based on statistical models built from spatial cues and log energy. The spatial cues comprise the interaural time differences and interaural level differences of the dual-channel speech signals, and the statistical models for speech presence and absence are based on a Gaussian kernel density. To evaluate the proposed VAD method, speech recognition is performed using only the speech segments it detects. Its performance is then compared with that of two conventional methods: a signal-to-noise ratio variance based method and a phase vector based method. The experiments show that the proposed VAD method outperforms both, yielding relative word error rate reductions of 19.5% and 12.2%, respectively.
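The abstract does not give the model details, so the following is only a minimal sketch of how a Gaussian kernel density (Parzen window) model for speech presence and absence could drive a frame-level decision. It assumes one three-dimensional feature vector per frame (interaural time difference, interaural level difference, log energy), an isotropic kernel with a hand-picked bandwidth, a log-likelihood-ratio test with a zero threshold, and synthetic training data. The function names `gaussian_kde_logpdf` and `vad_decisions` and all numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np


def gaussian_kde_logpdf(x, samples, bandwidth):
    """Log of a Parzen-window density estimate with an isotropic Gaussian kernel.

    x:        (n_query, d) query feature vectors
    samples:  (n_train, d) training feature vectors for one class
    """
    x = np.atleast_2d(x)
    samples = np.atleast_2d(samples)
    d = samples.shape[1]
    # Squared distance between every query and every training sample,
    # scaled by the kernel bandwidth.
    diff = x[:, None, :] - samples[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2) / bandwidth ** 2
    log_kernel = -0.5 * sq_dist - 0.5 * d * np.log(2 * np.pi) - d * np.log(bandwidth)
    # Numerically stable log of the mean kernel value over training samples.
    peak = log_kernel.max(axis=1, keepdims=True)
    log_mean = peak + np.log(np.mean(np.exp(log_kernel - peak), axis=1, keepdims=True))
    return log_mean.ravel()


def vad_decisions(features, speech_samples, noise_samples, bandwidth=0.5, threshold=0.0):
    """Flag a frame as speech when its log-likelihood ratio exceeds the threshold."""
    llr = (gaussian_kde_logpdf(features, speech_samples, bandwidth)
           - gaussian_kde_logpdf(features, noise_samples, bandwidth))
    return llr > threshold


# Synthetic stand-in data (assumption): columns are [ITD, ILD, log energy].
rng = np.random.default_rng(0)
speech_train = rng.normal(loc=[0.0, 0.0, 2.0], scale=0.5, size=(200, 3))
noise_train = rng.normal(loc=[1.5, -1.5, -2.0], scale=0.5, size=(200, 3))

# Two test frames placed at the centers of the speech and noise clusters.
frames = np.array([[0.0, 0.0, 2.0],
                   [1.5, -1.5, -2.0]])
decisions = vad_decisions(frames, speech_train, noise_train)
print(decisions)
```

In a recognition pipeline, only the frames flagged as speech would be passed on to the recognizer. The bandwidth and threshold are tuning parameters here, and a real system would estimate both from held-out data.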

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Park, J.H., Shin, M.H., Kim, H.K. (2010). Statistical Model-Based Voice Activity Detection Using Spatial Cues and Log Energy for Dual-Channel Noisy Speech Recognition. In: Kim, Th., Vasilakos, T., Sakurai, K., Xiao, Y., Zhao, G., Ślęzak, D. (eds) Communication and Networking. FGCN 2010. Communications in Computer and Information Science, vol 120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17604-3_19

  • DOI: https://doi.org/10.1007/978-3-642-17604-3_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-17603-6

  • Online ISBN: 978-3-642-17604-3

  • eBook Packages: Computer Science, Computer Science (R0)
