Binaural Speech Separation Using Recurrent Timing Neural Networks for Joint F0-Localisation Estimation

Wrigley, Stuart N.; Brown, Guy J.

doi:10.1007/978-3-540-78155-4_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4892)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

A speech separation system is described in which sources are represented in a joint interaural time difference-fundamental frequency (ITD-F0) cue space. Traditionally, recurrent timing neural networks (RTNNs) have been used only to extract periodicity information; in this study, this type of network is extended in two ways. Firstly, a coincidence detector layer is introduced, each node of which is tuned to a particular ITD; secondly, the RTNN is extended to become two-dimensional, allowing periodicity analysis to be performed at each best-ITD. Thus, one axis of the RTNN represents F0 and the other represents ITD, allowing sources to be segregated on the basis of their separation in ITD-F0 space. Source segregation is performed within individual frequency channels, without recourse to the across-channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system is evaluated on spatialised speech signals using energy-based metrics and automatic speech recognition.
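The joint ITD-F0 analysis described in the abstract can be illustrated with a small sketch. This is not the authors' network: the recurrent layer is approximated here by a leaky feedback delay loop (a comb filter), the input is a synthetic pulse train rather than speech, there is no auditory filterbank, and every function name, the loop gain of 0.9, and the lag and period ranges are assumptions made for illustration only. Delays and periods are expressed in samples.

```python
def pulse_train(n_samples, period):
    """Synthetic periodic source: a unit pulse every `period` samples."""
    return [1.0 if t % period == 0 else 0.0 for t in range(n_samples)]


def delay(signal, d):
    """Delay a signal by d >= 0 samples, zero-padding the start."""
    return [0.0] * d + signal[:len(signal) - d]


def coincidence(left, right, itd):
    """Coincidence detector tuned to one ITD (assumed convention:
    itd > 0 means the right-ear signal lags the left by itd samples).
    The leading ear is delayed so that the two ears line up, then the
    signals are multiplied sample by sample."""
    if itd >= 0:
        l, r = delay(left, itd), right
    else:
        l, r = left, delay(right, -itd)
    return [a * b for a, b in zip(l, r)]


def loop_energy(x, period, gain=0.9):
    """Simplified stand-in for one recurrent delay loop: the input is
    added to an attenuated copy of the loop output from `period` samples
    earlier. The mean squared output is largest when the loop delay
    matches the periodicity of the input."""
    y = [0.0] * len(x)
    for t, v in enumerate(x):
        feedback = y[t - period] if t >= period else 0.0
        y[t] = v + gain * feedback
    return sum(s * s for s in y) / len(y)


def itd_f0_map(left, right, itds, periods, gain=0.9):
    """Two-dimensional cue map: one axis is ITD, the other is period."""
    return {(itd, p): loop_energy(coincidence(left, right, itd), p, gain)
            for itd in itds for p in periods}


def strongest_cell(cue_map):
    """Return the (itd, period) cell with the highest loop energy."""
    return max(cue_map, key=cue_map.get)


# One source with a 20-sample period, reaching the right ear 3 samples late.
left = pulse_train(800, 20)
right = delay(left, 3)
cue_map = itd_f0_map(left, right, range(-5, 6), range(10, 31))
print(strongest_cell(cue_map))  # → (3, 20)
```

In this toy case the map has a single dominant cell at the source's ITD and period. The candidate periods are deliberately kept below twice the true period, because a plain feedback loop also responds to multiples of the period.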

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, 211 Portobello Street, Sheffield, S1 4DP, United Kingdom
Stuart N. Wrigley & Guy J. Brown

Editor information

Andrei Popescu-Belis, Steve Renals, Hervé Bourlard

Cite this paper

Wrigley, S.N., Brown, G.J. (2008). Binaural Speech Separation Using Recurrent Timing Neural Networks for Joint F0-Localisation Estimation. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_24

DOI: https://doi.org/10.1007/978-3-540-78155-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78154-7
Online ISBN: 978-3-540-78155-4
eBook Packages: Computer Science, Computer Science (R0)
