Size Matters in Hearing: How the Auditory System Normalizes the Sounds of Speech and Music for Source Size

  • Roy D. Patterson
  • Toshio Irino
Part of the Springer Handbook of Auditory Research book series (SHAR, volume 50)


The sounds that mammals use to communicate, including the voiced parts of speech, have a very special “pulse resonance” form. In 1992, we drew attention to the fascinating time-interval patterns that these sounds produce at the output of a gammatone auditory filter bank (GT-AFB), and we described how to construct stabilized auditory images (SAIs) in which the time-interval patterns appear and evolve as distinctive auditory events. Since that time, the filter bank work has been extended to determine the “optimal” form of level-dependent AFB, and the SAI work has been extended to demonstrate that the stabilized time-interval patterns play a role in auditory perception. These two streams of research are presented as appendices in Sections 5 and 4 of this chapter, respectively.

The mathematics of the optimal AFB drew our attention to the fact that auditory perception is largely scale invariant: listeners can understand speakers whatever the speaker's size. We describe why size invariance is important in Section 1, and show how the auditory system might construct a scale-invariant version of the SAI in Section 2. In Section 3, we describe research intended to demonstrate the value of scale invariance in the perception of speech and music, and to argue that machine processing of speech and music would be enhanced if feature extraction were based on a size-invariant SAI rather than a spectrographic representation of sound.


Vocal Tract; Auditory Perception; Just Noticeable Difference; Auditory Event; Auditory Filter
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



 We thank Jess Monaghan and Etienne Gaudrain for their contributions to the model of speaker-size estimation. We thank Tom Walters for assistance with Fig. 23.1 and Ralph van Dinther for assistance with Figs. 23.4, 23.5, and 23.6.


  1. Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using magnetic resonance imaging. Journal of the Acoustical Society of America, 106, 1511–1522.
  2. Gabor, D. (1946). Theory of communication. Journal of the Institution of Electrical Engineers (London), 93, 429–457.
  3. Irino, T., & Kawahara, H. (1993). Signal reconstruction from modified auditory wavelet transform. IEEE Transactions on Signal Processing, 41, 3549–3554.
  4. Irino, T., & Patterson, R. D. (1996). Temporal asymmetry in the auditory system. Journal of the Acoustical Society of America, 99, 2316–2331.
  5. Irino, T., & Patterson, R. D. (1997). A time-domain level-dependent auditory filter: The gammachirp. Journal of the Acoustical Society of America, 101, 412–419.
  6. Irino, T., & Patterson, R. D. (2002). Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform. Speech Communication, 36, 181–203.
  7. Irino, T., & Patterson, R. D. (2006). A dynamic compressive gammachirp auditory filterbank. IEEE Transactions on Audio, Speech, and Language Processing, 14, 2222–2232.
  8. Irino, T., Aoki, Y., Kawahara, H., & Patterson, R. D. (2012). Comparison of performance with voiced and whispered speech in word recognition and mean-formant-frequency discrimination. Speech Communication, 54, 998–1013.
  9. Ives, D. T., Smith, D. R. R., & Patterson, R. D. (2005). Discrimination of speaker size from syllable phrases. Journal of the Acoustical Society of America, 118, 3816–3822.
  10. Lee, S., Potamianos, A., & Narayanan, S. (1999). Acoustics of children’s speech: Developmental changes of temporal and spectral parameters. Journal of the Acoustical Society of America, 105, 1455–1468.
  11. Patterson, R. D. (1994). The sound of a sinusoid: Time-interval models. Journal of the Acoustical Society of America, 96, 1419–1428.
  12. Patterson, R. D., & Irino, T. (1998). Modeling temporal asymmetry in the auditory system. Journal of the Acoustical Society of America, 104, 2967–2979.
  13. Patterson, R. D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., & Allerhand, M. (1992). Complex sounds and auditory images. In Y. Cazals, L. Demany, & K. Horner (Eds.), Auditory physiology and perception (pp. 429–446). Oxford: Pergamon Press.
  14. Patterson, R. D., Allerhand, M. H., & Giguère, C. (1995). Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. Journal of the Acoustical Society of America, 98, 1890–1894.
  15. Patterson, R. D., Uppenkamp, S., Johnsrude, I., & Griffiths, T. D. (2002). The processing of temporal pitch and melody information in auditory cortex. Neuron, 36, 767–776.
  16. Patterson, R. D., Unoki, M., & Irino, T. (2003). Extending the domain of center frequencies for the compressive gammachirp auditory filter. Journal of the Acoustical Society of America, 114, 1529–1542.
  17. Patterson, R. D., van Dinther, R., & Irino, T. (2007). The robustness of bio-acoustic communication and the role of normalization. In Proceedings of the 19th International Congress on Acoustics (Madrid), pp. a-07–011.
  18. Patterson, R. D., Smith, D. R. R., van Dinther, R., & Walters, T. C. (2008). Size information in the production and perception of communication sounds. In W. A. Yost, A. N. Popper, & R. R. Fay (Eds.), Auditory perception of sound sources (pp. 43–75). New York: Springer Science + Business Media.
  19. Patterson, R. D., Gaudrain, E., & Walters, T. C. (2010). The perception of family and register in musical tones. In M. R. Jones, R. R. Fay, & A. N. Popper (Eds.), Music perception (pp. 13–50). New York: Springer Science + Business Media.
  20. Smith, D. R. R., & Patterson, R. D. (2005). The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex and age. Journal of the Acoustical Society of America, 118, 3177–3186.
  21. Smith, D. R. R., Patterson, R. D., Turner, R. E., Kawahara, H., & Irino, T. (2005). The processing and perception of size information in speech sounds. Journal of the Acoustical Society of America, 117, 305–318.
  22. Turner, R. E., Walters, T. C., Monaghan, J. J. M., & Patterson, R. D. (2009). A statistical formant-pattern model for estimating vocal-tract length from formant frequency data. Journal of the Acoustical Society of America, 125, 2374–2386.
  23. Walters, T. C. (2011). Auditory-based processing of communication sounds. Ph.D. dissertation, University of Cambridge.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Centre for the Neural Basis of Hearing, PDN, University of Cambridge, Cambridge, UK
  2. Faculty of Systems Engineering, Wakayama University, Wakayama, Japan
