Acoustic features of speech include various spectral and temporal cues. The temporal envelope is known to play a critical role in speech recognition by human listeners, whereas automatic speech recognition (ASR) relies heavily on spectral analysis. This study compared sentence-recognition scores of humans and of an ASR program (Dragon) when spectral and temporal-envelope cues were manipulated in background noise. The temporal fine structure of meaningful sentences was reduced by noise or tone vocoders. Three types of background noise were introduced: white noise, time-reversed multi-talker noise, and fake-formant noise. Spectral information was manipulated by varying the number of frequency channels. With a 20-dB signal-to-noise ratio (SNR) and four vocoding channels, white noise had a stronger disruptive effect on human listeners than fake-formant noise; with 22 channels, the same pattern emerged when the SNR was lowered to 0 dB. In contrast, the ASR was unable to function with four vocoding channels even at a 20-dB SNR, and its performance was least affected by white noise and most affected by fake-formant noise. Increasing the number of channels, which improved spectral resolution, produced non-monotonic ASR performance with white noise but not with the colored noises. The ASR also performed markedly better with tone vocoders. Fake-formant noise may have degraded the software's performance by disrupting spectral cues, whereas white noise may have affected performance by compromising speech segmentation. Overall, these results suggest that human listeners and ASR employ different listening strategies in noise.
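The vocoding manipulation described above can be sketched as follows. This is a minimal, hypothetical illustration of channel (noise) vocoding, not the study's actual processing chain: the filterbank design, band edges, and envelope-extraction method here are assumptions for clarity. The speech is split into a small number of frequency bands, the temporal envelope of each band is extracted, and each envelope modulates a band-limited noise carrier, discarding the temporal fine structure while preserving per-channel envelopes.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(speech, fs, n_channels=4, lo=100.0, hi=7000.0):
    """Replace temporal fine structure with noise carriers while keeping
    per-channel temporal envelopes (a generic channel-vocoder sketch).

    Band edges, filter order, and envelope extraction are illustrative
    assumptions, not the parameters used in the study.
    """
    rng = np.random.default_rng(0)
    # Log-spaced band edges spanning the speech range.
    edges = np.geomspace(lo, hi, n_channels + 1)
    out = np.zeros_like(speech, dtype=float)
    for k in range(n_channels):
        # Band-pass filter this analysis channel.
        sos = butter(4, [edges[k], edges[k + 1]], btype="band",
                     fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)
        # Temporal envelope via the Hilbert transform magnitude.
        env = np.abs(hilbert(band))
        # Noise carrier limited to the same band (tone vocoding would
        # use a sinusoid at the band's center frequency instead).
        carrier = sosfiltfilt(sos, rng.standard_normal(len(speech)))
        out += env * carrier
    return out
```

Increasing `n_channels` restores spectral resolution (the independent variable varied in the study), while the carrier type (noise vs. tone) determines which vocoder condition is simulated.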
We thank L. Carney for offering significant input on the manuscript. We thank L. Calandruccio for providing the sentences. We also thank the reviewers for their tremendous help and insight on the manuscript.
Compliance with Ethical Standards
Conflict of Interest
The authors declare that they have no conflict of interest.