Cognition of Phones

Chapter in: Time Domain Representation of Speech Sounds

Abstract

The places of articulation of vowels, plosives, and affricates are generally believed to be perceived on the basis of the formant frequencies, particularly the first two formants and their dynamic behavior. Bangla, like most of the major Indian languages, has a large number of plosive/stop sounds, around 20. These are organized into four groups based on place of articulation, generally named velar, alveolar (retroflexed in most languages), dental, and labial. Each group in turn has five different manners of production, namely unaspirated unvoiced, aspirated unvoiced, unaspirated voiced, aspirated voiced, and nasal. Since the early 50s there has been considerable interest at the Indian Statistical Institute in the recognition of place of articulation, for both human cognition and Automatic Speech Recognition (ASR). This chapter presents details of this endeavor. For this investigation, 400 segments of each of the seven Bangla vowels were used, extracted from the most frequently used Bangla words embedded in a neutral carrier sentence and spoken by four male and four female native informants. For objective studies, attention was mainly limited to the first two formant frequencies, F1 and F2. The distributions of the different Bangla vowels in the F1–F2 plane and their spreads are described. These show considerable overlap: a careful perusal of the distributions reveals regions where more than one vowel coexists. However, the clusters, though not disjoint, show reasonable power of discrimination (a recognition rate of about 85% has been reported for Bangla vowels). This chapter also presents a detailed study in the cognitive domain to test whether the aforesaid formant-related parameters generally used for machine recognition are also cognitively relevant.
For this purpose, the hypothesis tested is that the first two formants are necessary and sufficient for human cognition of place of articulation: the place of articulation of plosives is determined by the transitions of the first two formants, and that of vowels by their steady-state values. The hypothesis was not supported at a satisfactory level of significance. Specially prepared signals derived from actual speech sounds are used for the cognitive experiments, and the technique of preparation is described in detail. Listening tests are conducted with 30 native listeners. The experimental procedure and the corresponding results on objective identification of vowels and plosives, using a large speech database collected from native informants of both sexes, are presented in detail. The study of the cognition of place of articulation required fine-scale manipulations at the signal level, which are also described in the necessary detail. The results of cognition with these manipulated signals are discussed. The chapter also includes an in-depth study of the spectral cues for nasal/oral distinction in Bangla vowels. This is significant because in Bangla, in contrast with English, nasality is phonemic. A comprehensive study in the cognitive domain identified that only one or two cognitive cues, usually not referred to in the literature, are really necessary and sufficient for nasal/oral distinction. The paradigm of analysis through synthesis has been used for this investigation.
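
As a rough illustration of what discrimination in the F1–F2 plane can mean, the sketch below assigns a vowel token to the class whose mean (F1, F2) point is nearest and then computes a recognition rate. The abstract does not say which classifier produced the reported figure, so the nearest-mean rule, the Euclidean distance in raw Hz, the class labels, and every formant value here are assumptions made for illustration only; none of them are measurements or methods from the chapter.

```python
from math import hypot
from statistics import mean

# Hypothetical (F1, F2) samples in Hz for three placeholder vowel classes.
# Illustrative values only; NOT data from the chapter.
TRAINING = {
    "vowel_A": [(300, 2300), (320, 2200), (280, 2250)],
    "vowel_B": [(700, 1200), (750, 1300), (680, 1250)],
    "vowel_C": [(350, 800), (330, 900), (370, 850)],
}


def class_means(training):
    """Return the mean (F1, F2) point of each class."""
    return {
        label: (mean(f1 for f1, _ in pts), mean(f2 for _, f2 in pts))
        for label, pts in training.items()
    }


def classify(f1, f2, means):
    """Assign (f1, f2) to the class whose mean is nearest in the F1-F2 plane."""
    return min(
        means,
        key=lambda label: hypot(f1 - means[label][0], f2 - means[label][1]),
    )


def recognition_rate(samples, means):
    """Fraction of labelled (label, f1, f2) samples that are classified correctly."""
    hits = sum(classify(f1, f2, means) == label for label, f1, f2 in samples)
    return hits / len(samples)


MEANS = class_means(TRAINING)

# Hypothetical test tokens; the last one lies far from its own class mean,
# so it is misassigned, mimicking the overlap between vowel clusters.
TEST_TOKENS = [
    ("vowel_A", 310, 2240),
    ("vowel_C", 340, 860),
    ("vowel_B", 500, 1500),
    ("vowel_A", 340, 1500),
]

if __name__ == "__main__":
    print(recognition_rate(TEST_TOKENS, MEANS))
```

Because F2 spans a much wider range than F1, an unscaled Euclidean distance weights F2 more heavily; a perceptual frequency scale or a per-class covariance would be a natural refinement, but that choice is outside what the abstract specifies.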

Author information

Correspondence to Asoke Kumar Datta.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Datta, A.K. (2018). Cognition of Phones. In: Time Domain Representation of Speech Sounds. Springer, Singapore. https://doi.org/10.1007/978-981-13-2303-4_3

  • DOI: https://doi.org/10.1007/978-981-13-2303-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2302-7

  • Online ISBN: 978-981-13-2303-4

  • eBook Packages: Computer Science, Computer Science (R0)
