Why Is the Recognition of Spontaneous Speech so Hard?

  • Conference paper
Text, Speech and Dialogue (TSD 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3658)

Abstract

Speech read from a prepared text, such as newspaper readings or broadcast news, can be recognized with high accuracy, but recognition accuracy drops drastically for spontaneous speech. The reason is that spontaneous speech and read speech differ significantly, both acoustically and linguistically. This paper reports on the analysis and recognition of spontaneous speech using a large-scale spontaneous speech database, the “Corpus of Spontaneous Japanese (CSJ)”. The recognition experiments show that accuracy increases significantly with the amount of training data for both the acoustic model and the language model, and that the improvement levels off at approximately 7M words of training data. This means that the acoustic and linguistic variation of spontaneous speech is so large that a very large corpus is needed to cover it. Spectral analysis of various utterance styles in the CSJ shows that the spectral distribution of phonemes and the spectral differences between them are significantly reduced in spontaneous speech compared with read speech. The experimental results also show a strong correlation between the mean spectral distance between phonemes and phoneme recognition accuracy, which indicates that spectral reduction is one major reason for the lower recognition accuracy of spontaneous speech.
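The relationship between mean spectral distance and phoneme accuracy can be illustrated with a minimal sketch. The abstract does not state which spectral features or distance measure were used, so everything here is an assumption: each phoneme is represented by the mean of its feature frames, the distance is Euclidean, and the correlation is Pearson's. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def mean_vector(frames):
    """Average a list of equal-length feature vectors (e.g. cepstral frames)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def mean_phoneme_distance(phoneme_frames):
    """Mean Euclidean distance between the mean vectors of all phoneme pairs.

    phoneme_frames maps a phoneme label to a list of feature frames.
    The Euclidean distance is an assumption; the paper may use another measure.
    """
    means = {p: mean_vector(frames) for p, frames in phoneme_frames.items()}
    distances = [
        math.dist(means[a], means[b]) for a, b in combinations(sorted(means), 2)
    ]
    return sum(distances) / len(distances)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy 2-D features, purely illustrative (not data from the CSJ).
    read_style = {"a": [[0.0, 0.0]], "i": [[2.0, 4.0], [4.0, 4.0]]}
    # The same phonemes with their means pulled closer together,
    # mimicking the spectral reduction described for spontaneous speech.
    spontaneous_style = {"a": [[0.0, 0.0]], "i": [[1.0, 2.0], [2.0, 2.0]]}

    print("read:", mean_phoneme_distance(read_style))
    print("spontaneous:", mean_phoneme_distance(spontaneous_style))
```

Given one mean distance and one phoneme accuracy per speaker or speaking style, `pearson(distances, accuracies)` would give the kind of correlation the abstract refers to, under the assumptions stated above.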

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Furui, S., Nakamura, M., Ichiba, T., Iwano, K. (2005). Why Is the Recognition of Spontaneous Speech so Hard? In: Matoušek, V., Mautner, P., Pavelka, T. (eds) Text, Speech and Dialogue. TSD 2005. Lecture Notes in Computer Science, vol 3658. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551874_3

  • DOI: https://doi.org/10.1007/11551874_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28789-6

  • Online ISBN: 978-3-540-31817-0

  • eBook Packages: Computer Science, Computer Science (R0)
