
Leveraging topical and positional cues for language modeling in speech recognition

Multimedia Tools and Applications

Abstract

This paper investigates language modeling with topical and positional information for large vocabulary continuous speech recognition. We first compare several topic models, including document topic models and word topic models, both theoretically and empirically. In addition, for some kinds of spoken documents, such as broadcast news stories, documents of the same style tend to share a similar composition and similar word usage. Such documents can therefore be divided, according to their literary structure, into partitions with a consistent rhetorical or topical style, such as introductory remarks, elucidations of methodology or affairs, conclusions, and reporters' references or footnotes. We hence present two position-dependent language models for speech recognition that integrate word positional information into the existing n-gram and topic models. Experiments conducted on broadcast news transcription seem to indicate that such position-dependent models obtain results comparable to those of the existing n-gram and topic models.
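The abstract does not give the equations of the two position-dependent models. To make the general idea concrete, here is a minimal, hypothetical Python sketch. It is not the authors' formulation. It assumes the following:

* Each document is split into `num_bins` equal-width partitions by relative word position. This is a crude stand-in for structural sections such as introductory remarks or conclusions.
* A position-specific bigram model is trained for each partition.
* Each position-specific model is linearly interpolated with a position-independent background bigram model, using fixed weights `lam` and `beta`.

All class, function, and parameter names are illustrative.

```python
from collections import Counter
from typing import Iterable


def position_bin(index: int, length: int, num_bins: int) -> int:
    """Map a word's index in a document to one of num_bins relative-position partitions.

    Hypothetical partitioning: equal-width bins over the document, not the paper's scheme.
    """
    return min(num_bins - 1, index * num_bins // length)


class PositionDependentBigram:
    """Hypothetical sketch: a background bigram model interpolated with
    position-specific bigram models, one per relative-position partition."""

    def __init__(self, docs: Iterable[list[str]], num_bins: int = 3, lam: float = 0.5):
        self.num_bins = num_bins
        self.lam = lam  # assumed weight on the position-specific component (not tuned)
        self.unigram = Counter()
        self.bigram = Counter()
        self.history = Counter()
        self.pos_bigram = [Counter() for _ in range(num_bins)]
        self.pos_history = [Counter() for _ in range(num_bins)]
        for doc in docs:
            tokens = ["<s>"] + doc  # sentence-start symbol as the first history
            for i in range(1, len(tokens)):
                prev, word = tokens[i - 1], tokens[i]
                b = position_bin(i - 1, len(doc), num_bins)  # position of word in doc
                self.unigram[word] += 1
                self.bigram[(prev, word)] += 1
                self.history[prev] += 1
                self.pos_bigram[b][(prev, word)] += 1
                self.pos_history[b][prev] += 1
        self.vocab_size = len(self.unigram) + 1  # +1 reserves mass for unseen words
        self.total = sum(self.unigram.values())

    def _unigram_prob(self, word: str) -> float:
        # Add-one smoothed unigram, used as the final fallback distribution.
        return (self.unigram[word] + 1) / (self.total + self.vocab_size)

    def _interp(self, bigrams: Counter, histories: Counter,
                prev: str, word: str, beta: float = 0.7) -> float:
        # Jelinek-Mercer interpolation of the ML bigram with the smoothed unigram;
        # falls back to the unigram alone when the history was never observed.
        if histories[prev] == 0:
            return self._unigram_prob(word)
        ml = bigrams[(prev, word)] / histories[prev]
        return beta * ml + (1 - beta) * self._unigram_prob(word)

    def prob(self, prev: str, word: str, index: int, length: int) -> float:
        """P(word | prev, position) for a word at `index` in a document of `length` words."""
        b = position_bin(index, length, self.num_bins)
        p_bg = self._interp(self.bigram, self.history, prev, word)
        p_pos = self._interp(self.pos_bigram[b], self.pos_history[b], prev, word)
        return self.lam * p_pos + (1 - self.lam) * p_bg


# Usage on a toy corpus of two "stories".
docs = [["the", "market", "rose"], ["the", "market", "fell", "today"]]
lm = PositionDependentBigram(docs, num_bins=2, lam=0.5)
print(lm.prob("the", "market", index=1, length=3))
```

The fixed weights and the add-one unigram fallback only keep the sketch self-contained. A realistic system would more likely estimate the interpolation weights on held-out data, for example with EM. It would also use stronger smoothing, such as Kneser-Ney, for the background model.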





Acknowledgments

This work was sponsored in part by “Aim for the Top University Plan” of National Taiwan Normal University and Ministry of Education, Taiwan, and the National Science Council, Taiwan, under Grants NSC 101-2221-E-003-024-MY3, NSC 101-2511-S-003-057-MY3, NSC 101-2511-S-003-047-MY3, NSC 99-2221-E-003-017-MY3, and NSC 98-2221-E-003-011-MY3.

Author information

Correspondence to Berlin Chen.


About this article

Cite this article

Chiu, HS., Chen, KY. & Chen, B. Leveraging topical and positional cues for language modeling in speech recognition. Multimed Tools Appl 72, 1465–1481 (2014). https://doi.org/10.1007/s11042-013-1456-2


  • DOI: https://doi.org/10.1007/s11042-013-1456-2
