Abstract
This paper investigates language modeling with topical and positional information for large vocabulary continuous speech recognition. We first compare several topic models, including document topic models and word topic models, both theoretically and empirically. We then observe that, for some spoken documents such as broadcast news stories, documents of the same style tend to share a similar composition and word usage. Such documents can therefore be separated, according to their literary structure, into partitions of identical rhetorical or topical style, such as introductory remarks, elucidations of methodology or affairs, conclusions of the articles, and references or footnotes of reporters. Based on this observation, we present two position-dependent language models for speech recognition that integrate word positional information into the existing n-gram and topic models. Experiments conducted on broadcast news transcription suggest that these position-dependent models obtain results comparable to those of the existing n-gram and topic models.
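The position-dependent idea can be illustrated with a toy sketch. The abstract does not give the model equations, so everything below is an assumption made for illustration rather than the authors' formulation: each document is cut into a fixed number of equal-sized partitions by relative word position, an add-one smoothed unigram model is estimated per partition, and that model is linearly interpolated with a position-independent background unigram. The function names, the number of partitions, the smoothing scheme, and the interpolation weight are all hypothetical.

```python
from collections import Counter

def partition_index(pos, doc_len, num_parts):
    """Map a 0-based word position to a partition by relative position."""
    return min(pos * num_parts // doc_len, num_parts - 1)

def train(docs, num_parts=3):
    """Collect background and per-partition unigram counts from tokenized docs."""
    background = Counter()
    by_part = [Counter() for _ in range(num_parts)]
    for doc in docs:
        for pos, word in enumerate(doc):
            background[word] += 1
            by_part[partition_index(pos, len(doc), num_parts)][word] += 1
    return background, by_part

def smoothed_prob(word, counts, vocab_size):
    """Add-one smoothed unigram probability (an assumed smoothing scheme)."""
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

def position_dependent_prob(word, pos, doc_len, model, weight=0.5):
    """Interpolate the partition-specific unigram with the background unigram.

    `weight` is a hypothetical interpolation weight, not a value from the paper.
    """
    background, by_part = model
    vocab_size = len(background)
    part_counts = by_part[partition_index(pos, doc_len, len(by_part))]
    return (weight * smoothed_prob(word, part_counts, vocab_size)
            + (1 - weight) * smoothed_prob(word, background, vocab_size))

# Toy usage: "good" opens both stories, so it should be more likely
# at the start of a document than at the end.
docs = [
    "good evening here is the news the storm hit the coast reporting from taipei".split(),
    "good evening tonight the election results are in reporting from tainan".split(),
]
model = train(docs, num_parts=3)
p_start = position_dependent_prob("good", 0, 12, model)
p_end = position_dependent_prob("good", 11, 12, model)
```

In this sketch the positional signal enters only through which partition's counts are consulted. The same partition index could in principle condition an n-gram or topic model instead of a unigram, which is closer to what the abstract describes.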
Acknowledgments
This work was sponsored in part by the “Aim for the Top University Plan” of National Taiwan Normal University and the Ministry of Education, Taiwan, and by the National Science Council, Taiwan, under Grants NSC 101-2221-E-003-024-MY3, NSC 101-2511-S-003-057-MY3, NSC 101-2511-S-003-047-MY3, NSC 99-2221-E-003-017-MY3, and NSC 98-2221-E-003-011-MY3.
Cite this article
Chiu, HS., Chen, KY. & Chen, B. Leveraging topical and positional cues for language modeling in speech recognition. Multimed Tools Appl 72, 1465–1481 (2014). https://doi.org/10.1007/s11042-013-1456-2
DOI: https://doi.org/10.1007/s11042-013-1456-2