Class-Based Language Model Adaptation

Emele, Martin C.; Valsan, Zica; Lam, Yin Hay; Goronzy, Silke

doi:10.1007/3-540-36678-4_7


Part of the book series: Cognitive Technologies (COGTECH)

Summary

In this paper we introduce and evaluate two class-based language model adaptation techniques for adapting general n-gram-based background language models to a specific spoken dialogue task. The required background language models are derived from available newspaper corpora and Internet newsgroup collections. We follow a standard mixture-based approach to language model adaptation: we generate several clusters of topic-specific language models and combine them into a single target language model, using weights that depend on the chosen application domain. In addition, we develop a novel word n-gram pruning technique for domain adaptation and propose a new approach to thematic text clustering. This approach relies on a new discriminative n-gram-based key term selection process; the selected key terms are then used to automatically cluster the whole document collection. By selecting only the relevant text clusters for language model training, we address the problem of generating task-specific language models. Different key term selection methods are investigated using perplexity as the evaluation measure. Automatically computed clusters are compared with manually labeled genre clusters, and the results show a significant performance improvement, the size of which depends on the chosen key term selection method.
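The mixture-based idea in the summary (combining topic-specific models with domain-dependent weights and judging the result by perplexity) can be illustrated with a small sketch. This is not the chapter's implementation: the summary gives no formulas, so the code below assumes plain linear interpolation, add-alpha smoothed unigram components instead of full n-gram models, and EM re-estimation of the weights on a held-out in-domain sample. All function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed component model)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_prob(word, models, weights):
    """Linear interpolation: P(w) = sum_k lambda_k * P_k(w)."""
    return sum(lam * m[word] for lam, m in zip(weights, models))

def perplexity(tokens, models, weights):
    """Perplexity of the interpolated model on a token sequence."""
    log_sum = sum(math.log(mixture_prob(w, models, weights)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

def estimate_weights(tokens, models, iterations=20):
    """EM re-estimation of the mixture weights, starting from uniform weights."""
    k = len(models)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for w in tokens:
            denom = mixture_prob(w, models, weights)
            for i, m in enumerate(models):
                # Posterior probability that component i generated token w.
                expected[i] += weights[i] * m[w] / denom
        weights = [e / len(tokens) for e in expected]
    return weights

# Toy stand-ins for two topic clusters and a held-out in-domain sample.
news = "the market fell as the bank raised rates".split()
tv = "show me the movie on the channel tonight".split()
heldout = "show the movie channel tonight".split()

vocab = set(news) | set(tv) | set(heldout)
models = [train_unigram(news, vocab), train_unigram(tv, vocab)]

uniform = [0.5, 0.5]
adapted = estimate_weights(heldout, models)

print("uniform perplexity:", perplexity(heldout, models, uniform))
print("adapted perplexity:", perplexity(heldout, models, adapted))
```

Because EM never decreases the held-out likelihood, the adapted weights give a perplexity no higher than the uniform starting point on that sample. With real data the weights would be tuned on one in-domain set and the perplexity reported on a separate one.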



Author information

Authors and Affiliations

Sony Corporate Laboratories Europe, Advanced Software Laboratory, Sony International (Europe) GmbH, Stuttgart, Germany
Martin C. Emele, Zica Valsan, Yin Hay Lam & Silke Goronzy


Editor information

Editors and Affiliations

German Research Center for AI, DFKI GmbH, Stuhlsatzenhausweg 3, 66123, Saarbrücken, Germany
Wolfgang Wahlster


About this chapter

Cite this chapter

Emele, M.C., Valsan, Z., Lam, Y.H., Goronzy, S. (2006). Class-Based Language Model Adaptation. In: Wahlster, W. (ed.) SmartKom: Foundations of Multimodal Dialogue Systems. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36678-4_7


DOI: https://doi.org/10.1007/3-540-36678-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23732-7
Online ISBN: 978-3-540-36678-2
eBook Packages: Computer Science, Computer Science (R0)
