
Class-Based Language Model Adaptation


Part of the book series: Cognitive Technologies (COGTECH)

Summary

In this paper we introduce and evaluate two class-based language model adaptation techniques for adapting general n-gram background language models to a specific spoken dialogue task. The background language models are derived from available newspaper corpora and Internet newsgroup collections. We followed a standard mixture-based approach to language model adaptation: we generated several clusters of topic-specific language models and combined them into a target language model, with weights that depend on the chosen application domain. In addition, we developed a novel word n-gram pruning technique for domain adaptation and proposed a new approach to thematic text clustering. This approach relies on a new discriminative n-gram-based key term selection process; the selected key terms are then used to cluster the whole document collection automatically. By selecting only the relevant text clusters for language model training, we addressed the problem of generating task-specific language models. Different key term selection methods are investigated, with perplexity as the evaluation measure. Automatically computed clusters are compared with manually labeled genre clusters, and the results show a significant performance improvement that depends on the chosen key term selection method.
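To make the mixture-based idea concrete, the following is a minimal sketch rather than the chapter's actual method. It assumes unigram component models instead of full n-gram models, toy token lists in place of newspaper and newsgroup corpora, and EM re-estimation of the interpolation weights on a small in-domain sample; the summary does not say how the weights were obtained. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (illustrative only)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def mixture_prob(word, models, weights):
    """Linear interpolation: P(w) = sum_i lambda_i * P_i(w)."""
    return sum(lam * m[word] for lam, m in zip(weights, models))


def perplexity(tokens, models, weights):
    """Perplexity of the interpolated model on a token sequence."""
    log_sum = sum(math.log(mixture_prob(w, models, weights)) for w in tokens)
    return math.exp(-log_sum / len(tokens))


def em_weights(tokens, models, iterations=20):
    """Re-estimate mixture weights with EM on in-domain text (an assumed choice)."""
    k = len(models)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for w in tokens:
            denom = mixture_prob(w, models, weights)
            for i, m in enumerate(models):
                # Posterior probability that component i generated this token.
                expected[i] += weights[i] * m[w] / denom
        weights = [e / len(tokens) for e in expected]
    return weights


# Toy stand-ins for two topic clusters and a small in-domain sample.
news = "the market fell as the bank raised rates".split()
tv = "show me the tv program for tonight please".split()
dev = "show me the program for the tv tonight".split()

vocab = set(news) | set(tv) | set(dev)
models = [train_unigram(news, vocab), train_unigram(tv, vocab)]

uniform = [0.5, 0.5]
adapted = em_weights(dev, models)

print("uniform perplexity:", perplexity(dev, models, uniform))
print("adapted perplexity:", perplexity(dev, models, adapted))
print("adapted weights:", adapted)
```

In this toy setup the weight of the cluster that resembles the target domain grows, and the perplexity on the in-domain sample drops relative to uniform weights, which is the effect that domain-dependent mixture weights are meant to achieve.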




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Emele, M.C., Valsan, Z., Lam, Y.H., Goronzy, S. (2006). Class-Based Language Model Adaptation. In: Wahlster, W. (ed.) SmartKom: Foundations of Multimodal Dialogue Systems. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36678-4_7


  • DOI: https://doi.org/10.1007/3-540-36678-4_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23732-7

  • Online ISBN: 978-3-540-36678-2

  • eBook Packages: Computer Science, Computer Science (R0)
