
Neural Probabilistic Language Models

Chapter in Innovations in Machine Learning

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 194)

Abstract

A central goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on several methods to speed up both training and probability computation, as well as comparative experiments to evaluate the improvements brought by these techniques. We finally describe the incorporation of this new language model into a state-of-the-art speech recognizer of conversational speech.
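To make the architecture the abstract refers to concrete, the following is a minimal sketch of a feed-forward neural language model of this kind: each context word is mapped to a learned feature vector, the concatenated vectors feed a non-linear hidden layer, and a softmax over the whole vocabulary yields the probability of the next word. All dimensions, variable names, and the use of NumPy are illustrative assumptions for this sketch, not details taken from the chapter; the softmax over the full vocabulary is the costly step that the speed-up methods mentioned above are aimed at.

```python
# Minimal sketch of a Bengio-style neural probabilistic language model forward pass.
# Vocabulary size, embedding size, context length and hidden size are illustrative
# assumptions, not values from the chapter.
import numpy as np

V, d, n, h = 10_000, 60, 3, 100          # vocab size, embedding dim, context words, hidden units
rng = np.random.default_rng(0)

C = rng.normal(0, 0.01, (V, d))          # shared word-feature (embedding) matrix
H = rng.normal(0, 0.01, (h, n * d))      # hidden-layer weights
b_h = np.zeros(h)                        # hidden-layer bias
U = rng.normal(0, 0.01, (V, h))          # output weights (hidden -> vocabulary scores)
b_o = np.zeros(V)                        # output bias

def next_word_probs(context_ids):
    """P(w_t | w_{t-n}, ..., w_{t-1}) for every word in the vocabulary."""
    x = C[context_ids].reshape(-1)       # concatenate the context word embeddings
    a = np.tanh(H @ x + b_h)             # non-linear hidden layer
    scores = U @ a + b_o                 # one score per vocabulary word
    scores -= scores.max()               # numerical stability before softmax
    p = np.exp(scores)
    return p / p.sum()                   # normalizing over all V words is the expensive step

# Example: probability distribution over the next word given a 3-word context.
probs = next_word_probs([12, 345, 678])
print(probs.shape, probs.sum())          # (10000,) 1.0
```

In training, all parameters (including the embedding matrix C) would be learned jointly by maximizing the log-likelihood of observed next words; this sketch only shows the probability computation.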




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bengio, Y., Schwenk, H., Senécal, JS., Morin, F., Gauvain, JL. (2006). Neural Probabilistic Language Models. In: Holmes, D.E., Jain, L.C. (eds) Innovations in Machine Learning. Studies in Fuzziness and Soft Computing, vol 194. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33486-6_6


  • DOI: https://doi.org/10.1007/3-540-33486-6_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30609-2

  • Online ISBN: 978-3-540-33486-6

  • eBook Packages: Engineering, Engineering (R0)
