Optimizing Local Probability Models for Statistical Parsing

  • Kristina Toutanova
  • Mark Mitchell
  • Christopher D. Manning
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2837)


This paper studies the properties and performance of models for estimating local probability distributions that serve as components of larger probabilistic systems, specifically history-based generative parsing models. We report experimental results showing that memory-based learning outperforms many commonly used methods for this task (Witten-Bell, Jelinek-Mercer with fixed weights, decision trees, and log-linear models). We nevertheless connect these results to the widely used general class of deleted interpolation models by showing that certain types of memory-based learning, including the kind that performed so well in our experiments, are instances of this class. In addition, we illustrate the divergences between joint data likelihood, conditional data likelihood, and accuracy achieved by such models, suggesting that smoothing based on optimizing accuracy directly might greatly improve performance.
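As a rough illustration of two of the baseline smoothing methods named in the abstract, the following sketch estimates a local conditional distribution P(outcome | context) by linearly interpolating relative-frequency estimates from progressively richer contexts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the back-off order (dropping the last conditioning feature first), the uniform base distribution, and the fixed weight `lam=0.7` are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def train_counts(events):
    """Count outcomes for every prefix of each context.

    events: iterable of (context_tuple, outcome) pairs.
    Assumption: back-off drops conditioning features from the right.
    """
    counts = defaultdict(Counter)
    for context, outcome in events:
        for k in range(len(context) + 1):
            counts[context[:k]][outcome] += 1
    return counts


def jelinek_mercer(counts, context, outcome, vocab, lam=0.7):
    """Linear interpolation with one fixed weight at every level."""
    p = 1.0 / len(vocab)  # uniform base distribution (assumption)
    for k in range(len(context) + 1):
        c = counts.get(context[:k])
        if c:  # unseen contexts keep the backed-off estimate
            total = sum(c.values())
            p = lam * c[outcome] / total + (1 - lam) * p
    return p


def witten_bell(counts, context, outcome, vocab):
    """Interpolation whose weight depends on the context's counts.

    The weight on the relative-frequency estimate is N / (N + T), where
    N is the number of tokens and T the number of distinct outcome
    types observed in that context.
    """
    p = 1.0 / len(vocab)  # uniform base distribution (assumption)
    for k in range(len(context) + 1):
        c = counts.get(context[:k])
        if c:
            total = sum(c.values())
            lam = total / (total + len(c))
            p = lam * c[outcome] / total + (1 - lam) * p
    return p


# Toy data: contexts are (parent, sibling) labels, outcomes are labels.
events = [(("NP", "VP"), "a"), (("NP", "VP"), "a"), (("NP", "PP"), "b")]
vocab = {"a", "b", "c"}
counts = train_counts(events)
```

Both estimators assign nonzero probability to outcomes never seen in the given context, and each distribution sums to one over `vocab` provided every training outcome is in `vocab`. The only difference between them here is how the interpolation weight is chosen: fixed in advance, or derived from the counts of each context.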


Keywords: Parse Tree, Derivation Tree, Computational Linguistics, Word Error Rate, Conditional Likelihood



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Kristina Toutanova (1)
  • Mark Mitchell (2)
  • Christopher D. Manning (1)

  1. Computer Science Department, Stanford University, Stanford, USA
  2. CSLI, Stanford University, Stanford, USA
