Treebanks pp 333-349

Extracting Stochastic Grammars from Treebanks

  • Rens Bod
Part of the Text, Speech and Language Technology book series (TLTB, volume 20)


The Data-Oriented Parsing (DOP) model employs an annotated corpus or treebank directly as a stochastic grammar. New input is parsed by combining subtrees from the treebank. The most probable analysis is estimated on the basis of the occurrence-frequencies of the treebank-subtrees. The model as originally defined imposes no constraints on the size and complexity of the subtrees that may be invoked in parsing new input. Both from a theoretical and from a computational perspective we may therefore wonder whether it is possible to impose constraints on the subtrees that are used, in such a way that the performance of the model does not deteriorate or perhaps even improves. That is the main question addressed in the current paper. Moreover, by imposing different constraints on the subtree set, we can simulate several other stochastic grammars, ranging from stochastic context-free grammars to stochastic lexicalized grammars, thus allowing for a proper performance comparison. Experiments with the ATIS and Wall Street Journal treebanks indicate that very few constraints on the treebank-subtrees are warranted. We conclude with a brief discussion of the consequences of our results.
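The idea of reading subtrees and their frequencies off a treebank, and of constraining the subtree set, can be illustrated with a small sketch. It is not the chapter's implementation: the tuple encoding of trees, the two-sentence toy treebank, and the names `fragments`, `internal_nodes` and `fragment_probabilities` are assumptions made for illustration. The probability of a subtree is taken here to be its relative frequency among subtrees with the same root label, which is the estimator usually associated with DOP; the abstract itself only says that occurrence frequencies are used. The only constraint shown is a bound on subtree depth.

```python
from collections import Counter
from itertools import product

# Trees are nested tuples: (label, child, child, ...). A child is either another
# tuple (a nonterminal node) or a str (a word). A one-element tuple such as
# ("NP",) stands for a frontier nonterminal, i.e. an open substitution site.
# This toy treebank is invented for illustration only.
TOY_TREEBANK = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "he"), ("VP", ("V", "sleeps"))),
]


def fragments(node, max_depth):
    """Yield every subtree rooted at `node` with depth at most `max_depth`.

    Each nonterminal child is either cut off (left as a frontier nonterminal)
    or expanded further; a node keeps all of its children or none of them.
    """
    if max_depth < 1:
        return
    label, children = node[0], node[1:]
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])            # a word is always kept
        else:
            choices = [(child[0],)]            # cut: frontier nonterminal
            choices.extend(fragments(child, max_depth - 1))
            options.append(choices)
    for combination in product(*options):
        yield (label, *combination)


def internal_nodes(tree):
    """Yield every nonterminal node of `tree`, root first."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from internal_nodes(child)


def fragment_probabilities(treebank, max_depth):
    """Relative frequency of each subtree among subtrees with the same root."""
    counts = Counter(
        frag
        for tree in treebank
        for node in internal_nodes(tree)
        for frag in fragments(node, max_depth)
    )
    root_totals = Counter()
    for frag, count in counts.items():
        root_totals[frag[0]] += count
    return {frag: count / root_totals[frag[0]] for frag, count in counts.items()}


# Depth 1 keeps only one-level subtrees, which correspond to context-free rules.
depth_one = fragment_probabilities(TOY_TREEBANK, max_depth=1)
# A bound of 3 is at least the height of every toy tree, so nothing is excluded.
unrestricted = fragment_probabilities(TOY_TREEBANK, max_depth=3)

print(len(depth_one), "subtrees at depth 1;", len(unrestricted), "without a bound")
```

With `max_depth=1` the extracted set behaves like a stochastic context-free grammar read off the treebank, while larger bounds admit the multi-level subtrees that distinguish DOP. Other constraints mentioned in the abstract, such as lexicalization, would need additional filters that this sketch does not attempt.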


Keywords: Data-oriented parsing; Corpus-based grammars; Stochastic grammars





Copyright information

© Springer Science+Business Media Dordrecht 2003

Authors and Affiliations

  • Rens Bod (1, 2)
  1. School of Computing, University of Leeds, Leeds, UK
  2. Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands
