Are Very Large Context-Free Grammars Tractable?

Part of the book series: Text, Speech and Language Technology (TLTB, volume 43)

Abstract

Increasingly, in real-world natural language processing (NLP) applications based upon grammars, these grammars are no longer written by hand but are automatically generated. This chapter considers one consequence of this state of affairs: the generated grammars may be very large. Indeed, we aim to deal with grammars that may have over a million symbol occurrences and several hundred thousand rules. Traditional parsers are usually not prepared to handle them, either because these grammars are simply too big (the parser's internal structures blow up) or because the time spent analyzing a sentence becomes prohibitive.

Notes

  1.

    In the general case, the original TAG cannot be considered a TIG, and therefore cannot be converted into a strongly equivalent CFG. However, an over-generating TIG can be trivially extracted from the TAG. Parsing with the corresponding CFG, using the techniques described in this chapter where necessary, provides an efficient guide for the TAG parser, in the sense of Boullier (2003).

  2.

    We may say that the canonical form of the empty reduced grammar is \((\{S\}, \emptyset, \emptyset, S)\) though the axiom S does not appear in any production.

  3.

    The popular notion of shared forest mainly comes from Billot and Lang (1989). This notion generalizes straightforwardly when the input sentence, as assumed below, is not a linear string but a word DAG (see Section 12.2.3).

  4.

    Instead of the classical definition without ε-transitions, we have chosen this definition for compatibility with our definition of ε-free DAGs (see Section 12.2.3). This difference in definitions has no impact on non-empty strings.

  5.

    We may say that the canonical form of the empty reduced FSA is \((\{q_0\}, \emptyset, \emptyset, q_0, \emptyset)\) though the initial state \(q_0\) does not appear in any transition.

  6.

    This is particularly relevant in contexts such as speech processing, lexical ambiguity representation, ambiguous spelling correction, and others.

  7.

    This constraint is not included in the usual definition of DAGs, but it does not decrease their expressive power. It has some drawbacks that we shall not discuss here, but allows us to generalize usual parsing algorithms to DAGs with virtually no effort, as described in the remainder of this section.

  8.

    Where “ε-free” is to be understood according to the definition given above; if ε is in the language, the transition \((1,\varepsilon,f)\) is in δ.

  9.

    It can be shown that the previous check can be performed on \((G^c, w)\) in worst-case time \({\cal O} (|G^c| \times |\varSigma|^3)\) (recall that \(|\varSigma| \leq n\)). This time reduces to \({\cal O} (|G^c| \times |\varSigma|^2)\) if the input sentence is not a DAG but a string.

  10.

    This is equivalent to assuming the existence in the grammar of a super-production whose right-hand side has the form $ S $.

  11.

    This statement no longer holds if we exclude from \(P^c\) the productions that have been previously erased during the current a-filter. In that case, an empty set indicates that the production \(Z \rightarrow \alpha U \beta\) can be erased.

  12.

    The homepage of Syntax is http://syntax.gforge.inria.fr/. Syntax is freely available at http://gforge.inria.fr/projects/syntax

  13.

    The name \(G^{T>N}\) comes from this construction process: the set T of terminal symbols of the original grammar has been turned into a subset of the non-terminal symbols N.

  14.

    We use classes of sentences for smoothing purposes: they address the sparse-data problem that arises when computing medians over sets of sentences of the same length. Indeed, given the modest size of the corpus we used, few sentences share any given length, especially among long sentences.

  15.

    As already defined in Section 12.2.1, the size of a CFG is the number of (terminal or non-terminal) symbol occurrences in the set of all productions of the grammar. For example, a grammar made up of n binary rules has size 3n.

  16.

    As seen above, inflected forms are directly terminal symbols of \(G^{T>N}\), while \(G^{TIG}\) uses a lexicon to map these inflected forms into its own terminal symbols, thereby possibly introducing lexical ambiguity.

  17.

    The measurements presented in this section were taken on a 1.7 GHz AMD Athlon PC with 1.5 GB of RAM running Linux. All parsers are written in C and were compiled with gcc 2.96 using the -O2 optimization flag.

  18.

    Contrary to classical Earley parsers, its predictor phase uses a pre-computed structure which is roughly an LC relation. Note that this feature forces our filters to compute an LC relation on the generated sub-grammar. It also shows that LC parsers may benefit from our filtering techniques.
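
The size measure recalled in note 15 (the number of symbol occurrences over all productions, so that n binary rules give size 3n) can be illustrated with a minimal Python sketch. The representation of a production as a pair (left-hand side, tuple of right-hand-side symbols), the function name `grammar_size`, and the toy rules are assumptions made for illustration only; they are not taken from the chapter or from the Syntax system.

```python
from typing import List, Tuple

# Hypothetical representation: a production is (lhs, rhs), with rhs a tuple of symbols.
Production = Tuple[str, Tuple[str, ...]]


def grammar_size(productions: List[Production]) -> int:
    """Count symbol occurrences: one for each left-hand side,
    plus one for every symbol on the right-hand side."""
    return sum(1 + len(rhs) for _lhs, rhs in productions)


# Three binary rules (toy example): each contributes 3 symbol occurrences.
binary_rules: List[Production] = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
]

print(grammar_size(binary_rules))  # 3 rules x 3 occurrences each = 9
```

Under this measure an ε-production contributes 1 (its left-hand side only), and a grammar with no productions has size 0.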

References

  • Berwick, R.C. and A.S. Weinberg (1984, May). The Grammatical Basis of Linguistic Performance – Language Use and Acquisition. Cambridge, MA: MIT Press.

  • Billot, S. and B. Lang (1989). The structure of shared forests in ambiguous parsing. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, pp. 143–151.

  • Boullier, P. (2000). On TAG parsing. Traitement Automatique des Langues 41(3), 759–793.

  • Boullier, P. (2003). Guided Earley parsing. In Proceedings of the 7th International Workshop on Parsing Technologies, Nancy, pp. 43–54.

  • Boullier, P. and B. Sagot (2005). Efficient and robust LFG parsing: sxlfg. In Proceedings of the 9th International Workshop on Parsing Technologies, Vancouver, pp. 1–10.

  • Earley, J. (1970). An efficient context-free parsing algorithm. Communications of the ACM 13(2), 94–102.

  • Hopcroft, J.E. and J.D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.

  • Joshi, A. (1997). Parsing techniques. In Survey of the State of the Art in Human Language Technology, pp. 351–356. New York, NY: Cambridge University Press.

  • Kasami, T. (1967). An efficient recognition and syntax algorithm for context-free languages. Scientific Report AFCRL-65–758. Bedford, MA: Air Force Cambridge Research Laboratory.

  • Moore, R.C. (2000). Improved left-corner chart parsing for large context-free grammars. In Proceedings of the 6th International Workshop on Parsing Technologies, Trento, Italy, pp. 171–182. Revised version at http://www.cogs.susx.ac.uk/lab/nlp/carroll/cfg-resources/iwpt2000-rev2.ps

  • Nederhof, M.-J. (1993). Generalized left-corner parsing. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics, Morristown, NJ, pp. 305–314.

  • Nederhof, M.-J. and G. Satta (2000). Left-to-right parsing and bilexical context-free grammars. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, San Francisco, CA, pp. 272–279. Morgan Kaufmann Publishers Inc.

  • Sagot, B. and P. Boullier (2005). From raw corpus to word lattices: robust pre-parsing processing. In Proceedings of the 2nd Language and Technology Conference, Poznań, pp. 348–351.

  • Sagot, B., L. Clément, E. Villemonte de La Clergerie, and P. Boullier (2006). The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, pp. 1348–1351.

  • Satta, G. (1992). Review of “Generalized LR parsing” by Masaru Tomita. Kluwer Academic Publishers 1991. Computational Linguistics 18(3), 377–381.

  • Satta, G. and O. Stock (1994). Bidirectional context-free grammar parsing for natural language processing. Artificial Intelligence 69(1–2), 123–164.

  • Schabes, Y., A. Abeillé, and A. Joshi (1988). Parsing strategies with ‘lexicalized’ grammars: application to tree adjoining grammars. In Proceedings of the 12th International Conference on Computational Linguistics, Budapest, pp. 578–583.

  • Schabes, Y. and R.C. Waters (1995). Tree insertion grammar: a cubic-time, parsable formalism that lexicalizes context-free grammar without changing the trees produced. Computational Linguistics 21(4), 479–513.

  • van Noord, G. (1997). An efficient implementation of the head-corner parser. Computational Linguistics 23(3), 425–456.

  • Villemonte de La Clergerie, E. (2005). From metagrammars to factorized TAG/TIG parsers. In Proceedings of the 9th International Workshop on Parsing Technologies, Vancouver, pp. 190–191.

  • Younger, D.H. (1967). Recognition and parsing of context-free languages in time \(n^3\). Information and Control 10(2), 189–208.

Author information

Corresponding author

Correspondence to Pierre Boullier.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Boullier, P., Sagot, B. (2010). Are Very Large Context-Free Grammars Tractable? In: Bunt, H., Merlo, P., Nivre, J. (eds) Trends in Parsing Technology. Text, Speech and Language Technology, vol 43. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9352-3_12
