Are Very Large Context-Free Grammars Tractable?

Boullier, Pierre; Sagot, Benoît

doi:10.1007/978-90-481-9352-3_12

Chapter
First Online: 01 January 2010

Part of the book series: Text, Speech and Language Technology (TLTB, volume 43)

Abstract

More and more often, in real-world natural language processing (NLP) applications based upon grammars, these grammars are no longer written by hand, but are automatically generated. This chapter will consider one of the consequences of this state of affairs: the generated grammars may be very large. Indeed, we aim to deal with grammars that might have over a million symbol occurrences and several hundred thousand rules. Traditional parsers are not usually prepared to handle them, either because these grammars are simply too big (the parser’s internal structures blow up) or because the time spent analyzing a sentence becomes prohibitive.

Notes

1.
In the general case, the original TAG cannot be considered a TIG, and therefore cannot be converted into a strongly equivalent CFG. However, an over-generating TIG can be trivially extracted from the TAG. Parsing with the corresponding CFG, using if necessary the techniques described in this chapter, provides an efficient guide to the TAG parser, in the sense of Boullier (2003).
2.
We may say that the canonical form of the empty reduced grammar is $(\{S\}, \emptyset, \emptyset, S)$ though the axiom S does not appear in any production.
3.
The popular notion of shared forest comes mainly from Billot and Lang (1989). This notion generalizes straightforwardly when the input sentence, as assumed below, is not a linear string but a word DAG (see Section 12.2.3).
4.
Instead of the classical definition without ε-transitions, we have chosen this definition for compatibility with our definition of ε-free DAGs (see Section 12.2.3). This difference in definitions has no impact on non-empty strings.
5.
We may say that the canonical form of the empty reduced FSA is $(\{q_0\}, \emptyset, \emptyset, q_0, \emptyset)$ though the initial state $q_0$ does not appear in any transition.
6.
This is particularly relevant in contexts such as speech processing, lexical ambiguity representation, ambiguous spelling correction, and others.
7.
This constraint is not included in the usual definition of DAGs, but it does not decrease their expressive power. It has some drawbacks that we shall not discuss here, but allows us to generalize usual parsing algorithms to DAGs with virtually no effort, as described in the remainder of this section.
8.
Where “ε-free” is to be understood according to the definition given above; if ε is in the language, the transition $(1,\varepsilon,f)$ is in δ.
9.
It can be shown that the previous check can be performed on $(G^c, w)$ in worst-case time ${\cal O} (|G^c| \times |\varSigma|^3)$ (recall that $|\varSigma| \leq n$). This time reduces to ${\cal O} (|G^c| \times |\varSigma|^2)$ if the input sentence is not a DAG but a string.
10.
This is equivalent to assuming the existence in the grammar of a super-production whose right-hand side has the form $\$\, S \,\$$.
11.
This statement no longer holds if we exclude from $P^c$ the productions that have been previously erased during the current a-filter. In that case, an empty set indicates that the production $Z \rightarrow \alpha U \beta$ can be erased.
12.
The homepage of Syntax is http://syntax.gforge.inria.fr/. Syntax is freely available at http://gforge.inria.fr/projects/syntax.
13.
The name $G^{T>N}$ comes from this construction process: the set T of terminal symbols of the original grammar has been turned into a subset of the non-terminal symbols N.
14.
We use classes of sentences for smoothing purposes: they address the sparse-data problem that arises when computing medians over sets of sentences of the same length. Indeed, given the modest size of the corpus we used, few sentences share any given length, especially for large lengths.
15.
As already defined in Section 12.2.1, the size of a CFG is the number of (terminal or non-terminal) symbol occurrences in the set of all productions of the grammar. For example, a grammar made up of n binary rules has size 3n.
16.
As seen above, inflected forms are directly terminal symbols of $G^{T>N}$, while $G^{TIG}$ uses a lexicon to map these inflected forms into its own terminal symbols, thereby possibly introducing lexical ambiguity.
17.
The measurements presented in this section were taken on a 1.7 GHz AMD Athlon PC with 1.5 GB of RAM running Linux. All parsers are written in C and were compiled with gcc 2.96 using the -O2 optimization flag.
18.
Contrary to classical Earley parsers, its predictor phase uses a pre-computed structure which is roughly an LC relation. Note that this feature forces our filters to compute an LC relation on the generated sub-grammar. It also shows that LC parsers may benefit from our filtering techniques.
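
The size measure recalled in note 15 can be illustrated with a minimal sketch. The representation of a production as a pair (left-hand side, list of right-hand-side symbols) and the name `grammar_size` are assumptions made for this illustration only; they are not taken from the chapter or from the Syntax system.

```python
from typing import List, Tuple

# Hypothetical representation: a production is a pair
# (left-hand-side symbol, list of right-hand-side symbols).
Production = Tuple[str, List[str]]


def grammar_size(productions: List[Production]) -> int:
    """Count symbol occurrences over all productions.

    Each production contributes one occurrence for its left-hand side
    plus one per right-hand-side symbol, so a binary rule counts for 3.
    """
    return sum(1 + len(rhs) for _lhs, rhs in productions)


# Three binary rules: the size is 3 * 3 = 9, matching the 3n of note 15.
binary_rules: List[Production] = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "N"]),
    ("VP", ["V", "NP"]),
]

if __name__ == "__main__":
    print(grammar_size(binary_rules))
```

Under this measure, a grammar with over a million symbol occurrences and several hundred thousand rules, as considered in the abstract, averages only a few symbols per rule: the difficulty comes from the number of rules rather than from their length.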

References

Berwick, R.C. and A.S. Weinberg (1984, May). The Grammatical Basis of Linguistic Performance – Language Use and Acquisition. Cambridge, MA: MIT Press.
Billot, S. and B. Lang (1989). The structure of shared forests in ambiguous parsing. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, pp. 143–151.
Boullier, P. (2000). On TAG parsing. Traitement Automatique des Langues 41(3), 759–793.
Boullier, P. (2003). Guided Earley parsing. In Proceedings of the 7th International Workshop on Parsing Technologies, Nancy, pp. 43–54.
Boullier, P. and B. Sagot (2005). Efficient and robust LFG parsing: sxlfg. In Proceedings of the 9th International Workshop on Parsing Technologies, Vancouver, pp. 1–10.
Earley, J. (1970). An efficient context-free parsing algorithm. Communications of the ACM 13(2), 94–102.
Hopcroft, J.E. and J.D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.
Joshi, A. (1997). Parsing techniques. In Survey of the State of the Art in Human Language Technology, pp. 351–356. New York, NY: Cambridge University Press.
Kasami, T. (1967). An efficient recognition and syntax algorithm for context-free languages. Scientific Report AFCRL-65-758. Bedford, MA: Air Force Cambridge Research Laboratory.
Moore, R.C. (2000). Improved left-corner chart parsing for large context-free grammars. In Proceedings of the 6th International Workshop on Parsing Technologies, Trento, Italy, pp. 171–182. Revised version at http://www.cogs.susx.ac.uk/lab/nlp/carroll/cfg-resources/iwpt2000-rev2.ps
Nederhof, M.-J. (1993). Generalized left-corner parsing. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics, Morristown, NJ, pp. 305–314.
Nederhof, M.-J. and G. Satta (2000). Left-to-right parsing and bilexical context-free grammars. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, San Francisco, CA, pp. 272–279. Morgan Kaufmann Publishers Inc.
Sagot, B. and P. Boullier (2005). From raw corpus to word lattices: robust pre-parsing processing. In Proceedings of the 2nd Language and Technology Conference, Poznań, pp. 348–351.
Sagot, B., L. Clément, E. Villemonte de La Clergerie, and P. Boullier (2006). The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, pp. 1348–1351.
Satta, G. (1992). Review of “Generalized LR parsing” by Masaru Tomita. Kluwer Academic Publishers 1991. Computational Linguistics 18(3), 377–381.
Satta, G. and O. Stock (1994). Bidirectional context-free grammar parsing for natural language processing. Artificial Intelligence 69(1–2), 123–164.
Schabes, Y., A. Abeillé, and A. Joshi (1988). Parsing strategies with ‘lexicalized’ grammars: application to tree adjoining grammars. In Proceedings of the 12th International Conference on Computational Linguistics, Budapest, pp. 578–583.
Schabes, Y. and R.C. Waters (1995). Tree insertion grammar: cubic-time, parsable formalism that lexicalizes context-free grammar without changing the trees produced. Computational Linguistics 21(4), 479–513.
van Noord, G. (1997). An efficient implementation of the head-corner parser. Computational Linguistics 23(3), 425–456.
Villemonte de La Clergerie, E. (2005). From metagrammars to factorized TAG/TIG parsers. In Proceedings of the 9th International Workshop on Parsing Technologies, Vancouver, pp. 190–191.
Younger, D.H. (1967). Recognition and parsing of context-free languages in time $n^3$. Information and Control 10(2), 189–208.

Author information

Authors and Affiliations

INRIA-Rocquencourt, Domaine de Voluceau, 78153, Le Chesnay Cedex, France
Pierre Boullier & Benoît Sagot

Corresponding author

Correspondence to Pierre Boullier.

Editor information

Editors and Affiliations

Tilburg University, Warandelaan 2, Tilburg, 5000 LE, Netherlands
Harry Bunt
Dépt. Linguistique, Université de Genève, rue de Candolle 2, Genève, 1211, Switzerland
Paola Merlo
Pimpstensvägen 16, Uppsala, 752 67, Sweden
Joakim Nivre

About this chapter

Cite this chapter

Boullier, P., Sagot, B. (2010). Are Very Large Context-Free Grammars Tractable? In: Bunt, H., Merlo, P., Nivre, J. (eds) Trends in Parsing Technology. Text, Speech and Language Technology, vol 43. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9352-3_12

DOI: https://doi.org/10.1007/978-90-481-9352-3_12
Published: 29 September 2010
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9351-6
Online ISBN: 978-90-481-9352-3
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
