Towards High Speed Grammar Induction on Large Text Corpora

  • Pieter Adriaans
  • Marten Trautwein
  • Marco Vervoort
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1963)


In this paper we describe an efficient and scalable implementation of grammar induction based on the EMILE approach [2, 3, 4, 5, 6]. The current implementation, EMILE 4.1 [11], is one of the first efficient grammar induction algorithms that work on free text. Although EMILE 4.1 is far from perfect, it enables researchers to carry out empirical grammar induction research on various types of corpora.

The EMILE approach is based on notions from categorial grammar (cf. [10]), which is known to generate the class of context-free languages. EMILE learns from positive examples only (cf. [1], [7], [9]). We describe the algorithms underlying the approach and some interesting practical results on small and large text collections. As shown in the articles cited above, EMILE learns the correct grammatical structure of a language in the limit from sentences of that language. The experiments we conducted show that, in practice, EMILE 4.1 is efficient and scalable. The current implementation learns a subclass of the shallow context-free languages, a subclass that seems sufficiently rich to be of practical interest. EMILE in particular seems to be a valuable tool for the syntactic and semantic analysis of large text corpora.
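The core idea behind EMILE-style induction can be illustrated with a small sketch: expressions that occur in the same contexts are grouped into candidate grammatical types, in the spirit of categorial grammar. The sketch below is an illustration of this general principle only, under our own simplifying assumptions (whole-word contexts, exhaustive enumeration); it is not the EMILE 4.1 implementation, and the function names are hypothetical.

```python
from collections import defaultdict

def contexts(sentence):
    """Yield (context, expression) pairs: a context is the sentence with one
    contiguous slice of words replaced by a hole ('_')."""
    words = sentence.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            ctx = tuple(words[:i]) + ("_",) + tuple(words[j:])
            yield ctx, tuple(words[i:j])

def candidate_types(corpus):
    """Group expressions by shared contexts; expressions that are
    interchangeable in a context are candidates for the same type."""
    by_context = defaultdict(set)
    for sentence in corpus:
        for ctx, expr in contexts(sentence):
            by_context[ctx].add(expr)
    # keep only contexts in which more than one expression occurs
    return {ctx: exprs for ctx, exprs in by_context.items() if len(exprs) > 1}

corpus = ["John likes Mary", "Peter likes Mary", "John likes Susan"]
types = candidate_types(corpus)
# e.g. the context ('_', 'likes', 'Mary') groups ('John',) and ('Peter',)
```

A realistic implementation must of course be far more selective than this exhaustive enumeration; the scalability of EMILE 4.1 comes precisely from efficient clustering of such context/expression evidence.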


Keywords: Free Text, Characteristic Expression, Categorial Grammar, Perfect Sample, Ambiguous Context
(These keywords were added by machine and not by the authors.)




  1. N. Abe, Learnability and locality of formal grammars, in Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, 1988.
  2. P. W. Adriaans, Language Learning from a Categorial Perspective, PhD thesis, University of Amsterdam, 1992.
  3. P. W. Adriaans, Bias in Inductive Language Learning, in Proceedings of the ML92 Workshop on Biases in Inductive Learning, Aberdeen, 1992.
  4. P. W. Adriaans, Learning Shallow Context-Free Languages under Simple Distributions, ILLC Research Report PP-1999-13, Institute for Logic, Language and Computation, Amsterdam, 1999.
  5. P. W. Adriaans, S. Janssen, E. Nomden, Effective identification of semantic categories in curriculum texts by means of cluster analysis, in Workshop Notes on Machine Learning Techniques for Text Analysis, Vienna, 1993.
  6. P. W. Adriaans, A. K. Knobbe, EMILE: Learning Context-Free Grammars from Examples, in Proceedings of BENELEARN’96, 1996.
  7. W. Buszkowski, G. Penn, Categorial Grammars Determined from Linguistic Data by Unification, Technical Report 89-05, The University of Chicago, June 1989.
  8. E. Dörnenburg, Extension of the EMILE Algorithm for Inductive Learning of Context-Free Grammars for Natural Languages, Master’s thesis, University of Dortmund, 1997.
  9. M. Kanazawa, Learnable Classes of Categorial Grammars, PhD thesis, Stanford University, 1994.
  10. R. Oehrle, E. Bach, D. Wheeler (Eds.), Categorial Grammars and Natural Language Structures, D. Reidel Publishing Company, Dordrecht, 1988.
  11. M. R. Vervoort, Games, Walks and Grammars: Problems I’ve Worked On, PhD thesis, University of Amsterdam, 2000.

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Pieter Adriaans (1, 2)
  • Marten Trautwein (1)
  • Marco Vervoort (2)
  1. Perot Systems Nederland BV, Amersfoort, The Netherlands
  2. FdNWI, University of Amsterdam, Amsterdam, The Netherlands