Towards High Speed Grammar Induction on Large Text Corpora

Adriaans, Pieter; Trautwein, Marten; Vervoort, Marco

doi:10.1007/3-540-44411-4_11


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1963)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Computer Science


Abstract

In this paper we describe an efficient and scalable implementation for grammar induction based on the EMILE approach [2, 3, 4, 5, 6]. The current EMILE 4.1 implementation [11] is one of the first efficient grammar induction algorithms that work on free text. Although EMILE 4.1 is far from perfect, it enables researchers to do empirical grammar induction research on various types of corpora.

The EMILE approach is based on notions from categorial grammar (cf. [10]), which is known to generate the class of context-free languages. EMILE learns from positive examples only (cf. [1], [7], [9]). We describe the algorithms underlying the approach and some interesting practical results on small and large text collections. As shown in the articles cited above, EMILE learns, in the limit, the correct grammatical structure of a language from sentences of that language. The experiments we conducted show that, in practice, EMILE 4.1 is efficient and scalable. The current implementation learns a subclass of the shallow context-free languages, which seems sufficiently rich to be of practical interest. In particular, EMILE seems to be a valuable tool for the syntactic and semantic analysis of large text corpora.
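As a rough illustration of what learning from positive examples only can look like, the sketch below splits each sentence into (context, expression) pairs and checks which expressions share contexts. This is an assumption about the general flavour of distributional grammar induction, not the EMILE 4.1 algorithm itself, which the abstract does not spell out. The function names and the toy corpus are hypothetical.

```python
from collections import defaultdict


def context_expression_pairs(sentence):
    """Yield (context, expression) pairs for every contiguous, non-empty
    word span of the sentence. A context is (words before, words after)."""
    words = sentence.split()
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            expression = tuple(words[i:j])
            context = (tuple(words[:i]), tuple(words[j:]))
            yield context, expression


def build_table(sentences):
    """Map each expression to the set of contexts it was observed in."""
    contexts_of = defaultdict(set)
    for sentence in sentences:
        for context, expression in context_expression_pairs(sentence):
            contexts_of[expression].add(context)
    return contexts_of


def shared_contexts(table, a, b):
    """Return the contexts in which both expressions a and b occur."""
    return table[tuple(a.split())] & table[tuple(b.split())]


# Toy corpus of positive examples only (hypothetical data).
corpus = ["John loves Mary", "John loves Sue", "Mary walks", "Sue walks"]
table = build_table(corpus)
common = shared_contexts(table, "Mary", "Sue")
```

Here "Mary" and "Sue" occur in the same contexts, which is the kind of evidence a distributional learner could use to place them in one grammatical type. A practical system would also need to tolerate sparse and noisy data, which this sketch ignores.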

References

[1] N. Abe, Learnability and locality of formal grammars, in Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, 1988.
[2] P. W. Adriaans, Language Learning from a Categorial Perspective, PhD thesis, University of Amsterdam, 1992.
[3] P. W. Adriaans, Bias in Inductive Language Learning, in Proceedings of the ML92 Workshop on Biases in Inductive Learning, Aberdeen, 1992.
[4] P. W. Adriaans, Learning Shallow Context-Free Languages under Simple Distributions, ILLC Research Report PP-1999-13, Institute for Logic, Language and Computation, Amsterdam, 1999.
[5] P. W. Adriaans, S. Janssen, E. Nomden, Effective identification of semantic categories in curriculum texts by means of cluster analysis, in Workshop Notes on Machine Learning Techniques for Text Analysis, Vienna, 1993.
[6] P. W. Adriaans, A. K. Knobbe, EMILE: Learning Context-free Grammars from Examples, in Proceedings of BENELEARN’96, 1996.
[7] W. Buszkowski, G. Penn, Categorial Grammars Determined from Linguistic Data by Unification, The University of Chicago, Technical Report 89–05, June 1989.
[8] E. Dörnenburg, Extension of the EMILE algorithm for inductive learning of context-free grammars for natural languages, Master’s Thesis, University of Dortmund, 1997.
[9] M. Kanazawa, Learnable Classes of Categorial Grammars, PhD thesis, University of Stanford, 1994.
[10] R. Oehrle, E. Bach, D. Wheeler (Eds.), Categorial Grammars and Natural Language Structures, D. Reidel Publishing Company, Dordrecht, 1988.
[11] M. R. Vervoort, Games, Walks and Grammars: Problems I’ve Worked On, PhD thesis, University of Amsterdam, 2000.


Author information

Authors and Affiliations

Perot Systems Nederland BV, P.O.Box 2729, NL-3800 GG, Amersfoort, The Netherlands
Pieter Adriaans & Marten Trautwein
FdNWI, University of Amsterdam, Plantage Muidergracht 24, NL-1018 TV, Amsterdam, The Netherlands
Pieter Adriaans & Marco Vervoort


Editor information

Editors and Affiliations

Department of Cybernetics, Czech Technical University, Karlovo nám. 13, 121 35, Prague, Czech Republic
Václav Hlaváč
Information Technology Department, CLRC RAL, Chilton, Didcot, Oxfordshire, UK
Keith G. Jeffery
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 2, 182 07, Prague, Czech Republic
Jiří Wiedermann

About this paper

Cite this paper

Adriaans, P., Trautwein, M., Vervoort, M. (2000). Towards High Speed Grammar Induction on Large Text Corpora. In: Hlaváč, V., Jeffery, K.G., Wiedermann, J. (eds) SOFSEM 2000: Theory and Practice of Informatics. SOFSEM 2000. Lecture Notes in Computer Science, vol 1963. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44411-4_11

DOI: https://doi.org/10.1007/3-540-44411-4_11
Published: 22 January 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41348-6
Online ISBN: 978-3-540-44411-4
eBook Packages: Springer Book Archive
