An Extendible Regular Expression Compiler for Finite-State Approaches in Natural Language Processing
Finite-state techniques are widely used in various areas of Natural Language Processing (NLP). As Kaplan and Kay have argued, regular expressions are the appropriate level of abstraction for thinking about finite-state languages and finite-state relations. More complex finite-state operations (such as contexted replacement) are defined on the basis of basic operations (such as Kleene closure, complementation, composition).
To make it possible to experiment with such complex finite-state operations, the FSA Utilities (version 5) provides an extendible regular expression compiler. The paper discusses the regular expression operations provided by the compiler, and the possibilities for creating new regular expression operators. The benefits of such an extendible regular expression compiler are illustrated with a number of examples taken from recent publications in the area of finite-state approaches to NLP.
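The idea of defining new operators on top of a small set of basic ones can be illustrated with a minimal Python sketch. It is not the FSA Utilities compiler and does not use its syntax: the tuple encoding, the `MACROS` table and the operator names `option`, `contains` and `free` are all hypothetical, chosen for illustration. Instead of compiling to automata, the sketch tests membership with Brzozowski derivatives, a swapped-in technique that keeps the example short while still supporting complementation.

```python
# Basic operators understood by the matcher; anything else must be a macro.
BASIC = {"empty", "eps", "sym", "any", "cat", "alt", "star", "not"}

# Hypothetical user-defined operators, each expressed via other operators.
MACROS = {
    "option": lambda e: ("alt", e, ("eps",)),
    "contains": lambda e: ("cat", ("star", ("any",)),
                           ("cat", e, ("star", ("any",)))),
    "free": lambda e: ("not", ("contains", e)),
}

def expand(term):
    """Rewrite derived operators until only basic operators remain."""
    op, *args = term
    if op in MACROS:
        return expand(MACROS[op](*args))
    if op not in BASIC:
        raise ValueError(f"unknown operator: {op}")
    return (op, *(expand(a) if isinstance(a, tuple) else a for a in args))

def nullable(r):
    """True if the language of r contains the empty string."""
    op = r[0]
    if op in ("eps", "star"):
        return True
    if op in ("empty", "sym", "any"):
        return False
    if op == "cat":
        return nullable(r[1]) and nullable(r[2])
    if op == "alt":
        return nullable(r[1]) or nullable(r[2])
    return not nullable(r[1])  # op == "not"

def deriv(r, c):
    """Brzozowski derivative: the strings w such that c + w is in r."""
    op = r[0]
    if op in ("empty", "eps"):
        return ("empty",)
    if op == "sym":
        return ("eps",) if r[1] == c else ("empty",)
    if op == "any":
        return ("eps",)
    if op == "cat":
        left = ("cat", deriv(r[1], c), r[2])
        return ("alt", left, deriv(r[2], c)) if nullable(r[1]) else left
    if op == "alt":
        return ("alt", deriv(r[1], c), deriv(r[2], c))
    if op == "star":
        return ("cat", deriv(r[1], c), r)
    return ("not", deriv(r[1], c))  # op == "not"

def matches(term, text):
    """Expand macros, then consume text one symbol at a time."""
    r = expand(term)
    for c in text:
        r = deriv(r, c)
    return nullable(r)
```

For example, `matches(("free", ("sym", "a")), "xyz")` asks whether `xyz` belongs to the language of strings that do not contain `a`. Registering another entry in `MACROS` adds an operator without touching the matcher, which is the kind of extensibility the abstract describes. The derivative terms are not simplified, so this sketch is only suitable for short inputs.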
Keywords: Optimality Theory; Natural Language Processing; Regular Expression; Regular Language; Computational Linguistics
- Steven Abney. Partial parsing via finite-state cascades. In John Carroll, editor, Workshop on Robust Parsing; Eighth European Summer School in Logic, Language and Information, pages 8–15, 1995.
- Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
- Gosse Bouma. A modern computational linguistics course using Dutch. In EACL 99: Computer and Internet Supported Education in Language and Speech Technology. Proceedings of a Workshop sponsored by ELSNET and The Association for Computational Linguistics, Bergen Norway, 1999.
- Christian S. Calude, Kai Salomaa, and Sheng Yu. Metric lexical analysis. In O. Boldt, H. Juergensen, and L. Robbins, editors, Workshop on Implementing Automata; WIA99 Pre-Proceedings, Potsdam Germany, 1999.
- Jean-Pierre Chanod and Pasi Tapanainen. A robust finite-state grammar for French. In John Carroll, editor, Workshop on Robust Parsing, Prague, 1996. These proceedings are also available as Cognitive Science Research Paper #435; School of Cognitive and Computing Sciences, University of Sussex.
- P.C. Uit den Boogaart. Woordfrequenties in geschreven en gesproken Nederlands. Oosthoek, Scheltema & Holkema, Utrecht, 1975. Werkgroep Frequentie-onderzoek van het Nederlands.
- Dale Gerdemann and Gertjan van Noord. Transducers from rewrite rules with backreferences. In Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen Norway, 1999.
- Gregory Grefenstette. Light parsing as finite-state filtering. In ECAI 1996 Workshop Extended Finite-State Models of Language, Budapest, 1996.
- John E. Hopcroft. An n log n algorithm for minimizing the states in a finite automaton. In Z. Kohavi, editor, The Theory of Machines and Computations, pages 189–196. Academic Press, 1971.
- John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979.
- C. Douglas Johnson. Formal Aspects of Phonological Descriptions. Mouton, The Hague, 1972.
- Ronald Kaplan and Martin Kay. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–379, 1994.
- Lauri Karttunen. The replace operator. In 33rd Annual Meeting of the Association for Computational Linguistics, M.I.T. Cambridge Mass., 1995.
- Lauri Karttunen. Directed replacement. In 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, 1996.
- Lauri Karttunen. The replace operator. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, pages 117–147. Bradford, MIT Press, 1997.
- Lauri Karttunen. The proper treatment of optimality theory in computational phonology. In Finite-state Methods in Natural Language Processing, pages 1–12, Ankara, 1998.
- George Anton Kiraz and Edmund Grimley-Evans. Multi-tape automata for speech and language systems: A Prolog implementation. In Derick Wood and Sheng Yu, editors, Automata Implementation. Second International Workshop on Implementing Automata, WIA ’97, pages 87–103. Springer Lecture Notes in Computer Science 1436, 1998.
- Mehryar Mohri, Fernando C.N. Pereira, and Michael Riley. A rational design for a weighted finite-state transducer library. In Automata Implementation. Second International Workshop on Implementing Automata, WIA ’97. Springer Verlag, 1998. Lecture Notes in Computer Science 1436.
- Mehryar Mohri and Richard Sproat. An efficient compiler for weighted rewrite rules. In 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, 1996.
- Alan Prince and Paul Smolensky. Optimality theory: Constraint interaction in generative grammar. Technical Report TR-2, Rutgers University Cognitive Science Center, New Brunswick, NJ, 1993. MIT Press, to appear.
- D. Raymond and D. Wood. The grail papers. Technical Report TR-491, University of Western Ontario, Department of Computer Science, London Ontario, 1996.
- Emmanuel Roche. Parsing with finite-state transducers. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, pages 241–281. MIT Press, Cambridge, 1997.
- Emmanuel Roche and Yves Schabes. Introduction. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing. MIT Press, Cambridge, Mass, 1997.
- Gertjan van Noord. FSA Utilities: A toolbox to manipulate finite-state automata. In Darrell Raymond, Derick Wood, and Sheng Yu, editors, Automata Implementation, pages 87–108. Springer Verlag, 1997. Lecture Notes in Computer Science 1260.
- Gertjan van Noord. FSA Utilities (version 5), 1998. The FSA Utilities toolbox is available free of charge under the GNU General Public License at http://www.let.rug.nl/~vannoord/Fsa/.
- Gertjan van Noord. The treatment of epsilon moves in subset construction. In Finite-state Methods in Natural Language Processing, Ankara, 1998. cmp-lg/9804003. Accepted for Computational Linguistics.
- Bruce W. Watson. Taxonomies and Toolkits of Regular Language Algorithms. PhD thesis, Eindhoven University of Technology, 1995.