Abstract
This paper details the steps involved in scaling up a lexicalised finite-state morphology transducer for use on unrestricted text. Our starting point was a baseline inflectional morphology engine [1] with 81% token coverage, measured against a 15-million-word corpus of Irish texts [2]. Manually scaling the lexicon component of a finite-state transducer (FST) is time-consuming and expensive, and the result is rarely, if ever, complete. To scale up the engine we combined several strategies: semi-automatic population of the finite-state lexicon from machine-readable dictionary resources and, using optical character recognition, from printed resources; the addition of derivational morphology; and the development of morphological guessers. We report the coverage increase contributed by each step. The full system achieves token coverage of 93%, which is extended to 100% through the use of morphological guessers.
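To make the coverage figures concrete, the following is a minimal sketch of how token coverage might be computed: the share of running tokens in a corpus that receive at least one analysis. It is not the authors' evaluation code. The names `token_coverage`, `in_lexicon`, `with_guesser` and the toy `LEXICON` are hypothetical stand-ins, and a real system would query the transducer instead of a Python set.

```python
from collections import Counter


def token_coverage(tokens, recognise):
    """Return the fraction of corpus tokens for which recognise(token) is True."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Each distinct form is checked once and weighted by its frequency.
    covered = sum(n for tok, n in counts.items() if recognise(tok))
    return covered / total


# Hypothetical toy lexicon; stands in for a lookup in the FST lexicon.
LEXICON = {"an", "fear", "bean", "teach"}


def in_lexicon(token):
    return token.lower() in LEXICON


def with_guesser(token):
    # Hypothetical guesser: accepts any alphabetic token the lexicon missed.
    return in_lexicon(token) or token.isalpha()


corpus = ["An", "fear", "agus", "an", "bhean", "teach"]
print(token_coverage(corpus, in_lexicon))
print(token_coverage(corpus, with_guesser))
```

In this toy setting the guesser accepts every alphabetic token, which mirrors how a fallback guesser can raise token coverage to 100% even though the lexicon alone does not recognise every form.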
References
[1] Uí Dhonnchadha, E.: An analyser and generator for Irish inflectional morphology using finite state transducers. Master’s thesis, School of Computing, Dublin City University, Dublin, Ireland (2002)
[2] ITÉ (accessed, November 2005), http://www.ite.ie/corpus/
[3] Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Studies in Computational Linguistics. CSLI Publications (2003)
[4] Karttunen, L., Beesley, K.R.: Two-level rule compiler. Technical report, Xerox PARC (1992)
[5] Oideachais, A.R.: Foclóir Póca English-Irish/Irish-English Dictionary. An Gúm, Baile Átha Cliath (1986)
[6] Symbols (accessed, November 2005), http://www.symbols.net/names/
[7] Uí Dhonnchadha, E., Nic Pháidín, C., Van Genabith, J.: Design, implementation and evaluation of an inflectional morphology finite-state transducer for Irish. MT - Machine Translation: Special Issue on Finite State Language Resources and Language Processing (in press)
[8] Críostaí, B.: Graiméar Gaeilge na mBráithre Críostaí. An Gúm, Baile Átha Cliath (1999)
[9] Ó Dónaill, N.: Foclóir Gaeilge Béarla. Oifig an tSoláthair, Baile Átha Cliath (1977)
[10] Ó Droighneáin, M.: An Sloinnteoir Gaeilge agus an tAinmneoir. Coiscéim, Baile Átha Cliath (1991)
[11] Ó Siochfhrada, N.: Foclóir Gaeilge/Béarla - Béarla/Gaeilge. An Comhlacht Oideachais, Baile Átha Cliath (1998)
[12] Grefenstette, G., Schiller, A., Ait-Mokhtar, S.: Recognizing lexical patterns in text. In: van Eynde, F., Gibbon, D. (eds.) Lexicon Development for Speech and Language Processing. Kluwer Academic Publishers, Dordrecht (2000)
[13] Kilgarriff, A., Rundell, M., Uí Dhonnchadha, E.: Efficient corpus creation for lexicography. Language Resources and Evaluation Journal (forthcoming)
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Dhonnchadha, E.U., Van Genabith, J. (2006). Scaling an Irish FST Morphology Engine for Use on Unrestricted Text. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds) Finite-State Methods and Natural Language Processing. FSMNLP 2005. Lecture Notes in Computer Science, vol 4002. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780885_24
DOI: https://doi.org/10.1007/11780885_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35467-3
Online ISBN: 978-3-540-35469-7
eBook Packages: Computer Science, Computer Science (R0)