Treatment of Unknown Words

Daciuk, Jan

doi:10.1007/3-540-45526-4_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2214)

Included in the following conference series:

International Workshop on Implementing Automata

Abstract

Unrestricted texts almost always contain words that are not present in the dictionary. There is nevertheless a need to obtain their likely base forms (in lemmatization), their morphological categories (in tagging), or both. Some of these words find their way into dictionaries, and it would be useful to predict what their entries should look like. Humans can perform those tasks using the endings of words (sometimes prefixes and infixes as well), and so can computers. Previous approaches used manually constructed lists of endings and associated information. Brill proposed transformation-based learning from corpora, and Mikheev applied Brill’s approach to data for a morphological lexicon. However, both Brill’s algorithm and Mikheev’s algorithm, which is derived from it, lack speed in both the rule acquisition phase and the rule application phase. Their algorithms handle only the case of tagging, although an extension to other tasks seems possible. We propose a very fast finite-state method that handles all of the tasks described above and achieves a similar quality of guessing.
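
The idea of guessing a base form and a category from a word's ending can be illustrated with a small sketch. The abstract does not describe the automaton construction itself, so the code below is not the chapter's method: it is a toy trie over reversed word forms, assuming a training lexicon of (form, lemma, tag) triples. The names `EndingGuesser` and `make_rule`, the tag labels, and the sample lexicon are all hypothetical.

```python
from collections import Counter
from os.path import commonprefix


def make_rule(form, lemma):
    """Return (strip, add): drop `strip` final characters of `form`, then append `add`."""
    stem_len = len(commonprefix([form, lemma]))  # character-wise common prefix
    return len(form) - stem_len, lemma[stem_len:]


class EndingGuesser:
    """Toy ending-based guesser (illustrative only, not the chapter's automaton)."""

    def __init__(self):
        self.root = {"rules": Counter(), "next": {}}

    def add(self, form, lemma, tag):
        # Record the lemmatization rule and tag at every ending of the form.
        rule = make_rule(form, lemma) + (tag,)
        node = self.root
        node["rules"][rule] += 1
        for ch in reversed(form):
            node = node["next"].setdefault(ch, {"rules": Counter(), "next": {}})
            node["rules"][rule] += 1

    def guess(self, word, top=3):
        # Follow the longest ending of `word` that was seen in training.
        node = self.root
        for ch in reversed(word):
            child = node["next"].get(ch)
            if child is None:
                break
            node = child
        guesses = []
        for (strip, add, tag), _count in node["rules"].most_common(top):
            stem = word[:max(0, len(word) - strip)]
            guesses.append((stem + add, tag))
        return guesses


# Hypothetical training lexicon: (inflected form, lemma, tag).
LEXICON = [
    ("walked", "walk", "VBD"), ("talked", "talk", "VBD"), ("jumped", "jump", "VBD"),
    ("walking", "walk", "VBG"), ("talking", "talk", "VBG"),
    ("cats", "cat", "NNS"), ("dogs", "dog", "NNS"),
]

guesser = EndingGuesser()
for form, lemma, tag in LEXICON:
    guesser.add(form, lemma, tag)

print(guesser.guess("blorked"))
print(guesser.guess("zorping"))
```

Because the lookup walks the word once from its last character, its cost depends only on the length of the word, not on the number of stored rules. That property is what makes automaton-style guessers attractive compared with applying a list of learned rules one by one.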

References

Eric Brill. A Corpus-Based Approach to Language Learning. PhD thesis, Department of Computer and Information Science, University of Pennsylvania, USA, 1993.
Eric Brill. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, December 1995.
Andrei Mikheev. Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3):405–423, September 1997.
Dominique Petitpierre and Graham Russell. MMORPH - the Multext morphology program. Technical report, ISSCO, 54 route des Acacias, CH-1227 Carouge, Switzerland, October 1995.
Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 21(2):227–253, June 1995.
Jan Tokarski. Schematyczny indeks a tergo polskich form wyrazowych. Wydawnictwo Naukowe PWN, 1993.
Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2):359–382, 1993.

Author information

Authors and Affiliations

Department of Applied Informatics, Technical University of Gdańsk, Ul. Narutowicza 11/12, PL80-952, Gdańsk, Poland
Jan Daciuk

Editor information

Editors and Affiliations

Universität Potsdam, Institut für Informatik, August-Bebel-Straße 89, 14482, Potsdam, Germany
Oliver Boldt
Institut für Informatik, Universität Potsdam, Potsdam
Helmut Jürgensen
Department of Computer Science, The University of Western Ontario, London, Ontario, Canada, N6A 5B7
Helmut Jürgensen

About this paper

Cite this paper

Daciuk, J. (2001). Treatment of Unknown Words. In: Boldt, O., Jürgensen, H. (eds) Automata Implementation. WIA 1999. Lecture Notes in Computer Science, vol 2214. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45526-4_7

DOI: https://doi.org/10.1007/3-540-45526-4_7
Published: 16 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42812-1
Online ISBN: 978-3-540-45526-4
eBook Packages: Springer Book Archive
