Treatment of Unknown Words

  • Jan Daciuk
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2214)


Unrestricted texts almost always contain words that are not present in the dictionary. Even so, we need to obtain their likely base forms (in lemmatization), their morphological categories (in tagging), or both. Some of these words eventually find their way into dictionaries, so it is also useful to predict what their entries should look like. Humans can perform these tasks using the endings of words (and sometimes prefixes and infixes as well), and so can computers. Earlier approaches used manually constructed lists of endings and associated information. Brill proposed transformation-based learning from corpora, and Mikheev applied Brill’s approach to data for a morphological lexicon. However, both Brill’s algorithm and Mikheev’s algorithm, which is derived from it, are slow in both the rule acquisition phase and the rule application phase. Both algorithms handle only tagging, although an extension to other tasks seems possible. We propose a very fast finite-state method that handles all of the tasks described above and achieves similar guessing quality.
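
To make the idea of guessing from word endings concrete, the following is a minimal sketch and not the finite-state method of the paper. It assumes a hypothetical five-word lexicon with Penn-Treebank-style tags, and the class name `SuffixGuesser` and its methods are invented for illustration. Known words are stored reversed in a trie, and an unknown word receives the tags seen with its longest matching ending.

```python
from collections import defaultdict


def _new_node():
    # Each node keeps its children and a count of tags seen below it.
    return {"children": {}, "tags": defaultdict(int)}


class SuffixGuesser:
    """Toy ending-based guesser (illustrative only): a trie over reversed words."""

    def __init__(self):
        self.root = _new_node()

    def add(self, word, tag):
        # Walk the word from its last letter to its first, so that words
        # sharing an ending share a path in the trie.
        node = self.root
        node["tags"][tag] += 1
        for ch in reversed(word):
            node = node["children"].setdefault(ch, _new_node())
            node["tags"][tag] += 1

    def guess(self, word):
        # Follow the longest ending of `word` that is present in the trie.
        node = self.root
        for ch in reversed(word):
            nxt = node["children"].get(ch)
            if nxt is None:
                break
            node = nxt
        # Tags observed with that ending, most frequent first (ties by name).
        return sorted(node["tags"], key=lambda t: (-node["tags"][t], t))


# Hypothetical training lexicon; the words and tags are assumptions.
guesser = SuffixGuesser()
for w, t in [("walking", "VBG"), ("talking", "VBG"), ("walked", "VBD"),
             ("happiness", "NN"), ("kindness", "NN")]:
    guesser.add(w, t)

print(guesser.guess("blogging"))  # longest shared ending is "ing"
print(guesser.guess("sadness"))   # longest shared ending is "dness"
```

The same structure could store base-form rewriting instructions instead of tags to cover lemmatization; that extension is left out here to keep the sketch short.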




References

  1. Eric Brill. A Corpus-Based Approach to Language Learning. PhD thesis, Department of Computer and Information Science, University of Pennsylvania, USA, 1993.
  2. Eric Brill. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, December 1995.
  3. Andrei Mikheev. Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3):405–423, September 1997.
  4. Dominique Petitpierre and Graham Russell. MMORPH - the Multext morphology program. Technical report, ISSCO, 54 route des Acacias, CH-1227 Carouge, Switzerland, October 1995.
  5. Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 21(2):227–253, June 1995.
  6. Jan Tokarski. Schematyczny indeks a tergo polskich form wyrazowych. Wydawnictwo Naukowe PWN, 1993.
  7. Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2):359–382, 1993.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Jan Daciuk
    • Department of Applied Informatics, Technical University of Gdańsk, Gdańsk, Poland
