Automatic acquisition of lexical knowledge from sparse and noisy data

  • René Schneider
Regular Papers Applications of ML
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1398)


Optical character recognition (OCR) still causes a considerable amount of information loss and noise in texts, so that many documents are unsuitable for information extraction systems. This paper introduces a statistical method for bootstrapping a lexicon from a very small number of “noisy,” domain-specific texts. The method identifies regularities in grammatical forms as well as recurring ungrammatical forms in the input text. Through a combination of frequency lists and Levenshtein matrices, a language-independent, robust core lexicon is constructed that also supports the analysis of “noisy” texts.
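To make the combination of frequency lists and Levenshtein matrices more concrete, here is a minimal Python sketch. It is not the procedure from the paper, whose details the abstract does not give. It only illustrates the general idea under stated assumptions: frequent word forms are treated as a core lexicon, and rare forms within a small edit distance of a core form are treated as noisy variants of it. The function names (levenshtein, build_lexicon), the thresholds (min_count, max_dist), and the sample tokens are all hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row over the Levenshtein matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def build_lexicon(tokens, min_count=2, max_dist=2):
    """Map each observed form to a frequent core form (assumed heuristic)."""
    freq = Counter(tokens)
    # Core forms, ordered from most to least frequent.
    core = [w for w, c in freq.most_common() if c >= min_count]
    core_set = set(core)
    lexicon = {}
    for word in freq:
        if word in core_set:
            lexicon[word] = word
            continue
        # Nearest core form; on ties, min() keeps the more frequent one.
        best = min(core, key=lambda c: levenshtein(word, c), default=None)
        if best is not None and levenshtein(word, best) <= max_dist:
            lexicon[word] = best   # treated as a noisy variant
        else:
            lexicon[word] = word   # no close core form, kept as is
    return lexicon


# Hypothetical OCR output with typical character confusions (o/0, i/l, m/rn).
tokens = ("invoice invoice invoice inv0ice lnvoice "
          "amount amount arnount total").split()
lexicon = build_lexicon(tokens)
print(lexicon)
```

In this toy input the rare forms "inv0ice" and "lnvoice" are one edit away from the frequent form "invoice", and "arnount" is two edits away from "amount", so all three are mapped to their frequent counterparts, while "total" has no close core form and is kept unchanged. A fixed distance threshold is a simplification; short words in particular would need a stricter limit.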


Word Form, Optical Character Recognition, Levenshtein Distance, Lexical Knowledge, Frequency List



Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • René Schneider, Department of Speech and Language Understanding, Daimler-Benz AG, Institute of Information Technology, Ulm, Germany
