Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2276)

Abstract

In many document categorization and clustering tasks, a word frequency list must be learned automatically from a corpus. However, morphological variation distorts the statistics when a program treats words as mere letter strings, so it is important to identify the strings that arise from morphological variation of the same base meaning. Because large morphological dictionaries have well-known technical disadvantages, we propose an approximate heuristic method for this identification, based on an empirical formula for testing the similarity of two words. The formula depends on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts, and we give a simple method for determining its parameters. An iterative algorithm then constructs the word frequency list from the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system Dictionary Designer.
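The abstract does not state the formula or its parameter values, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a hypothetical rule: two words count as similar when their common initial part has at least `min_prefix` letters and the share of non-coincident final letters among all letters of both words does not exceed `max_ratio`. The function names, both parameters, and their default values are assumptions made for this example, and the grouping step is a simplified single pass over the sorted vocabulary rather than the iterative algorithm of the paper.

```python
from collections import Counter


def common_prefix_len(a, b):
    """Number of coincident letters in the initial parts of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def similar(a, b, min_prefix=3, max_ratio=0.3):
    """Hypothetical similarity test (parameters are illustrative assumptions).

    y: coincident letters in the initial parts of the two words
    n: non-coincident letters in the final parts of the two words
    s: total number of letters in both words
    """
    y = common_prefix_len(a, b)
    if y < min_prefix:
        return False
    n = (len(a) - y) + (len(b) - y)
    s = len(a) + len(b)
    return n / s <= max_ratio


def build_frequency_list(tokens, min_prefix=3, max_ratio=0.3):
    """Merge similar neighbouring words of the sorted vocabulary.

    Each group is keyed by the common initial part of its members,
    and the counts of all members are summed under that key.
    """
    counts = Counter(tokens)
    freq = {}
    stem = None
    for word in sorted(counts):
        if stem is not None and similar(stem, word, min_prefix, max_ratio):
            # Shrink the group key to the part shared with the new word.
            total = freq.pop(stem) + counts[word]
            stem = stem[:common_prefix_len(stem, word)]
            freq[stem] = freq.get(stem, 0) + total
        else:
            # Start a new group with this word as its key.
            stem = word
            freq[stem] = freq.get(stem, 0) + counts[word]
    return freq


english = "cluster clusters clustering clue word words world".split()
print(build_frequency_list(english))
# {'clue': 1, 'cluster': 3, 'word': 2, 'world': 1}

spanish = "documento documentos documental".split()
print(build_frequency_list(spanish))
# {'document': 3}
```

With these placeholder thresholds, "word" and "world" stay separate because three of their nine letters differ after the shared prefix "wor", while "documental", "documento" and "documentos" collapse to the common part "document". In practice the thresholds would have to be fitted per language, which is the role of the parameter determination method the abstract mentions.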

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Makagonov, P., Alexandrov, M. (2002). Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_45

  • DOI: https://doi.org/10.1007/3-540-45715-1_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43219-7

  • Online ISBN: 978-3-540-45715-2

  • eBook Packages: Springer Book Archive
