Abstract
Many document categorization and clustering tasks require automatically learning a word frequency list from a corpus. However, morphological variants of a word distort the statistics when the program treats words as mere letter strings. It is therefore important to identify the strings that result from morphological variation of the same base meaning. Since large morphological dictionaries have well-known technical disadvantages, we propose an approximate heuristic method for such identification, based on an empirical formula for testing the similarity of two words. We give a simple method for determining the formula parameters. The formula is based on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts. An iterative algorithm constructs the word frequency list using the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system Dictionary Designer.
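The idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the empirical formula or its parameter values, so the decision rule below is a stand-in and not the authors' formula: two words count as similar when their common initial part has at least `min_prefix` letters and the non-coincident final letters make up at most `max_tail_ratio` of all letters in the pair. The function names, both parameters, their default values, and the single sorted pass (used here in place of the paper's iterative algorithm) are all assumptions made for illustration.

```python
from collections import Counter


def common_prefix_len(a, b):
    """Number of coincident letters in the initial parts of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def are_similar(a, b, min_prefix=4, max_tail_ratio=0.5):
    """Hypothetical similarity test; NOT the formula from the paper.

    y = coincident initial letters, n = non-coincident final letters.
    Both thresholds are illustrative assumptions.
    """
    y = common_prefix_len(a, b)
    if y < min_prefix:
        return False
    n = (len(a) - y) + (len(b) - y)
    return n / (len(a) + len(b)) <= max_tail_ratio


def build_frequency_list(tokens, min_prefix=4, max_tail_ratio=0.5):
    """Merge similar words and key each group by its common initial part.

    Simplification: one pass over the alphabetically sorted vocabulary,
    comparing each word with the common part of the current group.
    """
    counts = Counter(tokens)
    groups = []  # each entry is [common_part, total_count]
    for word in sorted(counts):
        if groups and are_similar(groups[-1][0], word, min_prefix, max_tail_ratio):
            stem = groups[-1][0]
            # Shrink the group's key to the part shared with the new word.
            groups[-1][0] = stem[:common_prefix_len(stem, word)]
            groups[-1][1] += counts[word]
        else:
            groups.append([word, counts[word]])
    return [(stem, count) for stem, count in groups]


if __name__ == "__main__":
    english = ["connect", "connected", "connection", "content", "connect"]
    spanish = ["trabajo", "trabajos", "trabajar"]
    print(build_frequency_list(english))
    print(build_frequency_list(spanish))
```

With these assumed thresholds, the English tokens collapse to `connect` (4 occurrences) and `content` (1), and the Spanish tokens collapse to `trabaj` (3). Tuning `min_prefix` and `max_tail_ratio` on a sample of hand-labelled word pairs would play the role that the parameter-determination method plays in the paper.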
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Makagonov, P., Alexandrov, M. (2002). Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_45
DOI: https://doi.org/10.1007/3-540-45715-1_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive