Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List

Makagonov, Pavel; Alexandrov, Mikhail

doi:10.1007/3-540-45715-1_45

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2276)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Many document categorization and clustering tasks require a word frequency list to be learned automatically from a corpus. However, morphological variation distorts the statistics when a program treats words as mere letter strings, so strings that are morphological variants of the same base meaning must be identified. Because large morphological dictionaries have well-known technical disadvantages, we propose an approximate heuristic method for this identification, based on an empirical formula for testing the similarity of two words. We give a simple method for determining the formula's parameters. The formula depends on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts. An iterative algorithm then constructs the word frequency list from the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system, Dictionary Designer.
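
The abstract names the two quantities the similarity test uses but does not state the formula itself, so the following Python sketch is only an illustration under stated assumptions, not the authors' formula or the Dictionary Designer implementation. The decision rule (a minimum common prefix length plus an upper bound on the share of non-coincident final letters), the parameter names `min_prefix` and `max_tail_ratio`, their default values, and the single greedy pass over the sorted vocabulary are all hypothetical stand-ins.

```python
from collections import Counter


def common_prefix_len(a: str, b: str) -> int:
    """Number of coincident letters in the initial parts of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def similar(a: str, b: str, min_prefix: int = 4, max_tail_ratio: float = 0.4) -> bool:
    """Hypothetical similarity test (NOT the paper's empirical formula).

    y: coincident initial letters; n: non-coincident final letters in both
    words; s: total letters in both words. The thresholds are assumptions.
    """
    y = common_prefix_len(a, b)
    n = (len(a) - y) + (len(b) - y)
    s = len(a) + len(b)
    return s > 0 and y >= min_prefix and n / s <= max_tail_ratio


def frequency_list(tokens, min_prefix: int = 4, max_tail_ratio: float = 0.4):
    """Merge similar words and count them under their common initial part.

    Assumed simplification: one greedy pass over the alphabetically sorted
    vocabulary, comparing each word with the current group's common part.
    """
    counts = Counter(t.lower() for t in tokens)
    groups = []  # each entry is [common_part, total_count]
    for word in sorted(counts):
        if groups and similar(groups[-1][0], word, min_prefix, max_tail_ratio):
            stem = groups[-1][0]
            # Shrink the group's label to the part shared with the new word.
            groups[-1][0] = word[:common_prefix_len(stem, word)]
            groups[-1][1] += counts[word]
        else:
            groups.append([word, counts[word]])
    # Most frequent groups first; ties broken alphabetically.
    return sorted(((s, c) for s, c in groups), key=lambda g: (-g[1], g[0]))


english = ["cluster", "clusters", "clustering", "clustered", "word", "words", "class"]
spanish = ["trabajo", "trabajos", "trabajar"]
print(frequency_list(english))
print(frequency_list(spanish))
```

Sorting the vocabulary first places words with a shared beginning next to each other, which is why a single pass suffices for this toy version. In practice the thresholds would have to be fitted to the language at hand, which is the role the abstract assigns to its parameter determination method.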

Author information

Authors and Affiliations

Moscow Mayor’s Directorate, Moscow City Government, Novi Arbat 36, 121205, Moscow, Russia
Pavel Makagonov
Center for Computing Research, National Polytechnic Institute (IPN), Av. Juan de Dios Batiz, C.P. 07738, Mexico, DF
Mikhail Alexandrov

Editor information

Editors and Affiliations

CIC Centro de Investigacion en Computacion, IPN Instituto Politecnico Nacional, Col Zacatenco, CP 07738, Mexico DF, Mexico
Alexander Gelbukh

Cite this paper

Makagonov, P., Alexandrov, M. (2002). Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_45

DOI: https://doi.org/10.1007/3-540-45715-1_45
Published: 05 February 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive
