Abstract
In information retrieval and document categorization, both Euclidean-distance-based and cosine-based similarity models assume that term vectors are orthogonal. This assumption does not hold, and as a result such models ignore associations between terms. This paper analyzes the properties of the term-document space, the term-category space, and the category-document space. Then, without assuming term independence, we propose a new mathematical model to estimate the association between terms and define a ∞-similarity model of documents. The model makes as much use as possible of the category membership already represented in the corpus, with the objective of improving categorization performance. Empirical results obtained with a k-NN classifier on the Reuters-21578 corpus show that exploiting term association improves the effectiveness of the categorization system and that the ∞-similarity model outperforms models without term association.
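To make the orthogonality point concrete, the following minimal sketch contrasts plain cosine similarity with a cosine computed through a term-association matrix. It is not the ∞-similarity model defined in the paper, whose formula is not given in this abstract. The matrix `G`, the function names, and the three-term vocabulary are all hypothetical, chosen only to illustrate how dropping the independence assumption changes a similarity score.

```python
import math

def cosine(x, y):
    """Plain cosine similarity: implicitly treats term axes as orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def bilinear(x, y, g):
    """Compute x^T G y, where g[i][j] is the association between terms i and j."""
    return sum(x[i] * g[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def association_cosine(x, y, g):
    """Cosine under the inner product induced by g (assumed symmetric, positive semidefinite)."""
    norm = math.sqrt(bilinear(x, x, g)) * math.sqrt(bilinear(y, y, g))
    return bilinear(x, y, g) / norm if norm else 0.0

# Hypothetical vocabulary: ["car", "automobile", "banana"]
doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"

# Hypothetical association matrix: "car" and "automobile" are strongly related.
G = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

plain = cosine(doc_a, doc_b)                 # no shared terms, so the score is zero
assoc = association_cosine(doc_a, doc_b, G)  # association between the two terms is counted
```

With the identity matrix in place of `G`, `association_cosine` reduces to plain cosine, which is the sense in which the standard models assume orthogonal terms. In this sketch the association values are set by hand; the paper instead estimates them from the category membership in the corpus.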
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kou, H., Gardarin, G. (2002). Similarity Model and Term Association for Document Categorization. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds) Natural Language Processing and Information Systems. NLDB 2002. Lecture Notes in Computer Science, vol 2553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36271-1_22
DOI: https://doi.org/10.1007/3-540-36271-1_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00307-6
Online ISBN: 978-3-540-36271-5
eBook Packages: Springer Book Archive