Abstract
In this paper we present a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5 on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities and accounts for dependencies between the terms in the different classes.
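The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common form of centroid-based classification, under stated assumptions: documents are sparse term-weight vectors (for example tf-idf), each vector is scaled to unit length, a class centroid is the average of its training vectors, and a new document is assigned to the class whose centroid has the highest cosine similarity. The function names (`normalize`, `train_centroids`, `cosine`, `classify`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse term-weight vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def train_centroids(docs, labels):
    """Average the unit-length vectors of each class (one pass over the data)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in zip(docs, labels):
        counts[label] += 1
        for term, w in normalize(vec).items():
            sums[label][term] += w
    return {
        label: {term: w / counts[label] for term, w in terms.items()}
        for label, terms in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(centroids, vec):
    """Return the label of the centroid most similar to the document."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy data (assumed for illustration only): raw term counts as weights.
train_docs = [
    {"ball": 2, "team": 1},
    {"team": 3, "score": 1},
    {"vote": 2, "party": 1},
    {"party": 2, "election": 2},
]
train_labels = ["sports", "sports", "politics", "politics"]

centroids = train_centroids(train_docs, train_labels)
sports_guess = classify(centroids, {"score": 1, "team": 1})
politics_guess = classify(centroids, {"election": 1})
```

In this sketch, training touches each nonzero term weight once, and classifying a document costs one sparse dot product per class, which is in keeping with the linear-time behavior the abstract describes.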
This work was supported by NSF CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute. Related papers are available via WWW at URL: http://www.cs.umn.edu/~karypis
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, EH., Karypis, G. (2000). Centroid-Based Document Classification: Analysis and Experimental Results. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_46
DOI: https://doi.org/10.1007/3-540-45372-5_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive