Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

Han, Eui-Hong (Sam); Karypis, George; Kumar, Vipin

doi:10.1007/3-540-45357-1_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2035)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Text categorization presents unique challenges due to the large number of attributes present in the data set, the large number of training samples, attribute dependency, and the multi-modality of categories. Existing classification techniques have limited applicability to data sets of this nature. In this paper, we present Weight Adjusted k-Nearest Neighbor (WAKNN) classification, which learns feature weights using a greedy hill-climbing technique. We also present two performance optimizations of WAKNN that improve its computational performance by a few orders of magnitude without compromising classification quality. We experimentally evaluated WAKNN on 52 document data sets from a variety of domains and compared its performance against several classification algorithms, such as C4.5, RIPPER, Naive-Bayesian, PEBLS and VSM. Experimental results on these data sets confirm that WAKNN consistently outperforms other existing classification algorithms.
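
The idea of learning feature weights for a k-nearest-neighbor classifier by greedy hill climbing can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the similarity measure, the objective function, or the step sizes, so the weighted cosine similarity, the leave-one-out accuracy objective, the multiplicative `factors`, and all function names below are assumptions chosen for illustration. The two performance optimizations mentioned in the abstract are not reproduced.

```python
import math
from collections import defaultdict

def weighted_cosine(x, y, w):
    """Cosine similarity after scaling feature i by weight w[i] (assumed measure)."""
    dot = sum(wi * wi * xi * yi for wi, xi, yi in zip(w, x, y))
    nx = math.sqrt(sum((wi * xi) ** 2 for wi, xi in zip(w, x)))
    ny = math.sqrt(sum((wi * yi) ** 2 for wi, yi in zip(w, y)))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def knn_predict(query, train_X, train_y, w, k, skip=None):
    """Label with the largest similarity-weighted vote among the k nearest
    training documents; `skip` excludes one index (for leave-one-out)."""
    sims = [(weighted_cosine(query, x, w), label)
            for i, (x, label) in enumerate(zip(train_X, train_y)) if i != skip]
    sims.sort(key=lambda t: t[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += sim
    # Sorting the labels first makes ties deterministic.
    return max(sorted(votes), key=votes.get)

def loo_accuracy(X, y, w, k):
    """Leave-one-out accuracy on the training set (assumed objective)."""
    hits = sum(knn_predict(X[i], X, y, w, k, skip=i) == y[i]
               for i in range(len(X)))
    return hits / len(X)

def hill_climb_weights(X, y, k=3, factors=(0.2, 0.5, 2.0, 5.0), max_rounds=20):
    """Greedy hill climbing: in each round, try rescaling every single
    weight by every factor, apply the one change that improves the
    objective most, and stop when no change improves it."""
    w = [1.0] * len(X[0])
    best = loo_accuracy(X, y, w, k)
    for _ in range(max_rounds):
        move = None
        for j in range(len(w)):
            for f in factors:
                trial = w[:j] + [w[j] * f] + w[j + 1:]
                score = loo_accuracy(X, y, trial, k)
                if score > best:
                    best, move = score, trial
        if move is None:
            break  # local optimum reached
        w = move
    return w, best

# Toy data (hypothetical): the first two features carry the class signal,
# the third is noise.
X = [[1, 0, 3], [1, 0, 0.5], [0.9, 0.1, 4],
     [0, 1, 3], [0, 1, 0.5], [0.1, 0.9, 4]]
y = ["a", "a", "a", "b", "b", "b"]
weights, accuracy = hill_climb_weights(X, y, k=1)
```

Because a change is accepted only when it strictly improves the objective, the returned accuracy is never below that of uniform weights on the same data. Each round costs one leave-one-out pass per feature and factor, which suggests why optimizations matter when the vocabulary is large.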

Author information

Authors and Affiliations

Department of Computer Science & Engineering, University of Minnesota, 4-192 EE/CSci Building, 200 Union Street SE, Minneapolis, MN 55455
Eui-Hong (Sam) Han, George Karypis & Vipin Kumar

Editor information

Editors and Affiliations

Dept. of Computer Science and Information Systems, The University of Hong Kong, Pokfulam, Hong Kong, China
David Cheung
CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia
Graham J. Williams
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China
Qing Li

About this paper

Cite this paper

Han, E.-H.(S.), Karypis, G., Kumar, V. (2001). Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science (LNAI), vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_9

DOI: https://doi.org/10.1007/3-540-45357-1_9
Published: 11 April 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4
eBook Packages: Springer Book Archive
