Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2035)

Abstract

Text categorization presents unique challenges due to the large number of attributes in the data set, the large number of training samples, attribute dependency, and the multi-modality of categories. Existing classification techniques have limited applicability to data sets of this nature. In this paper, we present Weight Adjusted k-Nearest Neighbor (WAKNN) classification, which learns feature weights using a greedy hill-climbing technique. We also present two performance optimizations of WAKNN that improve its computational performance by a few orders of magnitude without compromising classification quality. We experimentally evaluated WAKNN on 52 document data sets from a variety of domains and compared its performance against several classification algorithms, such as C4.5, RIPPER, Naive Bayesian, PEBLS, and VSM. Experimental results on these data sets confirm that WAKNN consistently outperforms the other existing classification algorithms.
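The abstract describes WAKNN only at a high level, so the following is a minimal sketch of the general idea rather than the paper's algorithm. It assumes a weighted cosine similarity, a similarity-weighted k-nearest-neighbor vote, leave-one-out training accuracy as the objective, and a fixed set of multiplicative step factors; all function names, parameters, and the toy data are hypothetical, and the paper's two performance optimizations are not modeled.

```python
import math
from collections import defaultdict


def weighted_cosine(x, y, w):
    """Cosine similarity after scaling feature i of both vectors by w[i]."""
    dot = sum((wi * a) * (wi * b) for wi, a, b in zip(w, x, y))
    nx = math.sqrt(sum((wi * a) ** 2 for wi, a in zip(w, x)))
    ny = math.sqrt(sum((wi * b) ** 2 for wi, b in zip(w, y)))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0


def knn_predict(query, docs, labels, w, k, skip=None):
    """Similarity-weighted vote among the k most similar training documents.

    `skip` is the index of a training document to leave out (assumed
    leave-one-out evaluation).
    """
    sims = [(weighted_cosine(query, d, w), lab)
            for i, (d, lab) in enumerate(zip(docs, labels)) if i != skip]
    sims.sort(key=lambda t: t[0], reverse=True)
    votes = defaultdict(float)
    for s, lab in sims[:k]:
        votes[lab] += s
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(votes), key=votes.get)


def loo_accuracy(docs, labels, w, k):
    """Leave-one-out accuracy on the training set (assumed objective)."""
    hits = sum(knn_predict(d, docs, labels, w, k, skip=i) == labels[i]
               for i, d in enumerate(docs))
    return hits / len(docs)


def hill_climb_weights(docs, labels, k=1, factors=(0.2, 5.0), max_rounds=20):
    """Greedy hill climbing over feature weights.

    Each round tries scaling every single weight by every factor, applies
    the one change that improves the objective most, and stops when no
    single change improves it.
    """
    w = [1.0] * len(docs[0])
    best = loo_accuracy(docs, labels, w, k)
    for _ in range(max_rounds):
        best_move = None
        for j in range(len(w)):
            for f in factors:
                trial = w[:]
                trial[j] *= f
                score = loo_accuracy(docs, labels, trial, k)
                if score > best:
                    best, best_move = score, trial
        if best_move is None:
            break
        w = best_move
    return w, best


# Hypothetical toy data: features 0 and 1 indicate the class,
# feature 2 is a noisy term shared across classes.
docs = [[1, 0, 3], [1, 0, 0], [0, 1, 3], [0, 1, 0]]
labels = ["A", "A", "B", "B"]
weights, acc = hill_climb_weights(docs, labels, k=1)
print(weights, acc)
```

Re-evaluating the objective for every candidate change is expensive on real document collections with thousands of terms, which is presumably the cost the paper's optimizations target; this sketch makes no attempt to reproduce them.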

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, E.H., Karypis, G., Kumar, V. (2001). Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_9

  • DOI: https://doi.org/10.1007/3-540-45357-1_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41910-5

  • Online ISBN: 978-3-540-45357-4

  • eBook Packages: Springer Book Archive
