Learning Outliers to Refine a Corpus for Chinese Webpage Categorization

  • Conference paper
Advances in Natural Computation (ICNC 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3610)

Abstract

Webpage categorization has become an important topic in recent years. Because text is usually the main content of a webpage, automatic text categorization (ATC) is the key technique for the task. For Chinese text categorization, and for Chinese webpage categorization in particular, one of the most basic and urgent problems is the construction of a good benchmark corpus. This study presents a machine learning approach to refining a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is used to identify outliers in the corpus. The webpage categorization system is built on the standard k-nearest-neighbor (kNN) algorithm under a vector space model (VSM). Simulation results, together with manual inspection of the identified outliers, indicate that the presented method works well.
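The abstract does not say how AdaBoost is used to flag outliers, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes binary labels in {-1, +1}, decision stumps as the weak learner, and one common heuristic: samples that still carry a large weight after a few boosting rounds are hard to fit and are therefore candidates for manual inspection. The function names (`stump_predict`, `train_stump`, `adaboost_weights`, `rank_suspects`) are hypothetical.

```python
import math


def stump_predict(stump, x):
    """Predict +1 or -1 with a one-feature threshold rule (a decision stump)."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def train_stump(X, y, w):
    """Return (weighted_error, stump) for the best stump, or None if no split exists."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({x[feature] for x in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for threshold in ((a + b) / 2 for a, b in zip(values, values[1:])):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best


def adaboost_weights(X, y, rounds=3):
    """Run discrete AdaBoost and return the final normalized sample weights."""
    n = len(X)
    w = [1.0 / n] * n
    for _ in range(rounds):
        result = train_stump(X, y, w)
        if result is None:
            break
        err, stump = result
        # Stop if the weak learner is perfect or no better than chance.
        if err <= 0.0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Misclassified samples gain weight; correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return w


def rank_suspects(X, y, rounds=3):
    """Return sample indices sorted from most to least suspicious (assumed heuristic)."""
    w = adaboost_weights(X, y, rounds)
    return sorted(range(len(X)), key=lambda i: w[i], reverse=True)
```

In a corpus-refinement setting, `X` would hold document feature vectors (for example VSM term weights) and the top-ranked indices would be handed to a human reviewer rather than deleted automatically. Sample weights fluctuate from round to round, so the ranking depends on the chosen number of rounds.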

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luo, D., Wang, X., Wu, X., Chi, H. (2005). Learning Outliers to Refine a Corpus for Chinese Webpage Categorization. In: Wang, L., Chen, K., Ong, Y.S. (eds) Advances in Natural Computation. ICNC 2005. Lecture Notes in Computer Science, vol 3610. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11539087_19

  • DOI: https://doi.org/10.1007/11539087_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28323-2

  • Online ISBN: 978-3-540-31853-8

  • eBook Packages: Computer Science, Computer Science (R0)
