Text Categorization Using an Automatically Generated Labelled Dataset: An Evaluation Study

  • Conference paper
Neural Information Processing (ICONIP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8834)

Abstract

Naïve Bayes (NB), kNN and AdaBoost are three commonly used text classifiers. Evaluating these classifiers involves a variety of factors, including the benchmark used, feature selection, the parameter settings of the algorithms, and the measurement criteria employed. Researchers have demonstrated that some algorithms outperform others on some corpora; however, labelling and corpus bias are two concerns in text categorization. This paper evaluates the three classifiers on an automatically generated text document set that is labelled by a group of experts to alleviate the subjectiveness of labelling, and at the same time examines how the performance of the algorithms is influenced by the feature selection algorithm and the number of features selected.
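The kind of experiment the abstract describes, varying the number of selected features and measuring classifier performance, can be sketched roughly as follows. This is a minimal illustration and not the authors' setup: the toy corpus, the choice of chi-square as the selection criterion, the Laplace-smoothed multinomial Naïve Bayes, and all function names are assumptions made for the example. The abstract does not say which feature selection algorithms or settings the paper used.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical, NOT the paper's dataset): (tokens, label) pairs.
TRAIN = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches now".split(), "spam"),
    ("limited offer buy pills".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes attached".split(), "ham"),
    ("agenda and notes for project".split(), "ham"),
]
TEST = [
    ("buy cheap pills".split(), "spam"),
    ("monday project meeting".split(), "ham"),
]


def chi_square_scores(docs):
    """Score each term by its maximum chi-square statistic over the classes."""
    n = len(docs)
    class_docs = Counter(label for _, label in docs)
    term_docs = Counter()        # number of documents containing the term
    term_class_docs = Counter()  # same count, broken down by class
    for tokens, label in docs:
        for t in set(tokens):
            term_docs[t] += 1
            term_class_docs[(t, label)] += 1
    scores = {}
    for t in term_docs:
        best = 0.0
        for c in class_docs:
            a = term_class_docs[(t, c)]   # term present, class c
            b = term_docs[t] - a          # term present, other classes
            e = class_docs[c] - a         # term absent, class c
            d = n - a - b - e             # term absent, other classes
            denom = (a + e) * (b + d) * (a + b) * (e + d)
            if denom:
                best = max(best, n * (a * d - b * e) ** 2 / denom)
        scores[t] = best
    return scores


def select_features(docs, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_nb(docs, vocab):
    """Multinomial Naive Bayes with add-one smoothing, restricted to vocab."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for tokens, label in docs:
        class_docs[label] += 1
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + len(vocab)
        log_prior = math.log(class_docs[c] / len(docs))
        log_lik = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
        model[c] = (log_prior, log_lik)
    return model


def predict(model, tokens):
    """Pick the class with the highest log-posterior; unseen terms are skipped."""
    def score(c):
        log_prior, log_lik = model[c]
        return log_prior + sum(log_lik[t] for t in tokens if t in log_lik)
    return max(sorted(model), key=score)


def evaluate(k):
    """Accuracy on TEST when only the top-k features are kept."""
    model = train_nb(TRAIN, select_features(TRAIN, k))
    return sum(predict(model, t) == y for t, y in TEST) / len(TEST)


if __name__ == "__main__":
    for k in (1, 4, 8, 15):
        print(k, evaluate(k))
```

Swapping `chi_square_scores` for another criterion (for example information gain or document frequency) while keeping the classifier fixed is one way to compare feature selection algorithms at a given feature count; accuracy here merely stands in for whichever measurement criteria an evaluation actually uses.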

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhu, D., Wong, K.W. (2014). Text Categorization Using an Automatically Generated Labelled Dataset: An Evaluation Study. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8834. Springer, Cham. https://doi.org/10.1007/978-3-319-12637-1_60

  • DOI: https://doi.org/10.1007/978-3-319-12637-1_60

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12636-4

  • Online ISBN: 978-3-319-12637-1

  • eBook Packages: Computer Science, Computer Science (R0)
