Text Categorization Using an Automatically Generated Labelled Dataset: An Evaluation Study

Zhu, Dengya; Wong, Kok Wai

doi:10.1007/978-3-319-12637-1_60

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8834)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Naïve Bayes (NB), kNN, and AdaBoost are three commonly used text classifiers. Evaluating these classifiers involves a variety of factors, including the benchmark used, the feature selection method, the parameter settings of the algorithms, and the measurement criteria employed. Researchers have demonstrated that some algorithms outperform others on some corpora; however, labelling and corpus bias remain two concerns in text categorization. This paper evaluates the three classifiers on an automatically generated text document set that is labelled by a group of experts to alleviate the subjectiveness of labelling, and at the same time examines how the performance of the algorithms is influenced by the feature selection algorithm and the number of features selected.
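
The kind of experiment the abstract describes, varying the number of selected features and measuring classifier accuracy, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' setup: the toy corpus, the choice of the chi-square statistic as the feature scorer, the Laplace-smoothed multinomial Naïve Bayes, and all function names are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict

# Toy labelled corpus (hypothetical data, not from the paper).
TRAIN = [
    ("cheap flights hotel deals", "travel"),
    ("hotel booking beach holiday", "travel"),
    ("flights to beach resort", "travel"),
    ("python code compiler bug", "computers"),
    ("compiler error in code", "computers"),
    ("python bug fix release", "computers"),
]
TEST = [
    ("beach hotel deals", "travel"),
    ("code compiler release", "computers"),
]


def chi_square_scores(docs):
    """Score each term by its maximum chi-square statistic over all classes."""
    n = len(docs)
    class_count = Counter(label for _, label in docs)
    df = Counter()                      # term -> document frequency
    df_in_class = defaultdict(Counter)  # term -> class -> document frequency
    for text, label in docs:
        for term in set(text.split()):
            df[term] += 1
            df_in_class[term][label] += 1
    scores = {}
    for term in df:
        best = 0.0
        for label, n_c in class_count.items():
            a = df_in_class[term][label]  # term present, document in class
            b = df[term] - a              # term present, document not in class
            c = n_c - a                   # term absent, document in class
            d = n - a - b - c             # term absent, document not in class
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def train_nb(docs, vocab):
    """Collect class and term counts for a multinomial Naive Bayes model."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for text, label in docs:
        class_docs[label] += 1
        for term in text.split():
            if term in vocab:
                term_counts[label][term] += 1
    return class_docs, term_counts


def predict_nb(model, vocab, text):
    """Pick the class with the highest Laplace-smoothed log posterior."""
    class_docs, term_counts = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        logp = math.log(class_docs[label] / total_docs)
        for term in text.split():
            if term in vocab:
                count = term_counts[label][term]
                logp += math.log((count + 1) / (total_terms + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def accuracy_for_k(k):
    """Select k features on TRAIN, fit Naive Bayes, and score it on TEST."""
    vocab = set(select_top_k(chi_square_scores(TRAIN), k))
    model = train_nb(TRAIN, vocab)
    hits = sum(predict_nb(model, vocab, text) == label for text, label in TEST)
    return hits / len(TEST)


for k in (2, 4, 7):
    print(k, accuracy_for_k(k))
```

On a corpus this small the accuracies carry no meaning; the point is only the shape of the loop, in which the feature scorer and the feature count k are the two knobs being varied while the classifier is held fixed.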

Author information

Authors and Affiliations

School of Information Systems, Curtin University, GPO Box U1987, Perth, Western Australia, Australia
Dengya Zhu
School of Engineering and Information Technology, Murdoch University, South Street, Murdoch, Western Australia, Australia, 6150
Kok Wai Wong

Editor information

Editors and Affiliations

Department of Artificial Intelligence, Faculty of Computer Science and Information Technology Building, University of Malaya, 50603, Kuala Lumpur, Malaysia
Chu Kiong Loo
Department of Electronics and Communication Engineering, College of Engineering, Jalan IKRAM-UNITEN, Universiti Tenaga Nasional, 43009, Kajang, Selangor, Malaysia
Keem Siah Yap
School of Engineering and Information Technology, Murdoch University, South St, 6150, Murdoch, Western Australia, Australia
Kok Wai Wong
Department of Electrical and Electronics Engineering, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, 120-749, Seoul, South Korea
Andrew Teoh
Department of Electrical and Electronic Engineering, Xi’an Jiaotong-Liverpool University, Ren’ai Road 111, SIP 215123, Suzhou, Jiangsu Province, China
Kaizhu Huang

About this paper

Cite this paper

Zhu, D., Wong, K.W. (2014). Text Categorization Using an Automatically Generated Labelled Dataset: An Evaluation Study. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8834. Springer, Cham. https://doi.org/10.1007/978-3-319-12637-1_60

DOI: https://doi.org/10.1007/978-3-319-12637-1_60
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12636-4
Online ISBN: 978-3-319-12637-1
eBook Packages: Computer Science, Computer Science (R0)
