Feature Selection Based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification Using Naïve Bayes

  • Conference paper
Human-Inspired Computing and Its Applications (MICAI 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8856)

Abstract

Automatic classification of text into predefined categories is an increasingly important task given the vast number of electronic documents available on the Internet and on enterprise servers. Successful text classification relies heavily on dimensionality reduction, which aims to improve classification accuracy, make the classification process more expressive, and improve its computational efficiency. This paper presents two feature selection algorithms, based on sampling and weighted sampling respectively, both of which build on the C4.5 algorithm. Classification is performed with the Naïve Bayes model in the reduced-dimensionality space. The results show considerable improvements in classification accuracy, of up to 10%, compared to traditional algorithms such as C4.5, Naïve Bayes and Support Vector Machines. Experiments were carried out on data sets based on the Reuters-21578 collection.
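
The abstract does not say how sampling and C4.5 are combined, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes that terms are scored with gain ratio (the splitting criterion used by C4.5) on uniform random samples of the training documents, that the best terms from each sample are pooled, and that a multinomial Naïve Bayes classifier is then trained on the pooled terms only. All function names, parameters, and the toy documents are hypothetical, and the weighted-sampling variant is not attempted.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(doc_sets, labels, term):
    """Gain ratio of the binary split 'term present / term absent'."""
    present = [y for d, y in zip(doc_sets, labels) if term in d]
    absent = [y for d, y in zip(doc_sets, labels) if term not in d]
    if not present or not absent:
        return 0.0  # the term does not split this sample at all
    n = len(labels)
    p, a = len(present) / n, len(absent) / n
    gain = entropy(labels) - p * entropy(present) - a * entropy(absent)
    split_info = -(p * math.log2(p) + a * math.log2(a))
    return gain / split_info


def select_features(docs, labels, rounds=10, sample_size=4, top_k=2, seed=0):
    """Pool the top_k gain-ratio terms found on each uniform random sample.

    Assumption: pooling per-sample winners stands in for whatever the paper
    does with the C4.5 trees; the abstract gives no details.
    """
    rng = random.Random(seed)
    selected = set()
    for _ in range(rounds):
        idx = rng.sample(range(len(docs)), sample_size)
        s_docs = [set(docs[i]) for i in idx]
        s_labels = [labels[i] for i in idx]
        vocab = sorted(set().union(*s_docs))
        scores = {t: gain_ratio(s_docs, s_labels, t) for t in vocab}
        ranked = sorted(vocab, key=lambda t: -scores[t])
        selected.update(t for t in ranked[:top_k] if scores[t] > 0)
    return selected


def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with Laplace smoothing, restricted to features."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(t for t in d if t in features)
    v = len(features)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        loglik[c] = {t: math.log((counts[c][t] + 1) / (total + v)) for t in features}
    return prior, loglik


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown terms are ignored."""
    prior, loglik = model

    def score(c):
        return prior[c] + sum(loglik[c][t] for t in doc if t in loglik[c])

    return max(prior, key=score)


# Toy documents (invented for illustration; not taken from Reuters-21578).
docs = [
    "wheat crop harvest export".split(),
    "wheat grain price export".split(),
    "crop grain harvest farm".split(),
    "oil crude barrel price".split(),
    "crude oil opec export".split(),
    "oil barrel opec price".split(),
]
labels = ["grain", "grain", "grain", "oil", "oil", "oil"]

features = select_features(docs, labels)
model = train_nb(docs, labels, features)
print(sorted(features))
print(predict_nb(model, "opec raises crude oil price".split()))
```

Scoring on small samples rather than on the full collection keeps each pass cheap, and pooling across samples lets terms that are discriminative only for part of the data still enter the reduced feature space. Whether this matches the paper's procedure cannot be determined from the abstract alone.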

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Molano, V., Cobos, C., Mendoza, M., Herrera-Viedma, E., Manic, M. (2014). Feature Selection Based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification Using Naïve Bayes. In: Gelbukh, A., Espinoza, F.C., Galicia-Haro, S.N. (eds) Human-Inspired Computing and Its Applications. MICAI 2014. Lecture Notes in Computer Science, vol 8856. Springer, Cham. https://doi.org/10.1007/978-3-319-13647-9_9

  • DOI: https://doi.org/10.1007/978-3-319-13647-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13646-2

  • Online ISBN: 978-3-319-13647-9

  • eBook Packages: Computer Science, Computer Science (R0)
