Abstract
Automatic classification of text into predefined categories is an increasingly important task, given the vast number of electronic documents available on the Internet and on enterprise servers. Successful text classification depends heavily on dimensionality reduction, which aims to improve classification accuracy, make the classification process more expressive, and improve computational efficiency. In this paper, two feature selection algorithms are presented, one based on sampling and one on weighted sampling, both building on the C4.5 algorithm. Classification is performed with the Naïve Bayes model in the space of reduced dimensionality. The results show improvements in classification accuracy of up to 10% compared with traditional algorithms such as C4.5, Naïve Bayes and Support Vector Machines. Experiments were carried out using data sets based on the Reuters-21578 collection.
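The abstract does not spell out the two algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes documents are represented as sets of terms, scores each term with the gain ratio criterion that C4.5 uses to choose splits, and repeats this over uniformly drawn random samples of the training set, counting how often each term ranks among the best. The function names, the voting scheme, and the parameters `n_samples`, `sample_size` and `top_k` are all hypothetical; weighted sampling is not covered.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(docs, labels, term):
    """C4.5-style gain ratio for the binary test 'term occurs in the document'."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    if not present or not absent:
        return 0.0  # the term does not split this sample at all
    n = len(labels)
    p, a = len(present) / n, len(absent) / n
    info_gain = entropy(labels) - p * entropy(present) - a * entropy(absent)
    split_info = -(p * math.log2(p) + a * math.log2(a))
    return info_gain / split_info


def select_features(docs, labels, n_samples=20, sample_size=None, top_k=2, seed=0):
    """Assumed scheme: each random sample votes for its top_k terms by gain ratio."""
    rng = random.Random(seed)
    sample_size = sample_size or len(docs)
    vocab = sorted(set().union(*docs))
    votes = Counter()
    for _ in range(n_samples):
        # Uniform sampling with replacement from the training documents.
        idx = [rng.randrange(len(docs)) for _ in range(sample_size)]
        s_docs = [docs[i] for i in idx]
        s_labels = [labels[i] for i in idx]
        ranked = sorted(vocab, key=lambda t: gain_ratio(s_docs, s_labels, t), reverse=True)
        votes.update(ranked[:top_k])
    return votes


# Toy data, invented for illustration only.
docs = [
    {"oil", "price", "the"},
    {"oil", "barrel", "the"},
    {"wheat", "crop", "the"},
    {"wheat", "harvest", "the"},
]
labels = ["crude", "crude", "grain", "grain"]
votes = select_features(docs, labels)
```

Terms that collect many votes would form the reduced feature space passed to a Naïve Bayes classifier. A term such as "the", which occurs in every document, has a gain ratio of zero on the full toy set because it never splits the data.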
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Molano, V., Cobos, C., Mendoza, M., Herrera-Viedma, E., Manic, M. (2014). Feature Selection Based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification Using Naïve Bayes. In: Gelbukh, A., Espinoza, F.C., Galicia-Haro, S.N. (eds) Human-Inspired Computing and Its Applications. MICAI 2014. Lecture Notes in Computer Science, vol 8856. Springer, Cham. https://doi.org/10.1007/978-3-319-13647-9_9
DOI: https://doi.org/10.1007/978-3-319-13647-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13646-2
Online ISBN: 978-3-319-13647-9
eBook Packages: Computer Science (R0)