Feature Selection Based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification Using Naïve Bayes

Molano, Viviana; Cobos, Carlos; Mendoza, Martha; Herrera-Viedma, Enrique; Manic, Milos

doi:10.1007/978-3-319-13647-9_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8856)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Automatic text classification into predefined categories is an increasingly important task given the vast number of electronic documents available on the Internet and enterprise servers. Successful text classification relies heavily on dimensionality reduction, which aims to improve classification accuracy, give greater expression to the classification process, and improve the computational efficiency of classification. In this paper, two feature selection algorithms are presented, one based on sampling and one on weighted sampling, both building on the C4.5 algorithm. The results demonstrate considerable improvements in classification accuracy (up to 10%) compared to traditional algorithms such as C4.5, Naïve Bayes and Support Vector Machines. The classification process is performed using the Naïve Bayes model in the space of reduced dimensionality. Experiments were carried out using data sets based on the Reuters-21578 collection.
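
The abstract names the ingredients (sampling, C4.5-based feature selection, Naïve Bayes in the reduced space) but not the details of the two algorithms. The following is therefore only a minimal sketch of the general idea, not the authors' method. It assumes uniform sampling, binary term-presence features, a small greedy tree that splits on gain ratio (the C4.5 criterion) without pruning, and a Bernoulli Naïve Bayes classifier; all function names, parameters, and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(docs, labels, term):
    """Gain ratio of a binary split on the presence of `term`."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    if not with_t or not without:
        return 0.0  # the term does not split this sample
    p = len(with_t) / len(labels)
    gain = entropy(labels) - p * entropy(with_t) - (1 - p) * entropy(without)
    split_info = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return gain / split_info

def tree_terms(docs, labels, depth=3):
    """Terms used as split nodes by a greedy, depth-limited tree (assumption: no pruning)."""
    if depth == 0 or len(set(labels)) < 2:
        return set()
    vocab = sorted(set().union(*docs))
    best = max(vocab, key=lambda t: gain_ratio(docs, labels, t))
    if gain_ratio(docs, labels, best) <= 0:
        return set()
    terms = {best}
    for keep in (True, False):
        part = [(d, y) for d, y in zip(docs, labels) if (best in d) == keep]
        terms |= tree_terms([d for d, _ in part], [y for _, y in part], depth - 1)
    return terms

def select_features(docs, labels, rounds=10, sample_size=4, seed=0):
    """Union of the split terms of trees built on random samples (assumed scheme)."""
    rng = random.Random(seed)
    selected = set()
    for _ in range(rounds):
        idx = rng.sample(range(len(docs)), sample_size)
        selected |= tree_terms([docs[i] for i in idx], [labels[i] for i in idx])
    return selected

def train_nb(docs, labels, features):
    """Bernoulli Naive Bayes over `features` only, with Laplace smoothing."""
    model = {}
    for c, n_c in Counter(labels).items():
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        probs = {t: (sum(t in d for d in class_docs) + 1) / (n_c + 2) for t in features}
        model[c] = (math.log(n_c / len(docs)), probs)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior for a set of terms."""
    def score(c):
        prior, probs = model[c]
        return prior + sum(math.log(p if t in doc else 1 - p) for t, p in probs.items())
    return max(model, key=score)

# Toy corpus (hypothetical): each document is a set of terms.
docs = [
    {"oil", "price", "barrel"}, {"oil", "crude", "export"}, {"oil", "barrel", "opec"},
    {"wheat", "crop", "export"}, {"wheat", "grain", "price"}, {"grain", "crop", "harvest"},
]
labels = ["crude", "crude", "crude", "grain", "grain", "grain"]

features = select_features(docs, labels)
model = train_nb(docs, labels, features)
print(sorted(features), predict_nb(model, {"oil", "crude", "barrel", "opec"}))
```

In this sketch the reduced feature space is the union of the terms that the sampled trees split on, so terms that never discriminate between classes in any sample are dropped before Naïve Bayes is trained. A weighted variant could, for example, bias the sampling toward particular documents, but the abstract does not say how the weights are defined.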

Author information

Authors and Affiliations

Computer Science Department, University of Cauca, Colombia
Viviana Molano, Carlos Cobos & Martha Mendoza
Department of Computer Science and Artificial Intelligence, University of Granada, Spain
Enrique Herrera-Viedma
Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
Enrique Herrera-Viedma
School of Engineering East Hall, Virginia Commonwealth University, Virginia, U.S.A.
Milos Manic

Editor information

Editors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan Dios Bátiz s/n, Col. Nueva Industrial Vallejo, 07738, Mexico City, Mexico
Alexander Gelbukh
Área Académica de Computación y Electrónica, Carretera Pachuca-Tulancingo, Universidad Autónoma del Estado de Hidalgo, Km. 4.5, Col. Carboneras, Mineral de la Reforma, 42180, Hidalgo, Mexico
Félix Castro Espinoza
Facultad de ciencias, Universidad Autónoma Nacional de México, Ciudad Universitaria, México DF, Mexico
Sofía N. Galicia-Haro

Cite this paper

Molano, V., Cobos, C., Mendoza, M., Herrera-Viedma, E., Manic, M. (2014). Feature Selection Based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification Using Naïve Bayes. In: Gelbukh, A., Espinoza, F.C., Galicia-Haro, S.N. (eds) Human-Inspired Computing and Its Applications. MICAI 2014. Lecture Notes in Computer Science, vol 8856. Springer, Cham. https://doi.org/10.1007/978-3-319-13647-9_9

DOI: https://doi.org/10.1007/978-3-319-13647-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13646-2
Online ISBN: 978-3-319-13647-9
eBook Packages: Computer Science, Computer Science (R0)