Efficient Feature Selection Based on Modified Cuckoo Search Optimization Problem for Classifying Web Text Documents

Dhar, Ankita; Dash, Niladri Sekhar; Roy, Kaushik

doi:10.1007/978-981-13-9187-3_57

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1037)

Included in the following conference series:

International Conference on Recent Trends in Image Processing and Pattern Recognition

Abstract

The volume of information on the web keeps growing, in varying dimensions, and because it contains redundant and irrelevant terms, users find it increasingly difficult to filter and analyse efficiently. Managing, filtering and organizing such huge datasets requires the classification of text documents. Text classification is the process of assigning text documents to predefined categories based on their content. The aim of this paper is to explore Cuckoo search optimization (CSO), a method derived from the behaviour of cuckoo birds, and to modify it for the selection of relevant features. The revised algorithm, named the modified Cuckoo search (MCS) optimization algorithm, can prove useful for developing an efficient text classification system. The proposed method combines the search ability of MCS with the Naive Bayes Multinomial (NBM) algorithm to generate a proper feature set, which increases the success rate. The approach is tested on 9000 text documents covering eight different domains, fetched from several web sources, and obtains encouraging results. A comparison with other well-known text classification approaches shows the effectiveness of the proposed approach as an automatic Bangla text classification system.
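The abstract does not say how the authors modify Cuckoo search, so the following is only a minimal sketch of the general idea it describes: a plain binary Cuckoo search whose fitness is the validation accuracy of a multinomial Naive Bayes classifier trained on the selected terms. Everything in it is an assumption made for illustration: the function names, the toy English documents, the small penalty on feature count, and the use of a Pareto-distributed number of bit flips as a stand-in for a Levy flight. It is not the MCS algorithm of the paper.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(docs, labels, features):
    """Collect multinomial Naive Bayes counts, restricted to the selected features."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in features)
    return class_docs, term_counts, features


def predict_nb(model, tokens):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_docs, term_counts, features = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):  # sorted so that ties break deterministically
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for t in tokens:
            if t in features:
                score += math.log((term_counts[label][t] + 1) / (total_terms + len(features)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def fitness(mask, vocab, train, valid):
    """Validation accuracy minus a small (assumed) penalty on the share of kept terms."""
    features = {t for t, keep in zip(vocab, mask) if keep}
    if not features:
        return 0.0
    model = train_nb(train[0], train[1], features)
    hits = sum(predict_nb(model, d) == y for d, y in zip(*valid))
    return hits / len(valid[1]) - 0.01 * len(features) / len(vocab)


def cuckoo_select(vocab, train, valid, n_nests=8, n_iter=30, pa=0.25, seed=0):
    """Plain binary Cuckoo search over feature masks (one nest = one mask)."""
    rng = random.Random(seed)
    score = lambda m: fitness(m, vocab, train, valid)
    nests = [[rng.random() < 0.5 for _ in vocab] for _ in range(n_nests)]
    scores = [score(m) for m in nests]
    for _ in range(n_iter):
        # Lay an egg: copy a random nest and flip a heavy-tailed number of bits
        # (simplified stand-in for a Levy flight).
        egg = list(nests[rng.randrange(n_nests)])
        flips = min(len(vocab), int(rng.paretovariate(1.5)))
        for k in rng.sample(range(len(vocab)), flips):
            egg[k] = not egg[k]
        # The egg replaces a randomly chosen nest only if it scores higher.
        j, s = rng.randrange(n_nests), score(egg)
        if s > scores[j]:
            nests[j], scores[j] = egg, s
        # Abandon the worst fraction pa of nests and rebuild them at random.
        worst = sorted(range(n_nests), key=scores.__getitem__)[: int(pa * n_nests)]
        for k in worst:
            nests[k] = [rng.random() < 0.5 for _ in vocab]
            scores[k] = score(nests[k])
    best = max(range(n_nests), key=scores.__getitem__)
    return {t for t, keep in zip(vocab, nests[best]) if keep}, scores[best]


# Toy, hypothetical data: two categories with a few uninformative words mixed in.
train_docs = [
    ["team", "goal", "match", "the", "today"],
    ["goal", "player", "match", "and", "the"],
    ["vote", "party", "election", "the", "today"],
    ["election", "minister", "vote", "and", "the"],
]
train_labels = ["sports", "sports", "politics", "politics"]
valid_docs = [["match", "goal", "today"], ["vote", "election", "and"]]
valid_labels = ["sports", "politics"]

vocab = sorted({t for d in train_docs for t in d})
selected, best_score = cuckoo_select(
    vocab, (train_docs, train_labels), (valid_docs, valid_labels)
)
print(sorted(selected), round(best_score, 3))
```

Because the classifier is retrained for every candidate mask, this is a wrapper-style selector; on a corpus of the size mentioned in the abstract, the fitness evaluations would dominate the cost, which is one reason a cheaper classifier such as Naive Bayes is a natural pairing.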

References

1. Al-Radaideh, Q.A., Al-Khateeb, S.S.: An associative rule-based classifier for Arabic medical text. Int. J. Knowl. Eng. Data Min. 03, 255–273 (2015)
2. Aly, W., Kelleny, H.A.: Adaptation of Cuckoo search for documents clustering. Int. J. Comput. Appl. Technol. 86, 4–10 (2014)
3. ArunaDevi, K., Saveeth, R.: A novel approach on Tamil text classification using C-Feature. Int. J. Sci. Res. Dev. 2, 343–345 (2014)
4. Bolaj, P., Govilkar, S.: Text classification for Marathi documents using supervised learning methods. Int. J. Comput. Appl. 155, 6–10 (2016)
5. Bouguelia, M.R., Nowaczyk, S., Santosh, K.C., Verikas, A.: Agreeing to disagree: active learning with noisy labels without crowdsourcing. Int. J. Mach. Learn. Cybern. 9, 1307–1319 (2018)
6. DeySarkar, S., Goswami, S., Agarwal, A., Akhtar, J.: A novel feature selection technique for text classification using Naive Bayes. Int. Sch. Res. Not. 2014, 10 (2014)
7. Dhar, A., Dash, N.S., Roy, K.: Categorization of Bangla web text documents based on TF-IDF-ICF text analysis scheme. In: Mandal, J.K., Sinha, D. (eds.) CSI 2018. CCIS, vol. 836, pp. 477–484. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-1343-1_39
8. Gupta, N., Gupta, V.: Punjabi text classification using Naive Bayes, centroid and hybrid approach. In: Proceedings of the 3rd Workshop on South and South East Asian Natural Language Processing, pp. 109–122 (2012)
9. Guru, D.S., Suhil, M.: A novel term_class relevance measure for text categorization. In: Proceedings of International Conference on Advanced Computing Technologies and Applications, pp. 13–22 (2015)
10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11, 10–18 (2009)
11. Islam, Md.S., Jubayer, F.E.Md., Ahmed, S.I.: A support vector machine mixed with TF-IDF algorithm to categorize Bengali document. In: Proceedings of International Conference on Electrical, Computer and Communication Engineering, pp. 191–196 (2017)
12. Jin, P., Zhang, Y., Chen, X., Xia, Y.: Bag-of-embeddings for text classification. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 2824–2830 (2016)
13. Kabir, F., Siddique, S., Kotwal, M.R.A., Huda, M.N.: Bangla text document categorization using stochastic gradient descent (SGD) classifier. In: Proceedings of International Conference on Cognitive Computing and Information Processing, pp. 1–4 (2015)
14. Kim, S., Han, K., Rim, H., Myaeng, S.: Some effective techniques for Naive Bayes text classification. IEEE Trans. Knowl. Data Eng. 18, 1457–1466 (2006)
15. Mandal, A.K., Sen, R.: Supervised learning methods for Bangla web document categorization. Int. J. Artif. Intell. Appl. 05, 93–105 (2014)
16. Mansur, M., UzZaman, N., Khan, M.: Analysis of N-gram based text categorization for Bangla in a Newspaper Corpus. In: Proceedings of International Conference on Computer and Information Technology, p. 08 (2006)
17. Rautray, R., Balabantaray, R.C.: CSTS: cuckoo search based model for text summarization. In: Dash, S.S., Vijayakumar, K., Panigrahi, B.K., Das, S. (eds.) Artificial Intelligence and Evolutionary Computations in Engineering Systems. AISC, vol. 517, pp. 141–150. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-3174-8_13
18. Redmond, M., Salesi, S., Cosma, G.: A novel approach based on an extended cuckoo search algorithm for the classification of tweets which contain Emoticon and Emoji. In: Proceedings of IEEE International Conference on Knowledge Engineering and Applications, pp. 13–19 (2017)
19. Sujana, T.S., Rao, N.M.S., Reddy, R.S.: An efficient feature selection using parallel cuckoo search and Naive Bayes classifier. In: Proceedings of IEEE International Conference on Networks & Advances in Computational Technologies, pp. 167–172 (2017)
20. Vajda, S., Santosh, K.C.: A fast k-nearest neighbor classifier using unsupervised clustering. In: Santosh, K.C., Hangarge, M., Bevilacqua, V., Negi, A. (eds.) RTIP2R 2016. CCIS, vol. 709, pp. 185–193. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-4859-3_17
21. Wang, D., Zhang, H., Liu, R., Lv, W.: Feature selection based on term frequency and T-Test for text categorization. In: Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 1482–1486 (2012)
22. Wilbur, W.J., Kim, W.: The ineffectiveness of within-document term frequency in text classification. Inf. Retrieval 12, 509–525 (2009)
23. Yang, X.S., Deb, S.: Cuckoo search via Levy flights. World Congress on Nature & Biologically Inspired Computing, pp. 210–214 (2009)

Acknowledgement

One of the authors thanks DST for the INSPIRE fellowship and also acknowledges the various sources listed in [7], from which the data were collected.

Author information

Authors and Affiliations

Department of Computer Science, West Bengal State University, Kolkata, India
Ankita Dhar & Kaushik Roy
Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
Niladri Sekhar Dash

Corresponding author

Correspondence to Ankita Dhar.

Editor information

Editors and Affiliations

Department of Computer Science, University of South Dakota, Vermillion, SD, USA
K. C. Santosh
Solapur University, Solapur, India
Ravindra S. Hegadi

Cite this paper

Dhar, A., Dash, N.S., Roy, K. (2019). Efficient Feature Selection Based on Modified Cuckoo Search Optimization Problem for Classifying Web Text Documents. In: Santosh, K., Hegadi, R. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2018. Communications in Computer and Information Science, vol 1037. Springer, Singapore. https://doi.org/10.1007/978-981-13-9187-3_57

DOI: https://doi.org/10.1007/978-981-13-9187-3_57
Published: 17 July 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9186-6
Online ISBN: 978-981-13-9187-3
eBook Packages: Computer Science, Computer Science (R0)
