Efficient Feature Selection Based on Modified Cuckoo Search Optimization Problem for Classifying Web Text Documents

  • Conference paper
Recent Trends in Image Processing and Pattern Recognition (RTIP2R 2018)

Abstract

The volume of information on the web keeps growing, with varying dimensionality, and because it contains redundant and irrelevant terms it is becoming difficult for users to filter and analyse efficiently. Managing, filtering and organizing such large datasets requires text document classification, the process of assigning text documents to predefined categories based on their content. This paper explores Cuckoo search optimization (CSO), which is inspired by the behaviour of cuckoo birds, and modifies the algorithm to select relevant features. The revised algorithm, named modified Cuckoo search (MCS) optimization, can prove useful for developing an efficient text classification system. The proposed method combines MCS with the Naive Bayes Multinomial (NBM) algorithm to generate a suitable feature set, which increases the success rate. The approach is tested on 9000 text documents covering eight domains, collected from several web sources, and obtains encouraging results. Comparison with other well-known text classification approaches shows the effectiveness of the proposed approach as an automatic Bangla text classification system.
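
The abstract does not say how the cuckoo search was modified, so the following is only a minimal sketch of the general idea: a generic binary cuckoo search (Lévy-flight random walk, nest abandonment) that picks a feature subset, scored by a small multinomial Naive Bayes classifier. It is not the authors' MCS. All function names, the fixed threshold used to binarize positions, the subset-size penalty and the parameter defaults are assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(docs, labels, features):
    """Fit multinomial Naive Bayes (Laplace smoothing) on the chosen features only."""
    class_docs = Counter(labels)
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(t for t in tokens if t in features)
    model = {}
    for c, n_docs in class_docs.items():
        total = sum(counts[c].values()) + len(features)
        log_like = {f: math.log((counts[c][f] + 1) / total) for f in features}
        model[c] = (math.log(n_docs / len(docs)), log_like)
    return model


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior; unselected tokens are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=score)


def fitness(mask, vocab, train, valid, penalty=0.01):
    """Validation accuracy minus a small (assumed) penalty on the subset size."""
    features = {w for w, bit in zip(vocab, mask) if bit}
    if not features:
        return -1.0  # an empty subset cannot classify anything
    model = train_nb(train[0], train[1], features)
    hits = sum(predict_nb(model, d) == y for d, y in zip(valid[0], valid[1]))
    return hits / len(valid[1]) - penalty * len(features) / len(vocab)


def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length using Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u, v = rng.gauss(0, sigma), rng.gauss(0, 1)
    return u / max(abs(v), 1e-12) ** (1 / beta)


def cuckoo_select(vocab, train, valid, n_nests=10, n_iter=30, pa=0.25, step=0.5, seed=0):
    """Generic binary cuckoo search; returns (selected features, best fitness)."""
    rng = random.Random(seed)
    dim = len(vocab)

    def to_mask(x):
        # Simplification: keep a feature when sigmoid(x) > 0.5, i.e. when x > 0.
        return [v > 0 for v in x]

    def score(x):
        return fitness(to_mask(x), vocab, train, valid)

    def random_nest():
        return [rng.uniform(-1, 1) for _ in range(dim)]

    nests = [random_nest() for _ in range(n_nests)]
    scores = [score(x) for x in nests]
    for _ in range(n_iter):
        for i in range(n_nests):
            # Lay a new egg by a Levy flight from nest i, then challenge a random nest.
            cand = [v + step * levy_step(rng) for v in nests[i]]
            s, j = score(cand), rng.randrange(n_nests)
            if s > scores[j]:
                nests[j], scores[j] = cand, s
        # Abandon the worst fraction pa of nests and rebuild them at random.
        worst_first = sorted(range(n_nests), key=scores.__getitem__)
        for i in worst_first[:int(pa * n_nests)]:
            nests[i] = random_nest()
            scores[i] = score(nests[i])
    best = max(range(n_nests), key=scores.__getitem__)
    return [w for w, bit in zip(vocab, to_mask(nests[best])) if bit], scores[best]
```

Here `train` and `valid` are each a pair `(docs, labels)`, where every document is a list of tokens and `vocab` is the list of candidate terms. The size penalty is one common way to favour compact feature sets; whether the paper uses anything similar is not stated in the abstract.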

References

  1. Al-Radaideh, Q.A., Al-Khateeb, S.S.: An associative rule-based classifier for Arabic medical text. Int. J. Knowl. Eng. Data Min. 03, 255–273 (2015)

  2. Aly, W., Kelleny, H.A.: Adaptation of Cuckoo search for documents clustering. Int. J. Comput. Appl. Technol. 86, 4–10 (2014)

  3. ArunaDevi, K., Saveeth, R.: A novel approach on Tamil text classification using C-Feature. Int. J. Sci. Res. Dev. 2, 343–345 (2014)

  4. Bolaj, P., Govilkar, S.: Text classification for Marathi documents using supervised learning methods. Int. J. Comput. Appl. 155, 6–10 (2016)

  5. Bouguelia, M.R., Nowaczyk, S., Santosh, K.C., Verikas, A.: Agreeing to disagree: active learning with noisy labels without crowdsourcing. Int. J. Mach. Learn. Cybern. 9, 1307–1319 (2018)

  6. DeySarkar, S., Goswami, S., Agarwal, A., Akhtar, J.: A novel feature selection technique for text classification using Naive Bayes. Int. Sch. Res. Not. 2014, 10 (2014)

  7. Dhar, A., Dash, N.S., Roy, K.: Categorization of Bangla web text documents based on TF-IDF-ICF text analysis scheme. In: Mandal, J.K., Sinha, D. (eds.) CSI 2018. CCIS, vol. 836, pp. 477–484. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-1343-1_39

  8. Gupta, N., Gupta, V.: Punjabi text classification using Naive Bayes, centroid and hybrid approach. In: Proceedings of the 3rd Workshop on South and South East Asian Natural Language Processing, pp. 109–122 (2012)

  9. Guru, D.S., Suhil, M.: A novel term_class relevance measure for text categorization. In: Proceedings of International Conference on Advanced Computing Technologies and Applications, pp. 13–22 (2015)

  10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11, 10–18 (2009)

  11. Islam, Md.S., Jubayer, F.E.Md., Ahmed, S.I.: A support vector machine mixed with TF-IDF algorithm to categorize Bengali document. In: Proceedings of International Conference on Electrical, Computer and Communication Engineering, pp. 191–196 (2017)

  12. Jin, P., Zhang, Y., Chen, X., Xia, Y.: Bag-of-embeddings for text classification. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 2824–2830 (2016)

  13. Kabir, F., Siddique, S., Kotwal, M.R.A., Huda, M.N.: Bangla text document categorization using stochastic gradient descent (SGD) classifier. In: Proceedings of International Conference on Cognitive Computing and Information Processing, pp. 1–4 (2015)

  14. Kim, S., Han, K., Rim, H., Myaeng, S.: Some effective techniques for Naive Bayes text classification. IEEE Trans. Knowl. Data Eng. 18, 1457–1466 (2006)

  15. Mandal, A.K., Sen, R.: Supervised learning methods for Bangla web document categorization. Int. J. Artif. Intell. Appl. 05, 93–105 (2014)

  16. Mansur, M., UzZaman, N., Khan, M.: Analysis of N-gram based text categorization for Bangla in a Newspaper Corpus. In: Proceedings of International Conference on Computer and Information Technology, p. 08 (2006)

  17. Rautray, R., Balabantaray, R.C.: CSTS: cuckoo search based model for text summarization. In: Dash, S.S., Vijayakumar, K., Panigrahi, B.K., Das, S. (eds.) Artificial Intelligence and Evolutionary Computations in Engineering Systems. AISC, vol. 517, pp. 141–150. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-3174-8_13

  18. Redmond, M., Salesi, S., Cosma, G.: A novel approach based on an extended cuckoo search algorithm for the classification of tweets which contain Emoticon and Emoji. In: Proceedings of IEEE International Conference on Knowledge Engineering and Applications, pp. 13–19 (2017)

  19. Sujana, T.S., Rao, N.M.S., Reddy, R.S.: An efficient feature selection using parallel cuckoo search and Naive Bayes classifier. In: Proceedings of IEEE International Conference on Networks & Advances in Computational Technologies, pp. 167–172 (2017)

  20. Vajda, S., Santosh, K.C.: A fast k-nearest neighbor classifier using unsupervised clustering. In: Santosh, K.C., Hangarge, M., Bevilacqua, V., Negi, A. (eds.) RTIP2R 2016. CCIS, vol. 709, pp. 185–193. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-4859-3_17

  21. Wang, D., Zhang, H., Liu, R., Lv, W.: Feature selection based on term frequency and T-Test for text categorization. In: Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 1482–1486 (2012)

  22. Wilbur, W.J., Kim, W.: The ineffectiveness of within-document term frequency in text classification. Inf. Retrieval 12, 509–525 (2009)

  23. Yang, X.S., Deb, S.: Cuckoo search via Lévy flights. In: World Congress on Nature & Biologically Inspired Computing, pp. 210–214 (2009)


Acknowledgement

One of the authors thanks DST for the INSPIRE fellowship and also acknowledges the various links provided in [7], from which the data were collected.

Author information

Corresponding author

Correspondence to Ankita Dhar.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Dhar, A., Dash, N.S., Roy, K. (2019). Efficient Feature Selection Based on Modified Cuckoo Search Optimization Problem for Classifying Web Text Documents. In: Santosh, K., Hegadi, R. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2018. Communications in Computer and Information Science, vol 1037. Springer, Singapore. https://doi.org/10.1007/978-981-13-9187-3_57

  • DOI: https://doi.org/10.1007/978-981-13-9187-3_57

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-9186-6

  • Online ISBN: 978-981-13-9187-3

  • eBook Packages: Computer Science, Computer Science (R0)
