Application of TF-IDF Feature for Categorizing Documents of Online Bangla Web Text Corpus

  • Ankita Dhar
  • Niladri Sekhar Dash
  • Kaushik Roy
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 695)


This paper explores the use of standard features together with machine learning approaches for categorizing Bangla text documents drawn from an online Web corpus. The TF-IDF feature, combined with a dimensionality reduction technique (40% of TF), is used to bring precision to the lexical matching that identifies the domain category, or class, of a text document. The approach rests on the general observation that text categorization (or text classification) is the task of automatically sorting a set of text documents into predefined categories. Although a wide range of methods has been applied to English texts, only limited studies have been carried out on texts in Indian languages, including Bangla. Hence, an attempt is made here to analyze how efficient the above categorization method is for Bangla text documents. For verification and validation, Bangla text documents obtained from various online Web sources are normalized and used as inputs for the experiment. The experimental results show that the feature extraction method, together with the LIBLINEAR classification model, delivers quite satisfactory performance on high-dimensional feature sets and relatively noisy document feature vectors.
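As a rough illustration of the feature extraction described above, the following Python sketch computes TF-IDF vectors over a vocabulary pruned by term frequency. It is not the authors' implementation. In particular, reading "40% of TF" as "keep the top 40% of terms ranked by total corpus frequency" is an assumption, as are the function names `select_vocabulary` and `tfidf_vectors`, the plain `log(N / df)` form of IDF, and the toy English tokens standing in for normalized Bangla tokens.

```python
import math
from collections import Counter


def select_vocabulary(docs, keep_ratio=0.4):
    """Keep the top `keep_ratio` share of terms, ranked by total corpus frequency.

    Assumption: this is one possible reading of "40% of TF" in the abstract.
    """
    totals = Counter(term for doc in docs for term in doc)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    keep = max(1, math.ceil(len(ranked) * keep_ratio))
    return ranked[:keep]


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document, with columns ordered as `vocabulary`."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocabulary}
    # Plain IDF without smoothing; every vocabulary term occurs in at least one document.
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / length * idf[t] for t in vocabulary])
    return vectors


# Toy tokenized documents (hypothetical placeholders for normalized Bangla text).
docs = [
    ["cricket", "match", "team", "cricket"],
    ["election", "vote", "party", "election"],
    ["cricket", "team", "score"],
]

vocabulary = select_vocabulary(docs, keep_ratio=0.4)
vectors = tfidf_vectors(docs, vocabulary)
```

The resulting vectors could then be passed to a linear classifier such as LIBLINEAR. That step is omitted here because the abstract does not state the training configuration.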


Bangla text classification; Term frequency; Inverse document frequency; LIBLINEAR; Corpus



One of the authors would like to thank the Department of Science and Technology (DST) for support in the form of an INSPIRE fellowship.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Ankita Dhar (1)
  • Niladri Sekhar Dash (2)
  • Kaushik Roy (1)
  1. Department of Computer Science, West Bengal State University, Kolkata, India
  2. Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
