Application of TF-IDF Feature for Categorizing Documents of Online Bangla Web Text Corpus

Dhar, Ankita; Dash, Niladri Sekhar; Roy, Kaushik

doi:10.1007/978-981-10-7566-7_6

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 695)

Abstract

This paper explores the use of standard features and machine learning approaches for categorizing Bangla text documents drawn from an online Web corpus. The TF-IDF feature, combined with a dimensionality reduction technique (40% of TF), is used to improve the precision of the lexical matching that identifies the domain category, or class, of a text document. The approach rests on the general observation that text categorization (or text classification) is the task of automatically sorting a set of text documents into predefined categories. Although a wide range of methods has been applied to English texts, only limited studies have been carried out on Indian-language texts, including Bangla. Hence, an attempt is made here to analyze how efficiently the above categorization method performs on Bangla text documents. For verification and validation, Bangla text documents obtained from various online Web sources are normalized and used as inputs for the experiment. The experimental results show that the feature extraction method, together with the LIBLINEAR classification model, performs quite satisfactorily on high-dimensional feature sets and relatively noisy document feature vectors.
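
The pipeline described in the abstract (TF-IDF weighting over a vocabulary pruned by term frequency, followed by a classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, the toy documents are invented, "40% of TF" is read as keeping the top 40% of terms ranked by corpus term frequency, the IDF uses a common smoothed variant, and a simple nearest-centroid classifier stands in for the LIBLINEAR model so that the example runs with the Python standard library alone.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization only; a real Bangla pipeline would also
    # normalize the text first (details are not given in the abstract).
    return text.split()

def select_vocabulary(docs_tokens, keep_fraction=0.4):
    # ASSUMPTION: "40% of TF" is read as keeping the top 40% of terms
    # ranked by total term frequency in the corpus.
    tf_total = Counter(tok for toks in docs_tokens for tok in toks)
    ranked = sorted(tf_total, key=lambda t: (-tf_total[t], t))
    keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:keep]

def fit_idf(docs_tokens, vocab):
    # ASSUMPTION: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    n_docs = len(docs_tokens)
    vocab_set = set(vocab)
    df = Counter()
    for toks in docs_tokens:
        df.update(set(toks) & vocab_set)
    return {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

def tfidf_vector(tokens, idf):
    # Sparse, L2-normalized TF-IDF vector restricted to the kept vocabulary.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}

def cosine(a, b):
    # Both inputs are unit length, so the dot product is the cosine.
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def train_centroids(vectors, labels):
    # Stand-in classifier (NOT LIBLINEAR): one normalized centroid per class.
    sums = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, Counter())
        for t, v in vec.items():
            acc[t] += v
    centroids = {}
    for label, acc in sums.items():
        norm = math.sqrt(sum(v * v for v in acc.values())) or 1.0
        centroids[label] = {t: v / norm for t, v in acc.items()}
    return centroids

def predict(vec, centroids):
    # Pick the class whose centroid is most similar to the document vector.
    return max(sorted(centroids), key=lambda lab: cosine(vec, centroids[lab]))

# Invented toy corpus, for illustration only.
train = [
    ("ক্রিকেট খেলা দল রান", "sports"),
    ("ফুটবল খেলা দল গোল", "sports"),
    ("বাজার টাকা ব্যাংক সুদ", "business"),
    ("বাজার ব্যাংক টাকা ঋণ", "business"),
]
docs_tokens = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]

vocab = select_vocabulary(docs_tokens, keep_fraction=0.4)
idf = fit_idf(docs_tokens, vocab)
centroids = train_centroids([tfidf_vector(t, idf) for t in docs_tokens], labels)

print(predict(tfidf_vector(tokenize("দল খেলা রান"), idf), centroids))
```

In this toy corpus the pruning step keeps only the five terms that occur twice, which shows how frequency-based reduction shrinks the feature space before weighting. Swapping the centroid classifier for a linear model trained with LIBLINEAR would be the natural next step toward the setup the abstract describes.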

Acknowledgements

One of the authors would like to thank the Department of Science and Technology (DST) for support in the form of an INSPIRE fellowship.

Author information

Authors and Affiliations

Department of Computer Science, West Bengal State University, Kolkata, India
Ankita Dhar & Kaushik Roy
Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
Niladri Sekhar Dash

Corresponding author

Correspondence to Ankita Dhar.

Editor information

Editors and Affiliations

Department of Electronics and Communication Engineering, SRMGPC, Lucknow, Uttar Pradesh, India
Vikrant Bhateja
Departamento de Computación, CINVESTAV-IPN, Mexico City, Mexico
Carlos A. Coello Coello
Department of Computer Science and Engineering, PVP Siddhartha Institute of Technology, Vijayawada, Andhra Pradesh, India
Suresh Chandra Satapathy
School of Computer Engineering, KIIT University, Bhubaneswar, Odisha, India
Prasant Kumar Pattnaik

About this paper

Cite this paper

Dhar, A., Dash, N.S., Roy, K. (2018). Application of TF-IDF Feature for Categorizing Documents of Online Bangla Web Text Corpus. In: Bhateja, V., Coello Coello, C., Satapathy, S., Pattnaik, P. (eds) Intelligent Engineering Informatics. Advances in Intelligent Systems and Computing, vol 695. Springer, Singapore. https://doi.org/10.1007/978-981-10-7566-7_6

DOI: https://doi.org/10.1007/978-981-10-7566-7_6
Published: 11 April 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7565-0
Online ISBN: 978-981-10-7566-7
eBook Packages: Engineering, Engineering (R0)
