Abstract
News article categorization is a supervised learning task in which news articles are assigned category labels based on the likelihood suggested by a training set of labeled articles. A system for automatic categorization of news articles into a standard set of categories has been implemented. The work uses the Term Frequency–Inverse Document Frequency (TF-IDF) term weighting scheme to improve classification results and compares the performance of two supervised learning approaches, Support Vector Machine (SVM) and K-Nearest Neighbor (kNN). Each news document is preprocessed and transformed into a term-document matrix (Tsoumakas et al. in Data mining and knowledge discovery handbook. Springer, Berlin, pp 667–685 (2010) [1]). After preprocessing, each news article is represented as a vector of weights, with the TF-IDF scheme used to weight each word. TF-IDF weights a word by how often it appears in a document, discounted by how many documents in the collection contain it. An unknown news item is likewise transformed into a vector of keyword weights and then assigned to a suitable category such as Sports, Business, or Science and Technology. The system proposed in this work was trained on a collection of approximately 300 categorized news articles extracted from various Indian newspaper websites and tested on a different set of 60 randomly extracted news items from the same sources (Trstenjak et al. in Proc Eng 69:1356–1364 (2014) [2], Buana et al. Int J Comput Appl 50:37–42 [3]). The performance of both algorithms was observed to improve when the TF-IDF approach is used.
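The pipeline summarized above (TF-IDF weighting followed by a nearest-neighbor vote) can be illustrated with a minimal sketch. This is not the chapter's implementation: the toy training sentences, the whitespace tokenizer, the smoothed IDF formula, the cosine similarity measure, and all function names are assumptions made for illustration, and the SVM classifier is omitted.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; real preprocessing
    # (stop-word removal, stemming) is omitted here.
    return text.lower().split()

def fit_idf(docs_tokens):
    # Document frequency: number of training documents containing each term.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    # Term frequency (relative count) times IDF; unseen terms are dropped.
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_predict(text, train_vectors, train_labels, idf, k=3):
    # Rank training articles by similarity to the query, then majority vote.
    query = tfidf_vector(tokenize(text), idf)
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus; the chapter's data set is about 300 real articles.
train = [
    ("the team won the cricket match", "Sports"),
    ("the striker scored a late goal in the match", "Sports"),
    ("the stock market fell as the bank raised rates", "Business"),
    ("the company reported strong quarterly profit", "Business"),
    ("scientists launched a new satellite into orbit", "Science and Technology"),
    ("the new smartphone uses a faster processor", "Science and Technology"),
]
train_tokens = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_tokens)
train_vectors = [tfidf_vector(tokens, idf) for tokens in train_tokens]

if __name__ == "__main__":
    print(knn_predict("cricket team won a thrilling match",
                      train_vectors, train_labels, idf, k=3))
```

Representing each vector as a sparse dictionary keeps the sketch dependency-free; with a corpus of realistic size, a library implementation of TF-IDF and of the classifiers would be the more practical choice.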
References
Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, Berlin (2010)
Trstenjak, B., Mikac, S., Donko, D.: kNN with TF-IDF based framework for text categorization. In: 24th DAAAM International Symposium on Intelligent Manufacturing and Automation, 2013. Proc. Eng. 69, 1356–1364 (2014)
Buana, P.W., Jannet Sesaltina, D.R.M., Putra Darma Gede Ketut, I.: Combination of K-nearest neighbour and K-means based on term re-weighting for classify Indonesian News. Int. J. Comput. Appl. (0975-8887) 50(11), 37–42 (July 2012)
Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text classification using machine learning techniques. WSEAS Trans. Comput. 4(8), 966–974 (Aug 2005)
Bijalwan, V., Kumar, V., Kumari, P., Pascual, J.: kNN based machine learning approach for text and document mining. Int. J. Database Theory Appl. 7(1), 61–70 (2014)
Rahmawati, D., Khodra, L.M.: Automatic multilabel classification for Indonesian news articles (2015). IEEE 978-1-4673-8143-7/15
Zhang, W., Yoshida, T., Tang, X.: A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst. Appl. 38(3), 2758–2765 (2011)
Elkan, C.: Evaluating Classifiers, elkan@cs.ucsd.edu (Jan 20, 2012)
Package shiny. https://cran.r-project.org/web/packages/shiny/shiny.pdf
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Kanika, Sangeeta (2019). Applying Machine Learning Algorithms for News Articles Categorization: Using SVM and kNN with TF-IDF Approach. In: Luhach, A.K., Hawari, K.B.G., Mihai, I.C., Hsiung, PA., Mishra, R.B. (eds) Smart Computational Strategies: Theoretical and Practical Aspects. Springer, Singapore. https://doi.org/10.1007/978-981-13-6295-8_9
DOI: https://doi.org/10.1007/978-981-13-6295-8_9
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6294-1
Online ISBN: 978-981-13-6295-8
eBook Packages: Computer Science, Computer Science (R0)