Applying Machine Learning Algorithms for News Articles Categorization: Using SVM and kNN with TF-IDF Approach

Smart Computational Strategies: Theoretical and Practical Aspects

Abstract

News article categorization is a supervised learning task in which articles are assigned category labels based on the likelihood suggested by a training set of labeled articles. A system for automatically categorizing news articles into a standard set of categories has been implemented. The proposed work uses the Term Frequency–Inverse Document Frequency (TF-IDF) term weighting scheme to improve classification results and applies two supervised learning approaches, Support Vector Machine (SVM) and k-Nearest Neighbor (kNN), comparing the performance of both classifiers. Each news document is preprocessed and transformed into a term-document matrix (Tsoumakas et al. in Data mining and knowledge discovery handbook. Springer, Berlin, pp 667–685 (2010) [1]). After each preprocessed news article is transformed into a vector, the TF-IDF scheme is used to weight its words: a word's weight increases with the number of times it appears in a document and decreases with the number of documents in the collection that contain it. An unknown news item is likewise transformed into a vector of keyword weights and then assigned to a suitable category such as Sports, Business, or Science and Technology. The proposed system was trained on a collection of approximately 300 categorized news articles extracted from various Indian newspaper websites and tested on a different set of 60 randomly extracted news items from the same sources (Trstenjak et al. in Proc Eng 69:1356–1364 (2014) [2], Buana et al. Int J Comput Appl 50:37–42 [3]). The performance of both algorithms was observed to improve when the TF-IDF approach is used.
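The pipeline described in the abstract (preprocess, weight terms with TF-IDF, classify an unseen item with kNN) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and does not come from the chapter: the six toy articles, the tiny stopword list, the smoothed IDF variant, cosine similarity as the distance measure, k = 3, and the helper names `tokenize`, `fit_idf`, `tfidf_vector`, `knn_predict`, and `classify`. The SVM classifier is not sketched; it would consume the same TF-IDF vectors.

```python
import math
from collections import Counter

# Hypothetical stopword list; real preprocessing would be more thorough.
STOPWORDS = {"the", "a", "an", "in", "as", "and", "into", "of"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOPWORDS]

def fit_idf(docs_tokens):
    """Smoothed inverse document frequency for every training term."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse, L2-normalized TF-IDF vector; unknown terms are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}

def cosine(a, b):
    """Dot product; equals cosine similarity because vectors are unit length."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (invented for this sketch, not the chapter's data).
train = [
    ("The team won the cricket match", "Sports"),
    ("The striker scored a late goal in the football match", "Sports"),
    ("Stock markets fell as the bank raised interest rates", "Business"),
    ("The bank reported higher quarterly profit", "Business"),
    ("Scientists launched a new satellite into orbit", "Science and Technology"),
    ("The new smartphone uses a faster processor", "Science and Technology"),
]

train_tokens = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]

def classify(text, k=3):
    """Vectorize an unseen article and label it with kNN."""
    return knn_predict(tfidf_vector(tokenize(text), idf), train_vecs, labels, k)

print(classify("The bank cut interest rates and markets rallied"))  # → Business
```

In this sketch the IDF table is fitted on the training articles only and reused for unseen items, so a test article never influences the term weights it is scored against.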

References

  1. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, Berlin (2010)

  2. Trstenjak, B., Mikac, S., Donko, D.: kNN with TF-IDF based framework for text categorization. Proc. Eng. 69, 1356–1364 (2014). In: 24th DAAAM International Symposium on Intelligent Manufacturing and Automation, 2013

  3. Buana, P.W., Jannet Sesaltina, D.R.M., Putra Darma Gede Ketut, I.: Combination of K-nearest neighbour and K-means based on term re-weighting for classify Indonesian News. Int. J. Comput. Appl. (0975-8887) 50(11), 37–42 (2012)

  4. Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text classification using machine learning techniques. WSEAS Trans. Comput. 4(8), 966–974 (2005)

  5. Bijalwan, V., Kumar, V., Kumari, P., Pascual, J.: kNN based machine learning approach for text and document mining. Int. J. Database Theory Appl. 7(1), 61–70 (2014)

  6. Rahmawati, D., Khodra, L.M.: Automatic multilabel classification for Indonesian news articles (2015). IEEE 978-1-4673-8143-7/15

  7. Zhang, W., Yoshida, T., Tang, X.: A comparative study of TF*IDF, LSI and multi words for text classification. Expert Syst. Appl. 38(3), 2758–2765 (2011)

  8. Elkan, C.: Evaluating Classifiers, elkan@cs.ucsd.edu (Jan 20, 2012)

  9. Package shiny. https://cran.r-project.org/web/packages/shiny/shiny.pdf

Author information

Corresponding author

Correspondence to Kanika.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Kanika, Sangeeta (2019). Applying Machine Learning Algorithms for News Articles Categorization: Using SVM and kNN with TF-IDF Approach. In: Luhach, A.K., Hawari, K.B.G., Mihai, I.C., Hsiung, PA., Mishra, R.B. (eds) Smart Computational Strategies: Theoretical and Practical Aspects. Springer, Singapore. https://doi.org/10.1007/978-981-13-6295-8_9

  • DOI: https://doi.org/10.1007/978-981-13-6295-8_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-6294-1

  • Online ISBN: 978-981-13-6295-8

  • eBook Packages: Computer Science, Computer Science (R0)
