Tamil Stopword Removal Based on Term Frequency

  • Conference paper
Data Engineering and Communication Technology

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1079)

Abstract

As text data in digital form grows exponentially, managing and retrieving these documents becomes difficult. A number of natural language processing (NLP) tasks, viz. archival, retrieval, query response, information summarization, etc., rely heavily on automatic classification of text documents. This has led researchers to apply machine learning to categorize documents automatically by language and, among documents in the same language, to devise methods that segregate them according to their contents. At present, preprocessing of text alone accounts for more than 70% of the total text classification process [1]. This indicates the importance of preprocessing: the efficiency of the text classification logic depends on an efficient preprocessing step. This article deals with corpus creation for Tamil documents and Tamil language stopword removal. Dictionary-based and frequency-based stopword removal methods are proposed in this work.
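The abstract does not give the details of either method, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes whitespace tokenization, treats the `top_k` most frequent terms in a corpus as stopwords (the frequency-based idea), and filters tokens against a word set (the dictionary-based idea, where the set could equally be a hand-curated list). The function names, the `top_k` cutoff, and the three-sentence toy corpus are all hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]"


def tokenize(text):
    # Assumption: split on whitespace and strip surrounding punctuation.
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]


def frequency_stopwords(documents, top_k):
    # Count every term across the corpus and treat the top_k most
    # frequent terms as stopword candidates (hypothetical cutoff).
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(top_k)}


def remove_stopwords(text, stopwords):
    # Dictionary-style filtering: drop any token found in the stopword set.
    return [tok for tok in tokenize(text) if tok not in stopwords]


# Toy corpus (illustrative only, not data from the paper).
corpus = [
    "இந்த படம் ஒரு நல்ல படம்",
    "இந்த புத்தகம் ஒரு நல்ல புத்தகம்",
    "இந்த ஊர் ஒரு பெரிய ஊர்",
]
stopwords = frequency_stopwords(corpus, top_k=2)
filtered = remove_stopwords(corpus[0], stopwords)
```

On this toy corpus the two most frequent terms are the function words இந்த ("this") and ஒரு ("a/one"), so they are the ones removed. On a real corpus a raw frequency cutoff can also capture frequent content words, which is one reason a frequency-derived list is often reviewed against a curated dictionary.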

References

  1. Katharina, M., Martin, S.: The mining mart approach to knowledge discovery in databases. Intell. Technol. Inf. Anal. 47–65 (Springer) (2004)

  2. El-Khair, A.: Effects of stop words elimination for Arabic information retrieval: a comparative study. Int. J. Comput. Inf. Sci. 4(3), 119–133 (2006)

  3. Al-Shalabi, R., Kanaan, G., Jaam, J.M., Hasnah, A., Hilat, E.: Stop-word removal algorithm for Arabic language. In: Proceedings of 1st International Conference on Information & Communication Technologies: from Theory to Applications CTTA ’04, pp. 545–550 (2004)

  4. Ashish, T., Kothari, M., Pinkesh, P.: Pre-processing phase of text summarization based on Gujarati language. Int. J. Innovative Res. Comput. Sci. Technol. (IJIRCST) 2 (4) (2014)

  5. Raulji, J.K., Saini, J.R.: Stop-word removal algorithm and its implementation for Sanskrit language. Int. J. Comput. Appl. (0975–8887) 150 (2) (2016)

  6. Jha, V., Manjunath, N., Shenoy, P.D., Venugopal, K.R.: HSRA: Hindi stopword removal algorithm. In: 2016 International Conference on Microelectronics, Computing and Communications (MicroCom), Durgapur, pp. 1–5 (2016)

  7. Rakholia, R.M., Saini, J.R.: A rule-based approach to identify stop words for Gujarati language. In: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, vol. 515. Springer, Singapore (2017)

  8. Lo, R.T.W., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval system. JDIM 3, 3–8 (2005)

  9. Popova, S., Krivosheeva, T., Korenevsky, M.: Automatic stop list generation for clustering recognition results of call center recordings. Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol. 8773. Springer, Cham (2014)

  10. Yaghoub-Zadeh-Fard, M., Minaei-Bidgoli, B., Rahmani, S., Shahrivari, S.: PSWG: An automatic stop-word list generator for Persian information retrieval systems based on similarity function & POS information. In: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, pp. 111–117 (2015)

  11. Makrehchi, M., Kamel, M.S.: Automatic extraction of domain-specific stopwords from labeled documents. In: Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval (ECIR’08). Springer-Verlag, Berlin, Heidelberg, pp. 222–233 (2008)

  12. Zou, F. et al.: Automatic construction of Chinese stop word list. In: Proceedings of the 5th WSEAS International Conference on Applied Computer Science, pp. 1009–1014 (2006)

  13. Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., Palaniappan, B.: Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Syst. Appl. 36, 10914–10918 (2009)

  14. Sanjanasri, J., Anand Kumar, M.: A computational framework for Tamil document classification using random kitchen sink. In: International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2015)

  15. Hanumanthappa, M., Swamy, M.N.: Indian language text documents categorization and keyword extraction. IJCTA, 9 (3), 1473–1481 (2016)

  16. Swamy, M.N., Hanumanthappa, M.: Indian language text representation and categorization using supervised learning algorithm. Int. J. Data Min. Techn. Appl. 02, 251–257 (2013)

  17. Kanimozhi, S.: Web based classification of Tamil documents using ABPA. Int. J. Sci. Eng. Res. 3 (5), ISSN 2229-5518 (2012)

  18. Ravishankar, N., Raghunathan, S.: Corpus based sentiment classification of Tamil movie tweets using syntactic patterns. IIOABJ, 8 (2017)

  19. Ramakrishna Murty, M.V., Murthy, J.V.R., Prasad Reddy, P.V.G.D.: Text document classification based on a least square support vector machines with singular value decomposition. Int. J. Comput. Appl. (IJCA) [impact factor 0.821, 2012], 27 (7), 21–26 (2011)

Author information

Corresponding author

Correspondence to N. Rajkumar.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rajkumar, N., Subashini, T.S., Rajan, K., Ramalingam, V. (2020). Tamil Stopword Removal Based on Term Frequency. In: Raju, K.S., Senkerik, R., Lanka, S.P., Rajagopal, V. (eds) Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 1079. Springer, Singapore. https://doi.org/10.1007/978-981-15-1097-7_3
