Tamil Stopword Removal Based on Term Frequency

Rajkumar, N.; Subashini, T. S.; Rajan, K.; Ramalingam, V.

doi:10.1007/978-981-15-1097-7_3


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1079)


Abstract

As the volume of text data in digital form grows exponentially, managing and retrieving these documents becomes increasingly difficult. A number of natural language processing (NLP) tasks, viz. archival, retrieval, query response, and information summarization, rely heavily on automatic classification of text documents. This has led researchers to apply machine learning to categorize documents automatically by language and, within documents belonging to the same language, to devise methods that segregate them according to their contents. At present, more than 70% of the total text classification process consists of text preprocessing alone [1]. This indicates the importance of preprocessing: the efficiency of text classification depends heavily on an efficient preprocessing step. This article deals with corpus creation for Tamil documents and Tamil stopword removal. Dictionary-based and frequency-based stopword removal methods are proposed in this work.
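The two approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the chapter's implementation, which is not visible in this preview. The function names, the whitespace tokenizer, the `top_k` cutoff, and the three sample sentences are all assumptions made for illustration. The frequency-based part treats the most frequent terms in a corpus as stopword candidates; the dictionary-based part filters tokens against a hand-curated set.

```python
from collections import Counter

# Punctuation stripped from token edges (an illustrative choice, not from the chapter).
PUNCTUATION = ".,;:!?\"'()[]"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]


def frequency_stopwords(documents, top_k):
    """Return the top_k most frequent terms in the corpus as stopword candidates."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {term for term, _ in counts.most_common(top_k)}


def remove_stopwords(text, stopwords):
    """Drop every token that appears in the given stopword set."""
    return [tok for tok in tokenize(text) if tok not in stopwords]


# Hypothetical toy corpus; a real corpus would be far larger.
corpus = [
    "இந்த நூல் ஒரு நல்ல நூல்",
    "இந்த ஊர் ஒரு பெரிய ஊர்",
    "இந்த மரம் ஒரு பழைய மரம்",
]

# Frequency-based: the two most frequent terms become stopword candidates.
candidates = frequency_stopwords(corpus, top_k=2)
filtered = remove_stopwords(corpus[0], candidates)

# Dictionary-based: filter against a hand-curated list instead.
dictionary = {"ஒரு"}
filtered_by_dictionary = remove_stopwords("ஒரு நல்ல நூல்", dictionary)
```

In this toy corpus the two candidates are "இந்த" and "ஒரு", each of which occurs once per sentence. A raw frequency cutoff also tends to capture frequent content words in a real corpus, so candidates from the frequency-based step would normally be reviewed against a curated list before use.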


References

1. Katharina, M., Martin, S.: The mining mart approach to knowledge discovery in databases. Intell. Technol. Inf. Anal. 47–65 (Springer) (2004)
2. El-Khair, A.: Effects of stop words elimination for Arabic information retrieval: a comparative study. Int. J. Comput. Inf. Sci. 4(3), 119–133 (2006)
3. Al-Shalabi, R., Kanaan, G., Jaam, J.M., Hasnah, A., Hilat, E.: Stop-word removal algorithm for Arabic language. In: Proceedings of 1st International Conference on Information & Communication Technologies: from Theory to Applications CTTA ’04, pp. 545–550 (2004)
4. Ashish, T., Kothari, M., Pinkesh, P.: Pre-processing phase of text summarization based on Gujarati language. Int. J. Innovative Res. Comput. Sci. Technol. (IJIRCST) 2(4) (2014)
5. Raulji, J.K., Saini, J.R.: Stop-word removal algorithm and its implementation for Sanskrit language. Int. J. Comput. Appl. (0975–8887) 150(2) (2016)
6. Jha, V., Manjunath, N., Shenoy, P.D., Venugopal, K.R.: HSRA: Hindi stopword removal algorithm. In: 2016 International Conference on Microelectronics, Computing and Communications (MicroCom), Durgapur, pp. 1–5 (2016)
7. Rakholia, R.M., Saini, J.R.: A rule-based approach to identify stop words for Gujarati language. In: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, vol. 515. Springer, Singapore (2017)
8. Lo, R.T.W., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval system. JDIM 3, 3–8 (2005)
9. Popova, S., Krivosheeva, T., Korenevsky, M.: Automatic stop list generation for clustering recognition results of call center recordings. Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol. 8773. Springer, Cham (2014)
10. Yaghoub-Zadeh-Fard, M., Minaei-Bidgoli, B., Rahmani, S., Shahrivari, S.: PSWG: An automatic stop-word list generator for Persian information retrieval systems based on similarity function & POS information. In: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, pp. 111–117 (2015)
11. Makrehchi, M., Kamel, M.S.: Automatic extraction of domain-specific stopwords from labeled documents. In: Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval (ECIR’08). Springer-Verlag, Berlin, Heidelberg, pp. 222–233 (2008)
12. Zou, F. et al.: Automatic construction of Chinese stop word list. In: Proceedings of the 5th WSEAS International Conference on Applied Computer Science, pp. 1009–1014 (2006)
13. Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., Palaniappan, B.: Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Syst. Appl. 36, 10914–10918 (2009)
14. Sanjanasri, J., Anand Kumar, M.: A computational framework for Tamil document classification using random kitchen sink. In: International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2015)
15. Hanumanthappa, M., Swamy, M.N.: Indian language text documents categorization and keyword extraction. IJCTA 9(3), 1473–1481 (2016)
16. Swamy, M.N., Hanumanthappa, M.: Indian language text representation and categorization using supervised learning algorithm. Int. J. Data Min. Techn. Appl. 02, 251–257 (2013)
17. Kanimozhi, S.: Web based classification of Tamil documents using ABPA. Int. J. Sci. Eng. Res. 3(5), ISSN 2229-5518 (2012)
18. Ravishankar, N., Raghunathan, S.: Corpus based sentiment classification of Tamil movie tweets using syntactic patterns. IIOABJ 8 (2017)
19. Ramakrishna Murty, M.V., Murthy, J.V.R., Prasad Reddy, P.V.G.D.: Text document classification based on a least square support vector machines with singular value decomposition. Int. J. Comput. Appl. (IJCA) [impact factor 0.821, 2012] 27(7), 21–26 (2011)


Author information

Authors and Affiliations

Department of Computer Science and Engineering, Annamalai University, Chennai, India
N. Rajkumar, T. S. Subashini & V. Ramalingam
Department of Computer Engineering, Muthiah Polytechnic College, Annamalai Nagar, Chidambaram, India
K. Rajan


Corresponding author

Correspondence to N. Rajkumar.

Editor information

Editors and Affiliations

Professor & Head, Department of Computer Science & Engineering, CMR Technical Campus, Hyderabad, India
K. Srujan Raju
Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Roman Senkerik
Stanley College of Engineering and Technology, Hyderabad, Telangana, India
Satya Prasad Lanka
Department of EEE, Stanley College of Engineering and Technology, Hyderabad, Telangana, India
V. Rajagopal


About this paper

Cite this paper

Rajkumar, N., Subashini, T.S., Rajan, K., Ramalingam, V. (2020). Tamil Stopword Removal Based on Term Frequency. In: Raju, K.S., Senkerik, R., Lanka, S.P., Rajagopal, V. (eds) Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 1079. Springer, Singapore. https://doi.org/10.1007/978-981-15-1097-7_3


DOI: https://doi.org/10.1007/978-981-15-1097-7_3
Published: 09 January 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1096-0
Online ISBN: 978-981-15-1097-7
eBook Packages: Intelligent Technologies and Robotics (R0)
