Abstract
As text data in digital form increases exponentially, managing and retrieving documents becomes difficult. Many natural language processing (NLP) tasks, such as archival, retrieval, query response, and information summarization, rely heavily on automatic classification of text documents. This has led researchers to apply machine learning to categorize documents automatically by language and, among documents in the same language, to devise methods that segregate them according to their contents. At present, preprocessing of text alone accounts for more than 70% of the total text classification process [1]. This underlines the importance of preprocessing: the efficiency of text classification depends on an efficient preprocessing step. This article deals with corpus creation for Tamil documents and Tamil stopword removal. Dictionary-based and frequency-based stopword removal methods are proposed in this work.
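The two families of stopword removal named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the whitespace tokenizer, the toy corpus, and the `top_k` cutoff are all assumptions made for illustration. A real Tamil pipeline would likely need punctuation handling and a tuned frequency threshold.

```python
from collections import Counter


def tokenize(text):
    # Assumption: naive whitespace tokenization, enough for this toy corpus.
    return text.split()


def remove_by_dictionary(tokens, stopwords):
    """Dictionary-based removal: drop tokens found in a fixed stopword set."""
    return [t for t in tokens if t not in stopwords]


def frequent_terms(documents, top_k):
    """Frequency-based selection: the top_k most frequent terms in the corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {term for term, _ in counts.most_common(top_k)}


# Toy corpus (hypothetical, not taken from the chapter).
corpus = [
    "இந்த புத்தகம் ஒரு நல்ல புத்தகம்",
    "இந்த படம் ஒரு நல்ல படம்",
    "இந்த ஊர் ஒரு பெரிய ஊர்",
]

# Derive a stopword set from term frequency, then apply it as a dictionary.
stop = frequent_terms(corpus, top_k=2)
filtered = remove_by_dictionary(tokenize(corpus[0]), stop)
print(sorted(stop))
print(filtered)
```

In this sketch the frequency-based step produces the stopword set, and the dictionary-based step only applies it. A hand-curated list could be passed to `remove_by_dictionary` in the same way.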
References
Katharina, M., Martin, S.: The mining mart approach to knowledge discovery in databases. Intell. Technol. Inf. Anal. 47–65 (Springer) (2004)
El-Khair, A.: Effects of stop words elimination for Arabic information retrieval: a comparative study. Int. J. Comput. Inf. Sci. 4(3), 119–133 (2006)
Al-Shalabi, R., Kanaan, G., Jaam, J.M., Hasnah, A., Hilat, E.: Stop-word removal algorithm for Arabic language. In: Proceedings of 1st International Conference on Information & Communication Technologies: from Theory to Applications CTTA ’04, pp. 545–550 (2004)
Ashish, T., Kothari, M., Pinkesh, P.: Pre-processing phase of text summarization based on Gujarati language. Int. J. Innovative Res. Comput. Sci. Technol. (IJIRCST) 2 (4) (2014)
Raulji, J.K., Saini, J.R.: Stop-word removal algorithm and its implementation for Sanskrit language. Int. J. Comput. Appl. (0975–8887) 150 (2) (2016)
Jha, V., Manjunath, N., Shenoy, P.D., Venugopal, K.R.: HSRA: Hindi stopword removal algorithm. In: 2016 International Conference on Microelectronics, Computing and Communications (MicroCom), Durgapur, pp. 1–5 (2016)
Rakholia, R.M., Saini, J.R.: A rule-based approach to identify stop words for Gujarati language. In: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, vol. 515. Springer, Singapore (2017)
Lo, R.T.W., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval system. JDIM 3, 3–8 (2005)
Popova, S., Krivosheeva, T., Korenevsky, M.: Automatic stop list generation for clustering recognition results of call center recordings. Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol. 8773. Springer, Cham (2014)
Yaghoub-Zadeh-Fard, M., Minaei-Bidgoli, B., Rahmani, S., Shahrivari, S.: PSWG: An automatic stop-word list generator for Persian information retrieval systems based on similarity function & POS information. In: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, pp. 111–117 (2015)
Makrehchi, M., Kamel, M.S.: Automatic extraction of domain-specific stopwords from labeled documents. In: Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval (ECIR’08). Springer-Verlag, Berlin, Heidelberg, pp. 222–233 (2008)
Zou, F. et al.: Automatic construction of Chinese stop word list. In: Proceedings of the 5th WSEAS International Conference on Applied Computer Science, pp. 1009–1014 (2006)
Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., Palaniappan, B.: Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Syst. Appl. 36, 10914–10918 (2009)
Sanjanasri, J., Anand Kumar, M.: A computational framework for Tamil document classification using random kitchen sink. In: International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2015)
Hanumanthappa, M., Swamy, M.N.: Indian language text documents categorization and keyword extraction. IJCTA, 9 (3), 1473–1481 (2016)
Swamy, M.N., Hanumanthappa, M.: Indian language text representation and categorization using supervised learning algorithm. Int. J. Data Min. Techn. Appl. 02, 251–257 (2013)
Kanimozhi, S.: Web based classification of Tamil documents using ABPA. Int. J. Sci. Eng. Res. 3 (5), ISSN 2229-5518 (2012)
Ravishankar, N., Raghunathan, S.: Corpus based sentiment classification of Tamil movie tweets using syntactic patterns. IIOABJ, 8 (2017)
Ramakrishna Murty, M.V., Murthy, J.V.R., Prasad Reddy, P.V.G.D.: Text document classification based on a least square support vector machines with singular value decomposition. Int. J. Comput. Appl. (IJCA) 27 (7), 21–26 (2011)
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rajkumar, N., Subashini, T.S., Rajan, K., Ramalingam, V. (2020). Tamil Stopword Removal Based on Term Frequency. In: Raju, K.S., Senkerik, R., Lanka, S.P., Rajagopal, V. (eds) Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 1079. Springer, Singapore. https://doi.org/10.1007/978-981-15-1097-7_3
DOI: https://doi.org/10.1007/978-981-15-1097-7_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1096-0
Online ISBN: 978-981-15-1097-7
eBook Packages: Intelligent Technologies and Robotics (R0)