
Social Choice Theory Based Domain Specific Hindi Stop Words List Construction and Its Application in Text Mining

  • Ruby Rani
  • D. K. Lobiyal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11278)

Abstract

In this paper, we attempt to create domain-specific Hindi stop word lists using statistical and knowledge-based techniques applied to textual corpora prepared for different domains. To remove the ranking bias of each individual technique, Borda’s rule, a vote ranking method, is employed to construct an unbiased stop word list. We also propose a novel approach called netting ranked performance evaluation (NRPE) to evaluate the prepared stop word lists, in which stop words are removed in leading and trailing fashion based on the ascending and descending order of terms. Further, using combined band net (CBN) performance, we demonstrate the ability of each technique to identify candidate stop words and then to select features for text mining models. The experimental results show that a technique that selects good features for classification or clustering does not necessarily find good stop words. The results also show that the final Borda lists give normalized performance relative to the individual techniques. This approach guarantees candidate stop word removal, least information dissipation and text mining model performance enhancement.
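As a rough illustration of how Borda’s rule can merge several ranked candidate lists into one, the following Python sketch applies the textbook Borda count. The abstract does not give the exact scoring variant, so the point scheme (n - 1 - position), the alphabetical tie-breaking, the function name `borda_aggregate`, and the three sample rankings are all assumptions made for illustration rather than the authors' implementation.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge ranked term lists into one list using the standard Borda count.

    Each ranking lists terms with the strongest stop word candidate first.
    A term at position p in a list of n terms earns n - 1 - p points;
    a term absent from a list earns no points from that list.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, term in enumerate(ranking):
            scores[term] += n - 1 - position
    # Highest total first; ties broken by term order to keep output deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))


# Hypothetical rankings from three unnamed techniques (illustrative data only).
technique_a = ["का", "है", "में", "और"]
technique_b = ["है", "का", "और", "में"]
technique_c = ["का", "में", "है", "और"]

merged = borda_aggregate([technique_a, technique_b, technique_c])
print(merged)
```

In this toy input no single technique decides the outcome: a term ranked first by one technique but low by the others ends up below a term that all techniques rank consistently high, which is the bias-reducing effect the abstract attributes to Borda’s rule.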

Keywords

Hindi language, Stop words removal, Information retrieval, Borda’s vote ranking method, Text classifier, Text clustering

Notes

Acknowledgements

This work has been partially supported by the UPE-II grant received from JNU. The authors would like to thank the anonymous reviewers for their kind comments.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Jawaharlal Nehru University, New Delhi, India
