Bengali Stop Word and Phrase Detection Mechanism

Abstract

Although stop word and stop phrase detection has been widely studied, no prior work addresses Bengali stop words and stop phrases. This research proposes a definition and classification of Bengali stop words and phrases and implements two approaches to identify them: a corpus-based approach and a finite-state automaton-based approach. The performance of both approaches is measured and compared. The analysis shows that the corpus-based method outperforms the finite-state automaton-based method: the corpus-based and finite-state automaton-based methods achieve 90% and 80% accuracy, respectively, for stop word detection, and 80% and 70% accuracy, respectively, for stop phrase detection.
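
The abstract names the two approaches without giving their details, so the following is only a minimal sketch of what each idea could look like in general, not the authors' implementation. It assumes whitespace tokenization, a document-frequency threshold (`min_doc_ratio`) for the corpus-based side, and a trie-shaped deterministic automaton built from a known stop list for the automaton side. All function names, the threshold value, and the toy Bengali sentences are hypothetical and chosen for illustration.

```python
from collections import Counter


def corpus_stop_words(documents, min_doc_ratio=0.8):
    """Corpus-based sketch: return tokens found in at least
    min_doc_ratio of the documents (assumed heuristic)."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc.split()))
    threshold = min_doc_ratio * len(documents)
    return {tok for tok, df in doc_freq.items() if df >= threshold}


def build_automaton(stop_words):
    """Automaton sketch: build a trie-shaped DFA from a stop list.
    State 0 is the start state; transitions map (state, char) -> state."""
    transitions, accepting, next_state = {}, set(), 1
    for word in stop_words:
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(automaton, word):
    """Return True if the automaton ends in an accepting state on word."""
    transitions, accepting = automaton
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in accepting


# Toy corpus (hypothetical): three short Bengali sentences.
docs = ["আমি এবং তুমি", "সে এবং আমি", "বই এবং কলম"]
candidates = corpus_stop_words(docs)

# Automaton built from a small hand-written stop list (hypothetical).
automaton = build_automaton({"এবং", "এই"})
is_stop = accepts(automaton, "এবং")
```

The corpus-based function derives candidates from data, whereas the automaton only recognizes words it was built from, which is one plausible reason the two approaches could differ in accuracy; the abstract itself does not state the cause.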

Acknowledgements

The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work through research group no. RG-1441-394.

Author information

Correspondence to M. F. Mridha.

About this article

Cite this article

Haque, R.U., Mridha, M.F., Hamid, M.A. et al. Bengali Stop Word and Phrase Detection Mechanism. Arab J Sci Eng (2020). https://doi.org/10.1007/s13369-020-04388-8

Keywords

  • Stop phrase
  • Stop word
  • Natural language processing
  • Finite automaton
  • Text processing