An Automatic Construction of Malay Stop Words Based on Aggregation Method

  • Khalifa Chekima
  • Rayner Alfred (corresponding author)
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 652)

Abstract

In information retrieval, removing stop words is key to effective indexing. Although many theories and algorithms exist for constructing stop word lists in many languages, most Malay stop word lists in use are either borrowed from English stop words or constructed manually by different researchers, which is costly, time-consuming and error-prone. In other words, no standard stop word list has yet been constructed for the Malay language. In this study, we propose an aggregation technique that combines three different approaches for the automatic construction of a general Malay stop word list. The first approach is statistical: it considers words' frequencies (highest and lowest) against their ranks, and is inspired by Zipf's law. The second approach considers the distribution of words across documents using a variance measure. The third approach computes how informative a word is using an entropy measure. As a result, a total of 339 Malay stop words were produced. The findings and their implications are discussed further.
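
The three measures named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, thresholds or aggregation rule, so everything below is an assumption rather than the authors' method: whitespace tokenisation, the mean per-document relative frequency as the frequency score, the variance of that relative frequency across documents, the entropy of a word's distribution over documents, and aggregation by intersecting the top-`k` candidates of each measure. The function names (`score_words`, `top_k`, `aggregate_stop_words`) and the parameter `k` are hypothetical. The sketch covers only the high-frequency end; the low-frequency words that the abstract also mentions are not handled.

```python
import math
from collections import Counter


def doc_probabilities(docs):
    """Relative frequency of each word within each document (whitespace tokens)."""
    probs = []
    for doc in docs:
        tokens = doc.lower().split()
        counts = Counter(tokens)
        total = len(tokens)
        probs.append({w: c / total for w, c in counts.items()} if total else {})
    return probs


def score_words(docs):
    """Return three dicts: mean frequency, variance and entropy per word."""
    probs = doc_probabilities(docs)
    vocab = set().union(*probs)
    n = len(docs)
    freq, var, ent = {}, {}, {}
    for w in vocab:
        p = [d.get(w, 0.0) for d in probs]
        mean = sum(p) / n
        freq[w] = mean
        # Assumed variance measure: spread of the per-document relative frequency.
        var[w] = sum((x - mean) ** 2 for x in p) / n
        # Assumed entropy measure: how evenly the word is spread over documents.
        total = sum(p)
        ent[w] = -sum((x / total) * math.log(x / total) for x in p if x > 0)
    return freq, var, ent


def top_k(scores, k, reverse):
    """The k best-ranked words under one measure."""
    ranked = sorted(scores, key=scores.get, reverse=reverse)
    return set(ranked[:k])


def aggregate_stop_words(docs, k):
    """Assumed aggregation: keep words that all three measures agree on."""
    freq, var, ent = score_words(docs)
    by_frequency = top_k(freq, k, reverse=True)   # most frequent
    by_variance = top_k(var, k, reverse=False)    # most evenly used
    by_entropy = top_k(ent, k, reverse=True)      # least informative
    return by_frequency & by_variance & by_entropy
```

On a toy corpus such as `["saya dan kamu pergi ke pasar", "dia dan mereka pergi ke sekolah", "kucing dan anjing lari ke taman"]` with `k=2`, the function words "dan" and "ke" are the only candidates selected by all three measures. Intersection is a strict choice; a union or a rank-sum vote would be equally plausible readings of "aggregation".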

Keywords

Information retrieval, Malay stop-words, Natural language processing

Copyright information

© Springer Nature Singapore Pte Ltd. 2016

Authors and Affiliations

  1. Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Malaysia
