Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles

  • Rayner Alfred
  • Leow Jia Ren
  • Joe Henry Obit
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 652)


Malay is a major language used by citizens of Malaysia, Indonesia, Singapore and Brunei. Because the language is so widely used, a large and steadily growing number of Malay text articles is available on the internet. Automatically labeling these articles is crucial to managing them. Because few resources and tools exist for automatic topic selection on Malay text, this paper studies the factors that influence the performance of algorithms that can be applied to this task. Topic selection is performed by comparing the content of each article with the candidate topics and assigning the article to the appropriate topic according to the result of the classification process. The articles are classified using the k-Nearest Neighbors (k-NN) and Naïve Bayes classifiers, each of which assigns a topic from a predefined set of topics. The effectiveness of the k-NN classifier depends heavily on the distance measure used and on the number of nearest neighbors, k. The paper therefore also assesses the effects of using different distance measures (e.g., Cosine Similarity and Euclidean Distance) and of varying the number of nearest neighbors, k. The effect of stemming on the performance of the classifiers is studied as well. The results show that the k-NN classifier performs better than the Naïve Bayes classifier in classifying the Malay articles into their respective topics, and that stemming improves the overall performance of both classifiers. Using Cosine Similarity as the distance measure also improves the performance of the k-NN classifier.
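
As a rough illustration of the k-NN setup described in the abstract, the Python sketch below classifies a short Malay query with a majority vote over its k nearest neighbors, using either cosine or Euclidean distance on raw term counts. The toy articles, the topic labels ("sukan", "ekonomi"), and all function names are hypothetical and are not taken from the paper. The abstract does not state the paper's feature weighting, stemming, or preprocessing, so none of these are reproduced here.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the paper): (article text, topic label).
TOY_ARTICLES = [
    ("pasukan bola sepak menang perlawanan", "sukan"),
    ("pemain bola sepak cedera", "sukan"),
    ("harga saham naik di bursa", "ekonomi"),
    ("bank menaikkan kadar faedah", "ekonomi"),
]

def bag_of_words(text):
    """Lowercase, split on whitespace, and count term frequencies."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean distance over the union of terms; a missing term counts as 0."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))

def knn_predict(train, query, k, distance):
    """Majority vote over the k training vectors closest to the query."""
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(topic for _, topic in neighbors)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    train = [(bag_of_words(text), topic) for text, topic in TOY_ARTICLES]
    query = bag_of_words("pemain bola menang")
    for name, dist in [("cosine", cosine_distance),
                       ("euclidean", euclidean_distance)]:
        for k in (1, 3):
            print(name, "k =", k, "->", knn_predict(train, query, k, dist))
```

Cosine distance normalizes by vector length, so a long and a short article on the same topic can still be close, whereas Euclidean distance on raw counts grows with every term the two texts do not share. On a corpus this small the two measures agree; the sketch only shows where the distance measure and k enter the classifier.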


Keywords: Topic selection; Feature extraction; Classification; Clustering



Copyright information

© Springer Nature Singapore Pte Ltd. 2016

Authors and Affiliations

  1. Faculty of Computing and Informatics, Universiti Malaysia Sabah, Sabah, Malaysia
