A Novel Short Text Clustering Model Based on Grey System Theory

  • Hüseyin Fidan
  • Mehmet Erkan Yuksel
Research Article - Computer Engineering and Computer Science

Abstract

Short text clustering is challenging for structural reasons, especially on small datasets. The limited number of words per text leads to poor-quality feature vectors, low clustering accuracy, and failed analyses. Although several approaches have been proposed in the related literature, there is still no agreement on an efficient solution. Grey system theory, which gives better results in numerical analyses with insufficient data, has not yet been applied to short text clustering. The purpose of our study is to develop a short text clustering model, based on Grey system theory, that is applicable to small datasets. To measure the efficiency of our method, we obtained book reviews labeled as negative or positive from the Amazon.com dataset collections and created small datasets from them. Grey relational clustering, as well as hierarchical and partitional algorithms, was applied to each small dataset separately. According to the results, our model achieves higher accuracy than the other algorithms when clustering small datasets of short texts. These results indicate that Grey relational clustering is a suitable choice for short text clustering on small datasets.
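
The abstract does not spell out how Grey relational clustering is computed, so the following is only a minimal sketch of the textbook grey relational grade (Deng's formulation) that such methods usually build on, not the authors' procedure. The function name, the toy term-weight vectors, and the distinguishing coefficient `zeta = 0.5` (a conventional default) are assumptions made for illustration.

```python
from typing import List, Sequence


def grey_relational_grades(reference: Sequence[float],
                           candidates: Sequence[Sequence[float]],
                           zeta: float = 0.5) -> List[float]:
    """Return the grey relational grade of each candidate w.r.t. the reference.

    zeta is the distinguishing coefficient; 0.5 is a common default (assumption).
    """
    # Absolute difference between the reference and each candidate, per feature.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate is identical to the reference.
        return [1.0 for _ in candidates]
    grades = []
    for row in deltas:
        # Grey relational coefficient for each feature of this candidate.
        coeffs = [(d_min + zeta * d_max) / (d + zeta * d_max) for d in row]
        # The grade is the mean of the coefficients.
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Hypothetical term-weight vectors for three short reviews (illustrative only).
docs = [
    [1.0, 0.0, 0.5, 0.0],
    [0.9, 0.1, 0.4, 0.0],
    [0.0, 1.0, 0.0, 0.8],
]
# Grades of the second and third review relative to the first.
grades = grey_relational_grades(docs[0], docs[1:])
```

A higher grade means a candidate is closer to the reference, so a clustering step could group each text with the reference it relates to most strongly. How references are chosen and how clusters are formed is left open here because the abstract does not describe it.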

Keywords

  • Grey system theory
  • Text mining
  • Short text clustering
  • Small dataset clustering
  • Grey relational clustering
  • Machine learning

Copyright information

© King Fahd University of Petroleum & Minerals 2019

Authors and Affiliations

  1. Department of Industrial Engineering, Burdur Mehmet Akif Ersoy University, Burdur, Turkey
  2. Department of Computer Engineering, Burdur Mehmet Akif Ersoy University, Burdur, Turkey
