Cognitive Computation, Volume 11, Issue 2, pp 271–293

Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution

  • Naveen Saini
  • Sriparna Saha
  • Pushpak Bhattacharyya

Abstract

Document clustering is the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion. This task has applications in scope detection of journals and conferences, development of automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and semantics-based classification of documents. In the current paper, a cognitive-inspired multi-objective automatic document clustering technique is proposed that fuses a self-organizing map (SOM) with a multi-objective differential evolution approach. A variable number of cluster centers is encoded in the different solutions of the population so that the number of clusters in a data set is determined automatically. These solutions undergo various genetic operations during evolution. The concept of SOM is utilized in designing new genetic operators for the proposed clustering technique. To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, namely the self-organizing map based multi-objective document clustering technique (SMODoc_clust), is shown in the automatic classification of some scientific articles and web documents. Different representation schemes, including tf, tf-idf, and word embeddings, are employed to convert articles into vector form. Comparative results with respect to two internal cluster validity indices, namely the Dunn index and the Davies-Bouldin index, are shown against several state-of-the-art clustering techniques: three multi-objective clustering techniques (MOCK, VAMOSA, and NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results indicate that the proposed approach outperforms the compared techniques. The obtained results are also validated using statistical significance tests (t-tests).
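
To make the two optimized objectives concrete, the following is a minimal sketch of how the Pakhira-Bandyopadhyay-Maulik (PBM) index and the Silhouette index could be computed for a fixed partition. It uses the standard textbook definitions with Euclidean distance and is not taken from the authors' implementation; the function names (`pbm_index`, `silhouette_index`), the toy points, and the label vectors are all assumptions chosen for illustration. For both indices a larger value suggests a more compact and better separated partition.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def pbm_index(points, labels):
    """PBM(K) = ((1/K) * (E_1 / E_K) * D_K) ** 2 (standard definition)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2:
        return 0.0  # D_K needs at least two centers; 0.0 is an assumed convention
    centers = {lab: centroid(ps) for lab, ps in clusters.items()}
    overall = centroid(points)
    # E_1: total distance to the single overall center.
    e_1 = sum(dist(p, overall) for p in points)
    # E_K: total distance of each point to its own cluster center.
    e_k = sum(dist(p, centers[lab]) for p, lab in zip(points, labels))
    if e_k == 0.0:
        return math.inf  # every point coincides with its cluster center
    cs = list(centers.values())
    # D_K: largest separation between any two cluster centers.
    d_k = max(dist(cs[i], cs[j]) for i in range(k) for j in range(i + 1, k))
    return ((1.0 / k) * (e_1 / e_k) * d_k) ** 2


def silhouette_index(points, labels):
    """Mean of s(i) = (b_i - a_i) / max(a_i, b_i) over all points."""
    n = len(points)
    total = 0.0
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if i != j:
                lab = labels[j]
                sums[lab] = sums.get(lab, 0.0) + dist(points[i], points[j])
                counts[lab] = counts.get(lab, 0) + 1
        own = labels[i]
        others = [sums[lab] / counts[lab] for lab in sums if lab != own]
        if own not in counts or not others:
            continue  # singleton cluster or single cluster: s(i) taken as 0
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(others)              # mean distance to the nearest other cluster
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / n


# Toy data (hypothetical): two well-separated pairs of 2-D "document vectors".
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two pairs

for name, labels in (("good", good), ("bad", bad)):
    print(name, pbm_index(points, labels), silhouette_index(points, labels))
```

In a multi-objective setting such as the one the abstract describes, each candidate partition would be scored by both functions and the pair of values treated as its objective vector; how the paper combines them with SOM-based operators is not shown here.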

Keywords

Clustering; Cluster validity indices; Self-organizing map (SOM); Differential evolution (DE); Polynomial mutation; Multi-objective optimization (MOO)

Notes

Acknowledgments

Dr. Sriparna Saha would like to acknowledge the support from SERB Women in Excellence Award-SB/WEA/08/2017 for conducting this particular research.

Compliance with Ethical Standards

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India
