Agglomerative Similarity Measure Based Automated Clustering of Scholarly Articles

  • Dilip Singh Sisodia
  • Manjula Choudhary
  • Tummala Vandana
  • Rishi Rai
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 748)


The flood of online scholarly articles necessitates the automated organization of documents according to their most descriptive attributes. In this paper, an agglomerative similarity measure based on common features associated with research articles, such as number of references, authors, citations, and content, is used for automated clustering of scholarly articles. The agglomerative similarity matrix combines a citation matrix, an author matrix, and a content matrix for feature vector representation. The experiments are performed on agglomerative feature vectors derived from the wiki20 dataset using different unsupervised learning algorithms: K-means, K-medoids, and Fuzzy C-means. The clustering results obtained with the modified feature vector are compared to the existing bag-of-words model using separation and cohesion as performance metrics. Dunn's index is used to find the optimal number of clusters.
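As a rough illustration of the idea summarized above, the sketch below combines a citation, an author, and a content similarity matrix into one matrix and scores a given clustering with a Dunn-style index. It is a minimal sketch under stated assumptions, not the paper's implementation: the equal weights, the Jaccard overlap for references and authors, the raw term-count cosine (rather than TF-IDF), the field names `refs`, `authors`, and `text`, and the toy documents are all hypothetical.

```python
from collections import Counter
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(text_a, text_b):
    """Cosine similarity of raw term-count vectors (no TF-IDF weighting)."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def combined_similarity(docs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of citation, author, and content similarities.

    Equal weights are an assumption of this sketch.
    """
    w_cit, w_auth, w_con = weights
    n = len(docs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sim[i][j] = (
                w_cit * jaccard(docs[i]["refs"], docs[j]["refs"])
                + w_auth * jaccard(docs[i]["authors"], docs[j]["authors"])
                + w_con * cosine(docs[i]["text"], docs[j]["text"])
            )
    return sim


def dunn_index(dist, labels):
    """Smallest between-cluster distance over largest within-cluster distance."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    inter = [dist[i][j] for i, j in pairs if labels[i] != labels[j]]
    intra = [dist[i][j] for i, j in pairs if labels[i] == labels[j]]
    if not inter or not intra or max(intra) == 0:
        return float("inf")  # degenerate clustering: index is undefined
    return min(inter) / max(intra)


# Hypothetical toy corpus: two articles on clustering, two on stemming.
docs = [
    {"refs": {"r1", "r2"}, "authors": {"A", "B"}, "text": "fuzzy clustering of documents"},
    {"refs": {"r1", "r2", "r3"}, "authors": {"A"}, "text": "document clustering with fuzzy sets"},
    {"refs": {"r7", "r8"}, "authors": {"C"}, "text": "porter stemming algorithm"},
    {"refs": {"r8"}, "authors": {"C", "D"}, "text": "suffix stripping stemming algorithm"},
]
sim = combined_similarity(docs)
dist = [[1.0 - s for s in row] for row in sim]  # turn similarity into a distance

good = dunn_index(dist, [0, 0, 1, 1])  # groups the two topics together
bad = dunn_index(dist, [0, 1, 0, 1])   # mixes the two topics
print(good > bad)
```

A higher index indicates more compact, better separated clusters, so comparing the index across candidate labelings (or across different numbers of clusters) is one way to pick among them.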


Keywords: Similarity matrix, K-means, K-medoids, Fuzzy c-means, TF-IDF



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Dilip Singh Sisodia (1)
  • Manjula Choudhary (1)
  • Tummala Vandana (1)
  • Rishi Rai (1)
  1. National Institute of Technology Raipur, Raipur, India
