Text Document Clustering Using Community Discovery Approach

  • Conference paper
Distributed Computing and Internet Technology (ICDCIT 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11969)

Abstract

Document clustering is the automatic grouping of text documents into groups of similar documents. In a supervised setting the problem yields good results, whereas for unannotated data unsupervised machine learning approaches do not always perform well. Algorithms such as K-Means clustering are the most popular when class labels are not known. The objective of this work is to apply community discovery algorithms from the social network analysis literature to detect the underlying groups in text data.

We model the corpus as a graph whose nodes are the distinct non-trivial words of the whole corpus; an edge is added between two nodes if the corresponding words occur together in at least one document. The weight of the edge between two word nodes is the number of documents in which the two words co-occur. We apply the fast Louvain community discovery algorithm to detect communities. The challenge is to interpret the communities as classes. If the number of communities obtained is greater than the required number of classes, a technique for merging is proposed. Each document is assigned to the community that has the maximum number of similar words with it. The main thrust of the paper is to show a novel approach to document clustering using community discovery algorithms. The proposed algorithm is evaluated on a few benchmark data sets, and we find that it gives competitive results on the majority of the data sets when compared to standard clustering algorithms.
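The graph construction and the document assignment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word list, the whitespace tokenizer, and the function names `tokenize`, `build_cooccurrence_graph`, and `assign_documents` are hypothetical, and "similar words" is read here as exact shared tokens. The community detection step itself is left out; the communities are hard-coded where a Louvain implementation would normally supply them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; the abstract does not say how
# "non-trivial" words are selected.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Return the set of distinct lowercase alphabetic non-stop-word tokens."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS}


def build_cooccurrence_graph(documents):
    """Map each word pair (sorted alphabetically) to its edge weight.

    The weight is the number of documents in which both words occur.
    """
    weights = Counter()
    for text in documents:
        words = sorted(tokenize(text))
        for pair in combinations(words, 2):
            weights[pair] += 1
    return weights


def assign_documents(documents, communities):
    """Assign each document to the community sharing the most words with it.

    Ties go to the community with the lowest index (an assumption).
    """
    labels = []
    for text in documents:
        words = tokenize(text)
        overlaps = [len(words & community) for community in communities]
        labels.append(overlaps.index(max(overlaps)))
    return labels


docs = [
    "the striker scored a goal in the match",
    "the goal won the match",
    "the bank raised interest rates",
    "interest rates hit the bank profit",
]
graph = build_cooccurrence_graph(docs)

# Stand-in for the output of a Louvain run on `graph` (hard-coded assumption).
communities = [
    {"striker", "scored", "goal", "match", "won"},
    {"bank", "raised", "interest", "rates", "hit", "profit"},
]
labels = assign_documents(docs, communities)
print(graph[("goal", "match")], labels)
```

In this toy corpus the pair ("goal", "match") appears in two documents, so its edge weight is 2, and the first two documents fall into the first community while the last two fall into the second. The merging technique for surplus communities is not described in the abstract and is therefore not sketched.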

Author information

Corresponding author

Correspondence to S. Durga Bhavani.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Beniwal, A., Roy, G., Durga Bhavani, S. (2020). Text Document Clustering Using Community Discovery Approach. In: Hung, D., D'Souza, M. (eds) Distributed Computing and Internet Technology. ICDCIT 2020. Lecture Notes in Computer Science, vol 11969. Springer, Cham. https://doi.org/10.1007/978-3-030-36987-3_22

  • DOI: https://doi.org/10.1007/978-3-030-36987-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36986-6

  • Online ISBN: 978-3-030-36987-3

  • eBook Packages: Computer Science, Computer Science (R0)
