Representative Based Document Clustering

  • Arko Banerjee
  • Arun K. Pujari
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 27)


In this paper we propose a novel approach to document clustering by introducing a representative-based document similarity model. The model treats a document as an ordered sequence of words and partitions it into chunks in order to capture proximity information between words. Chunks are subsequences of a document that have low internal entropy and high boundary entropy. A chunk can be a phrase, a word, or a part of a word. We implement a linear-time unsupervised algorithm that segments a sequence of words into chunks. Chunks that occur frequently are taken as representatives of the document set. The representative-based document similarity model, which consists of a term-document matrix built over the representatives, is a compact representation of the vector space model that improves the quality of document clustering over traditional methods.
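
The idea of cutting a word sequence where the next word becomes hard to predict, and then counting frequent chunks per document, can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a simplified stand-in that uses only the entropy of the next-word distribution after a single word as the boundary signal. The function names and the `threshold` and `min_docs` parameters are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(docs):
    """Entropy (in bits) of the next-word distribution after each word."""
    followers = defaultdict(Counter)
    for words in docs:
        for cur, nxt in zip(words, words[1:]):
            followers[cur][nxt] += 1
    entropy = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        entropy[word] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def segment(words, entropy, threshold):
    """Cut after every word whose successor entropy exceeds the threshold.

    Inside a chunk the next word is predictable (low entropy); a chunk ends
    where the next word is unpredictable (high boundary entropy).
    """
    chunks, current = [], []
    for word in words:
        current.append(word)
        # Words never seen with a successor are treated as entropy 0.
        if entropy.get(word, 0.0) > threshold:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks


def representative_matrix(docs, threshold=0.9, min_docs=2):
    """Return (representatives, matrix); matrix[d][r] counts chunk r in doc d.

    A chunk is kept as a representative (assumed criterion) if it occurs
    in at least `min_docs` documents.
    """
    entropy = successor_entropy(docs)
    chunked = [segment(d, entropy, threshold) for d in docs]
    doc_freq = Counter(c for chunks in chunked for c in set(chunks))
    reps = sorted(c for c, n in doc_freq.items() if n >= min_docs)
    matrix = [[chunks.count(r) for r in reps] for chunks in chunked]
    return reps, matrix
```

Calling `representative_matrix` on a list of tokenized documents returns the selected chunks and one count vector per document; those vectors could then be compared with any standard similarity measure before clustering. A single pass over the words suffices for both counting and segmentation, which is in the spirit of the linear-time claim, although this sketch makes no attempt to match the paper's actual procedure.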


document clustering, sequence segmentation, word segmentation, entropy





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. College of Engineering and Management, Kolaghat, India
  2. University of Hyderabad, Hyderabad, India
