Representative Based Document Clustering
In this paper we propose a novel approach to document clustering: a representative-based document similarity model that treats a document as an ordered sequence of words and partitions it into chunks to capture proximity information between words. Chunks are subsequences of a document that have low internal entropy and high boundary entropy. A chunk can be a phrase, a word, or part of a word. We implement a linear-time unsupervised algorithm that segments a sequence of words into chunks. Chunks that occur frequently are taken as representatives of the document set. The representative-based document similarity model, which consists of a term-document matrix built over the representatives, is a compact representation of the vector space model that improves the quality of document clustering over traditional methods.
Keywords: document clustering, sequence segmentation, word segmentation, entropy
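The pipeline described in the abstract (segment into chunks, keep frequent chunks as representatives, build a matrix over them) can be illustrated with a minimal sketch. This is not the paper's algorithm: it approximates boundary entropy by the entropy of the next-word distribution after a single word, ignores internal entropy, and makes no claim to match the linear-time method the authors implement. The function names, the `threshold` cutoff, and the `min_docs` frequency cutoff are all assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(docs):
    """Entropy (in bits) of the next-word distribution for each word.

    Used here as a simple stand-in for boundary entropy (an assumption).
    """
    followers = defaultdict(Counter)
    for words in docs:
        for cur, nxt in zip(words, words[1:]):
            followers[cur][nxt] += 1
    entropy = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        entropy[word] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def segment(words, entropy, threshold):
    """Close a chunk after every word whose successor entropy exceeds threshold."""
    chunks, current = [], []
    for word in words:
        current.append(word)
        if entropy.get(word, 0.0) > threshold:
            chunks.append(tuple(current))
            current = []
    if current:  # flush the trailing chunk at the end of the document
        chunks.append(tuple(current))
    return chunks


def representative_matrix(docs, threshold=0.5, min_docs=2):
    """Return (representatives, matrix); matrix[i][j] counts rep j in doc i."""
    entropy = successor_entropy(docs)
    chunked = [Counter(segment(d, entropy, threshold)) for d in docs]
    # Document frequency: in how many documents each chunk appears.
    doc_freq = Counter(chunk for counts in chunked for chunk in counts)
    reps = sorted(c for c, df in doc_freq.items() if df >= min_docs)
    matrix = [[counts[r] for r in reps] for counts in chunked]
    return reps, matrix


docs = [
    "new york stock market falls".split(),
    "new york stock market rises".split(),
    "new york city council meets".split(),
]
reps, matrix = representative_matrix(docs)
print(reps)
print(matrix)
```

On this toy corpus the only words followed by an unpredictable successor are "york" and "market", so the cuts fall after them and the chunks shared by at least two documents are `("new", "york")` and `("stock", "market")`. The rows of `matrix` could then be passed to any standard clustering routine in place of a full word-level term-document matrix.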