Abstract
Knowledge workers are burdened with information overload. The information they need may be scattered across many places: buried in a file system, in their email, or on the web. Traditional clustering algorithms help assimilate these diverse sources of information and generate meaningful relationships among them. Typical clustering pre-processing involves tokenization, stop-word removal, stemming, and pruning. In this paper, we propose using a summary of a document, together with heuristics, as a pre-processing technique. This technique preserves the formatting of a document and uses that information to produce better clusters. In addition, only a summary of the document, rather than the whole document, is used as the basis for clustering. Clustering algorithms that applied the proposed pre-processing technique to formatted documents produced improved and more meaningful clusters.
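The abstract does not spell out the summarizer or the heuristics, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a simple frequency-based extractive summary and a single formatting heuristic (title terms count more than body terms). The names `tokenize`, `summarize`, `preprocess`, `STOP_WORDS`, and `title_weight` are hypothetical, and stemming and pruning are omitted for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order.

    A sentence's score is the mean document-wide frequency of its tokens.
    This stands in for whatever summarizer the paper actually uses.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


def preprocess(title, body, title_weight=3, max_sentences=2):
    """Build term weights from the summary only, then boost title terms.

    The title boost is a hypothetical example of a formatting heuristic.
    """
    weights = Counter(tokenize(summarize(body, max_sentences)))
    for term in tokenize(title):
        weights[term] += title_weight
    return weights


if __name__ == "__main__":
    body = (
        "Clustering groups similar documents. "
        "The weather was nice. "
        "Document clustering uses term vectors for documents."
    )
    print(summarize(body))
    print(preprocess("Document Clustering", body))
```

The resulting term weights could then be passed to any clustering algorithm in place of weights computed from the full document text.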
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Allamraju, S.H., Chun, R. (2009). Enhancing Document Clustering through Heuristics and Summary-Based Pre-processing. In: Salvendy, G., Smith, M.J. (eds) Human Interface and the Management of Information. Information and Interaction. Human Interface 2009. Lecture Notes in Computer Science, vol 5618. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02559-4_12
DOI: https://doi.org/10.1007/978-3-642-02559-4_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02558-7
Online ISBN: 978-3-642-02559-4
eBook Packages: Computer Science, Computer Science (R0)