Enhancing Document Clustering through Heuristics and Summary-Based Pre-processing

Allamraju, Sri Harsha; Chun, Robert

doi:10.1007/978-3-642-02559-4_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5618)

Included in the following conference series:

Symposium on Human Interface

Abstract

Knowledge workers are burdened with information overload. The information they need may be scattered across many places: buried in a file system, in their email, or on the web. Traditional clustering algorithms help assimilate these diverse sources of information and generate meaningful relationships among them. Typical clustering pre-processing involves tokenization, stop-word removal, stemming, and pruning. In this paper, we propose using a document's summary and heuristics as a pre-processing technique. The technique preserves the formatting of a document and uses that information to produce better clusters. In addition, only a summary of the document, rather than the whole document, is used as the basis for clustering. Clustering algorithms that applied the proposed pre-processing technique to formatted documents produced improved and more meaningful clusters.
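The abstract names the pre-processing steps but not their implementation, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy stop-word list, a crude suffix stripper standing in for a real stemmer, a frequency-based sentence scorer as the summarizer, and a hypothetical `title_weight` parameter as one example of a formatting heuristic. Pruning is omitted for brevity.

```python
import re
from collections import Counter

# Toy stop-word list (an assumption; real systems use a much larger list).
STOP_WORDS = frozenset({
    "a", "an", "the", "and", "or", "of", "in", "on", "to",
    "is", "are", "for", "with", "this", "that", "it",
})


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def summarize(sentences, k=2):
    """Keep the k sentences with the highest summed term frequency,
    returned in their original order (an assumed scoring rule)."""
    freqs = Counter(t for s in sentences for t in preprocess(s))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freqs[t] for t in preprocess(sentences[i])),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


def document_vector(title, sentences, k=2, title_weight=3):
    """Build a term-count vector from the summary only, counting title
    terms title_weight times (a hypothetical formatting heuristic)."""
    vector = Counter()
    for term in preprocess(title):
        vector[term] += title_weight
    for sentence in summarize(sentences, k):
        vector.update(preprocess(sentence))
    return vector


if __name__ == "__main__":
    body = [
        "Clustering groups similar documents.",
        "Lunch was fine.",
        "Summaries shorten documents before clustering.",
    ]
    print(document_vector("Clustering documents", body))
```

The resulting vectors could be passed to any clustering algorithm. Terms from sentences left out of the summary never reach the clusterer, which is the effect the abstract describes.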

Author information

Authors and Affiliations

Department of Computer Science, San Jose State University, San Jose, CA 95192, USA
Sri Harsha Allamraju & Robert Chun

Editor information

Editors and Affiliations

Purdue University, Grissom Hall, Room 263, 315 North Grant Street, West Lafayette, IN 47907-2023, USA
Gavriel Salvendy
Department of Industrial and Systems Engineering, University of Wisconsin, 459 Mechanical Engineering Building, 1513 University Avenue, Madison, WI 53706, USA
Michael J. Smith

About this paper

Cite this paper

Allamraju, S.H., Chun, R. (2009). Enhancing Document Clustering through Heuristics and Summary-Based Pre-processing. In: Salvendy, G., Smith, M.J. (eds) Human Interface and the Management of Information. Information and Interaction. Human Interface 2009. Lecture Notes in Computer Science, vol 5618. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02559-4_12

DOI: https://doi.org/10.1007/978-3-642-02559-4_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02558-7
Online ISBN: 978-3-642-02559-4
eBook Packages: Computer Science, Computer Science (R0)
