Abstract
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize the collection into clusters of related documents. Under a vector space model, text data can be treated as high-dimensional but sparse numerical vectors. Efficiently preprocessing and clustering very large document collections remains a contemporary challenge. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that the entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented. A highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
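To make the two ideas in the abstract concrete (documents as sparse vectors, and clustering that touches only nonzero entries), here is a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so the sketch assumes a cosine-similarity k-means variant, uses plain term counts rather than any particular term weighting, is single-threaded, and seeds centroids with evenly spaced documents. The function names `build_vectors`, `sparse_dot`, and `cluster` are hypothetical.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Map each document to a sparse unit-length vector (dict: term -> weight).

    Assumption: whitespace tokenization and raw term counts; a real
    pipeline would add stop-word removal and term weighting.
    """
    vectors = []
    for text in docs:
        counts = Counter(text.lower().split())
        norm = math.sqrt(sum(c * c for c in counts.values()))
        vectors.append({t: c / norm for t, c in counts.items()} if norm else {})
    return vectors


def sparse_dot(vec, centroid):
    """Dot product that visits only the nonzero terms of the document vector."""
    return sum(w * centroid.get(t, 0.0) for t, w in vec.items())


def cluster(vectors, k, iterations=10):
    """Cosine-similarity k-means over sparse vectors (illustrative stand-in)."""
    n = len(vectors)
    # Assumption: deterministic seeding with evenly spaced documents.
    centroids = [dict(vectors[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(k), key=lambda j: sparse_dot(v, centroids[j]))
            for v in vectors
        ]
        # Update step: sum the member vectors, then renormalize to unit length.
        sums = [{} for _ in range(k)]
        for v, j in zip(vectors, labels):
            for t, w in v.items():
                sums[j][t] = sums[j].get(t, 0.0) + w
        for j, s in enumerate(sums):
            norm = math.sqrt(sum(w * w for w in s.values()))
            if norm:  # keep the old centroid if a cluster became empty
                centroids[j] = {t: w / norm for t, w in s.items()}
    return labels


docs = [
    "sparse matrix solver",
    "sparse matrix factorization",
    "protein folding simulation",
    "protein folding dynamics",
]
print(cluster(build_vectors(docs), k=2))  # prints [0, 0, 1, 1]
```

Each iteration costs time proportional to k times the total number of nonzero entries across all document vectors, which illustrates how a sparsity-aware method can scale linearly with collection size for a fixed number of clusters and iterations.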
© 2001 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Dhillon, I.S., Fan, J., Guan, Y. (2001). Efficient Clustering of Very Large Document Collections. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_20
DOI: https://doi.org/10.1007/978-1-4615-1733-7_20
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-0114-7
Online ISBN: 978-1-4615-1733-7
eBook Packages: Springer Book Archive