Efficient Clustering of Very Large Document Collections

Dhillon, Inderjit S.; Fan, James; Guan, Yuqiang

doi:10.1007/978-1-4615-1733-7_20


Part of the book series: Massive Computing (MACO, volume 2)

Abstract

An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
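To make the idea of sparse document vectors concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, so the code uses a simple cosine-similarity k-means variant (often called spherical k-means). The whitespace tokenizer, raw term-frequency weighting, "first k documents" initialization, and all function names are hypothetical choices made for brevity. Each document is stored as a dictionary of its nonzero terms only, so similarity computations cost time proportional to the number of distinct terms in the document rather than to the vocabulary size.

```python
import math
from collections import Counter


def to_sparse_vector(text):
    """Map a document to a unit-length sparse term-frequency vector.

    Assumption: whitespace tokenization and raw term counts; a real
    pipeline would likely add stop-word removal and term weighting.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def sparse_dot(doc_vec, centroid):
    """Dot product that visits only the nonzero entries of doc_vec."""
    return sum(w * centroid.get(term, 0.0) for term, w in doc_vec.items())


def assign(doc_vecs, centroids):
    """Assign each document to the centroid with the highest cosine similarity."""
    return [
        max(range(len(centroids)), key=lambda j: sparse_dot(v, centroids[j]))
        for v in doc_vecs
    ]


def update_centroids(doc_vecs, labels, k):
    """Recompute each centroid as the normalized sum of its member vectors."""
    sums = [{} for _ in range(k)]
    for vec, j in zip(doc_vecs, labels):
        for term, w in vec.items():
            sums[j][term] = sums[j].get(term, 0.0) + w
    centroids = []
    for s in sums:
        norm = math.sqrt(sum(w * w for w in s.values()))
        centroids.append({t: w / norm for t, w in s.items()} if norm else {})
    return centroids


def cluster(docs, k, iterations=10):
    """Cluster documents; naive initialization uses the first k documents."""
    vecs = [to_sparse_vector(d) for d in docs]
    centroids = [dict(v) for v in vecs[:k]]
    labels = assign(vecs, centroids)
    for _ in range(iterations):
        centroids = update_centroids(vecs, labels, k)
        new_labels = assign(vecs, centroids)
        if new_labels == labels:  # assignments are stable, so stop early
            break
        labels = new_labels
    return labels


docs = [
    "sparse matrix solver",
    "protein folding simulation",
    "sparse matrix storage",
    "protein folding",
]
labels = cluster(docs, k=2)
```

In this toy run the two "sparse matrix" documents end up in one cluster and the two "protein folding" documents in the other. Each assignment pass touches every nonzero entry once per centroid, which is the kind of sparsity-aware cost the abstract's linear-time claim relies on, although the sketch makes no attempt to reproduce the paper's preprocessing or threading scheme.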

Editor information

Editors and Affiliations

University of Illinois, Chicago, USA
Robert L. Grossman
Lawrence Livermore National Laboratory, Livermore, USA
Chandrika Kamath
Sandia National Laboratories, Livermore, USA
Philip Kegelmeyer
Army High Performance Computing Research Center (AHPCRC), Minneapolis, USA
Vipin Kumar
Army Research Laboratory, Aberdeen Proving Ground, USA
Raju R. Namburu

About this chapter

Cite this chapter

Dhillon, I.S., Fan, J., Guan, Y. (2001). Efficient Clustering of Very Large Document Collections. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_20

DOI: https://doi.org/10.1007/978-1-4615-1733-7_20
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-0114-7
Online ISBN: 978-1-4615-1733-7
eBook Packages: Springer Book Archive