Part of the book series: Massive Computing ((MACO,volume 2))

Abstract

A valuable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize the collection into clusters of related documents. Under a vector space model, text data can be treated as high-dimensional but sparse numerical vectors. Efficiently preprocessing and clustering very large document collections remains a contemporary challenge. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that the entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
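To make the idea of clustering sparse document vectors concrete, here is a minimal single-threaded sketch. The abstract does not name the clustering algorithm or the term weighting, so the sketch assumes a cosine-similarity k-means variant over unit-normalized term-frequency vectors; the function names, the tokenizer, the seeding rule, and the toy documents are all hypothetical and are not taken from the chapter. Each document is stored as a dictionary of its nonzero terms only, so the cost of a similarity computation depends on the number of distinct terms in the document, not on the vocabulary size.

```python
import math
import re
from collections import Counter


def to_sparse_unit_vector(text):
    """Map a document to a sparse {term: weight} vector with unit L2 norm."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def sparse_dot(sparse_vec, centroid):
    """Dot product that touches only the nonzero entries of the document."""
    return sum(w * centroid.get(t, 0.0) for t, w in sparse_vec.items())


def assign(vectors, centroids):
    """Assign each document to the centroid with the highest cosine similarity."""
    return [
        max(range(len(centroids)), key=lambda j: sparse_dot(v, centroids[j]))
        for v in vectors
    ]


def update_centroids(vectors, labels, k):
    """Sum the member vectors of each cluster and renormalize to unit length."""
    sums = [Counter() for _ in range(k)]
    for v, j in zip(vectors, labels):
        for t, w in v.items():
            sums[j][t] += w
    centroids = []
    for s in sums:
        norm = math.sqrt(sum(w * w for w in s.values()))
        centroids.append({t: w / norm for t, w in s.items()} if norm else {})
    return centroids


def cluster(docs, k, iterations=10):
    """Cluster documents; assumption: naive seeding with the first k documents."""
    vectors = [to_sparse_unit_vector(d) for d in docs]
    centroids = [dict(v) for v in vectors[:k]]
    labels = assign(vectors, centroids)
    for _ in range(iterations):
        centroids = update_centroids(vectors, labels, k)
        new_labels = assign(vectors, centroids)
        if new_labels == labels:  # assignments stable: stop early
            break
        labels = new_labels
    return labels


# Hypothetical toy collection: two numerical-analysis and two biology abstracts.
docs = [
    "sparse matrix factorization for sparse linear systems",
    "protein folding and protein structure prediction",
    "iterative solvers for sparse linear systems",
    "protein structure determination by crystallography",
]
labels = cluster(docs, k=2)
```

On this toy input the two linear-systems documents end up in one cluster and the two protein documents in the other. Each pass over the collection does work proportional to the total number of nonzero entries times the number of clusters, which illustrates the kind of sparsity exploitation the abstract describes, though not necessarily the authors' implementation.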

© 2001 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Dhillon, I.S., Fan, J., Guan, Y. (2001). Efficient Clustering of Very Large Document Collections. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_20

  • DOI: https://doi.org/10.1007/978-1-4615-1733-7_20

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-0114-7

  • Online ISBN: 978-1-4615-1733-7

  • eBook Packages: Springer Book Archive
