Techniques for Clustering Massive Data Sets

  • Chapter
Clustering and Information Retrieval

Part of the book series: Network Theory and Applications (NETA, volume 11)

Abstract

The wealth of information embedded in huge databases belonging to corporations (e.g., retail, financial, telecom) has spurred tremendous interest in knowledge discovery and data mining. In data mining, clustering is a useful technique for discovering interesting distributions and patterns in the underlying data. The clustering problem can be defined as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters.
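
The problem statement above leaves the similarity measure and the partitioning method open. As a minimal sketch only, the following Python example assumes Euclidean distance as the metric and uses a greedy farthest-first choice of centers as one simple heuristic for producing k clusters; neither assumption is taken from the abstract, and the helper names (`farthest_first_centers`, `assign`, `max_intra_distance`, `min_inter_distance`) are hypothetical. It then checks the stated property on a toy data set: points inside a cluster are closer to each other than to points in other clusters.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two d-dimensional points (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first_centers(points, k):
    """Greedy heuristic (an assumption, not the chapter's method):
    start from the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(nxt)
    return centers


def assign(points, centers):
    """Partition the points into len(centers) clusters by nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        clusters[idx].append(p)
    return clusters


def max_intra_distance(clusters):
    """Largest distance between two points that share a cluster."""
    return max(
        (euclidean(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )


def min_inter_distance(clusters):
    """Smallest distance between two points in different clusters."""
    return min(
        (euclidean(p, q) for a, b in combinations(clusters, 2) for p in a for q in b),
        default=math.inf,
    )


# Toy input: n = 6 points in d = 2 dimensions, to be split into k = 2 clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = farthest_first_centers(points, k=2)
clusters = assign(points, centers)

# On well-separated data, within-cluster distances stay below
# between-cluster distances, which is the property the definition asks for.
print(max_intra_distance(clusters) < min_inter_distance(clusters))
```

On real data the two groups are rarely this well separated, so a practical formulation replaces the strict comparison with an objective to be optimized, and the choice of objective is exactly what distinguishes one clustering technique from another.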

Copyright information

© 2004 Kluwer Academic Publishers

About this chapter

Cite this chapter

Guha, S., Rastogi, R., Shim, K. (2004). Techniques for Clustering Massive Data Sets. In: Clustering and Information Retrieval. Network Theory and Applications, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0227-8_2

  • DOI: https://doi.org/10.1007/978-1-4613-0227-8_2

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-7949-2

  • Online ISBN: 978-1-4613-0227-8

  • eBook Packages: Springer Book Archive
